LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

About & Methodology

The LLM Weather Report tracks reasoning capability drift across large language models. Instead of benchmarking models against each other, it monitors whether individual models maintain consistent performance on a fixed set of reasoning tasks over time.

What This Tests

This tests the raw LLM endpoint, not an agent. Each prompt is a single API call to the model’s chat completions endpoint — no tool use, no multi-turn conversation, no retrieval, no agent scaffolding. The only system prompt is: “Answer the following question. Think step by step.”

This is intentional. We want to measure the model itself — its weights, its reasoning capability, its consistency. If a model starts failing the bat-and-ball problem, something changed in the model (weight updates, system prompt changes, quantization), not in the tooling layer around it. Agent frameworks, RAG pipelines, and tool use add their own variance. We strip all of that away to get a clean signal on the model’s raw reasoning.
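A minimal sketch of what one probe looks like, assuming an OpenAI-style chat completions payload. The function and parameter names here are illustrative, not the project's actual code:

```python
SYSTEM_PROMPT = "Answer the following question. Think step by step."

def build_request(model: str, prompt: str) -> dict:
    # The entire payload sent per probe: a single stateless call with
    # no tools, no prior turns, and no retrieval context.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    }
```

Because the payload is identical on every run, any change in the answers has to come from the model side, not the harness.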

Why

Model providers update their models constantly — weight updates, system prompt changes, infrastructure migrations, quantization changes. These changes can silently alter reasoning behavior. A model that correctly solved a logic puzzle yesterday might fail today. The LLM Weather Report catches these changes.

How It Works

Prompts

A fixed set of 7 reasoning prompts covers classic problems such as the bat-and-ball question and simple logic puzzles.

Prompts are deliberately simple. The goal is not to find the hardest problems but to establish a stable baseline that should always be answered correctly. When a model fails one of these, something changed.

Multi-Sample Evaluation

Each prompt is sent to each model multiple times (default: 2 samples per prompt per model). This distinguishes real drift from stochastic noise — if a model gets it right once and wrong once, that’s different from getting it wrong twice.
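The per-prompt sampling logic can be sketched as a three-way classification; the outcome labels here are illustrative, not the project's actual terminology:

```python
def sample_outcome(verdicts: list[bool]) -> str:
    """Classify a prompt's samples: stable pass, stable fail, or mixed."""
    # verdicts: one correctness verdict per sample (default: 2 samples)
    if all(verdicts):
        return "pass"
    if not any(verdicts):
        return "fail"
    # Mixed results look like stochastic noise, not clean drift.
    return "flaky"
```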

Judging

Each response is evaluated independently by a panel of 3 judge models, which vote on correctness and assign a numeric reasoning score.

Correctness is determined by majority vote across all judges × samples. Scores are averaged. This cross-model judging reduces bias from any single evaluator.
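A sketch of that aggregation step, assuming one (verdict, score) pair per judge-sample combination (3 judges × 2 samples = 6 entries); names are illustrative:

```python
def aggregate(verdicts: list[bool], scores: list[float]) -> tuple[bool, float]:
    # verdicts/scores: one entry per (judge, sample) pair.
    correct = sum(verdicts) > len(verdicts) / 2  # strict majority vote
    mean_score = sum(scores) / len(scores)       # averaged across all entries
    return correct, mean_score
```

A strict majority means a tie counts as incorrect, which is one reasonable reading of "majority vote"; the project may break ties differently.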

Drift Detection

Each run is compared against the previous run. Four types of drift are flagged:

Type        | Meaning                          | Severity
----------- | -------------------------------- | -----------------------------
REGRESSION  | Was correct, now incorrect       | High — capability loss
IMPROVEMENT | Was incorrect, now correct       | Positive signal
SCORE_DROP  | Still correct, score fell ≥ 1.0  | Warning — reasoning degrading
SCORE_RISE  | Still correct, score rose ≥ 1.0  | Positive signal

Model Status

Each model gets a per-run status summarizing its results against the previous run.

Schedule

Runs execute 6 times daily via GitHub Actions. Results are committed to the repository and deployed automatically.
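Six evenly spaced daily runs correspond to a cron interval of every four hours. A hypothetical GitHub Actions workflow fragment (job details omitted, names illustrative):

```yaml
# Fires at minute 0 of every 4th hour: 6 runs per day.
on:
  schedule:
    - cron: "0 */4 * * *"
```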

Models Tracked

The current contestant and judge models are configured in models.yaml in the project repository. Models without API keys configured are automatically skipped.
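An illustrative guess at the shape of models.yaml; the actual keys may differ, and only the skip-without-API-key behavior is taken from the description above:

```yaml
# Hypothetical structure -- see the repository for the real schema.
contestants:
  - name: example-model
    api_key_env: EXAMPLE_API_KEY  # model is skipped if this env var is unset
judges:
  - name: example-judge
```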

Data Access

All run data is published to the repository in multiple formats.

Source Code

Everything is open source at github.com/2389-research/llm-weather, including the prompt set, the judging logic, and the models.yaml configuration.

Built With

About

A 2389 Research project.