About & Methodology
The LLM Weather Report tracks reasoning capability drift across large language models. Instead of benchmarking models against each other, it monitors whether individual models maintain consistent performance on a fixed set of reasoning tasks over time.
What This Tests
This tests the raw LLM endpoint, not an agent. Each prompt is a single API call to the model’s chat completions endpoint — no tool use, no multi-turn conversation, no retrieval, no agent scaffolding. The only system prompt is: “Answer the following question. Think step by step.”
This is intentional. We want to measure the model itself — its weights, its reasoning capability, its consistency. If a model starts failing the bat-and-ball problem, something changed in the model (weight updates, system prompt changes, quantization), not in the tooling layer around it. Agent frameworks, RAG pipelines, and tool use add their own variance. We strip all of that away to get a clean signal on the model’s raw reasoning.
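Concretely, each probe is one stateless call with a fixed message pair. A minimal sketch of that shape (assuming LiteLLM's `completion()` API; the model name and question are placeholders, not the project's actual code):

```python
SYSTEM_PROMPT = "Answer the following question. Think step by step."

def build_messages(question: str) -> list[dict]:
    # One system prompt, one user turn -- no history, no tools, no retrieval.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

# The actual call would then look roughly like:
#   from litellm import completion
#   resp = completion(model="openai/gpt-4o", messages=build_messages(question))
```

Because every call carries identical context, any change in answers over time points at the model, not the harness.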
Why
Model providers update their models constantly — weight updates, system prompt changes, infrastructure migrations, quantization changes. These changes can silently alter reasoning behavior. A model that correctly solved a logic puzzle yesterday might fail today. The LLM Weather Report catches these changes.
How It Works
Prompts
A fixed set of 7 reasoning prompts covers:
- Logic — syllogistic and deductive reasoning
- Math — arithmetic and word problems
- Spatial reasoning — directional and relational questions
- Causality — cause-and-effect chain reasoning
- Code comprehension — reading and tracing code
- Ambiguity resolution — handling vague or underspecified questions
- Common sense — everyday physical and social reasoning
Prompts are deliberately simple. The goal is not to find the hardest problems but to establish a stable baseline that should always be answered correctly. When a model fails one of these, something changed.
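The prompt set lives in prompts.yaml. A hypothetical sketch of what such a file might look like (the field names and example questions here are assumptions, not the project's actual schema):

```yaml
# Hypothetical sketch of prompts.yaml -- field names are assumptions.
prompts:
  - id: logic-syllogism
    category: logic
    prompt: "All bloops are razzies. All razzies are lazzies. Are all bloops lazzies?"
  - id: math-bat-ball
    category: math
    prompt: "A bat and a ball cost $1.10 total. The bat costs $1.00 more than the ball. How much does the ball cost?"
```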
Multi-Sample Evaluation
Each prompt is sent to each model multiple times (default: 2 samples per prompt per model). This distinguishes real drift from stochastic noise — if a model gets it right once and wrong once, that’s different from getting it wrong twice.
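That noise-versus-drift distinction can be sketched as a small classifier over one prompt's samples (a hypothetical helper, not the project's actual code):

```python
def sample_verdict(sample_correct: list[bool]) -> str:
    """Classify one prompt's repeated samples for one model.

    Illustrative sketch: with the default 2 samples, a 1/2 split looks
    like stochastic noise ("flaky"), while 0/2 suggests a real change.
    """
    if all(sample_correct):
        return "pass"
    if not any(sample_correct):
        return "consistent_failure"
    return "flaky"
```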
Judging
Each response is evaluated independently by a panel of 3 judge models for:
- Correctness (boolean) — is the answer right?
- Reasoning quality (1–5 score) — how good is the reasoning?
Correctness is determined by majority vote across all judges × samples. Scores are averaged. This cross-model judging reduces bias from any single evaluator.
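The aggregation rule can be sketched as follows. One caveat worth hedging: with 3 judges × 2 samples there are 6 verdicts, so a 3–3 tie is possible; resolving a tie as incorrect is an assumption of this sketch, not documented behavior.

```python
from statistics import mean

def aggregate(verdicts: list[bool], scores: list[float]) -> tuple[bool, float]:
    """Majority vote on correctness, mean of the 1-5 quality scores.

    Illustrative sketch; ties (possible with an even judges x samples
    count) are treated as incorrect here -- an assumption.
    """
    correct = sum(verdicts) > len(verdicts) / 2
    return correct, mean(scores)
```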
Drift Detection
Each run is compared against the previous run. Four types of drift are flagged:
| Type | Meaning | Severity |
|---|---|---|
| REGRESSION | Was correct, now incorrect | High — capability loss |
| IMPROVEMENT | Was incorrect, now correct | Positive signal |
| SCORE_DROP | Still correct, score fell ≥1.0 | Warning — reasoning degrading |
| SCORE_RISE | Still correct, score rose ≥1.0 | Positive signal |
Model Status
Each model gets a per-run status:
- → stable — no drift detected
- ↑ up — improving (correctness gained or score rising)
- ↓ down — regressing (correctness lost or score dropping). Down overrides up — if a model both improved and regressed in the same run, it’s marked down.
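The precedence rule above (down beats up) can be sketched as a small reduction over a run's drift flags (illustrative, not the project's actual code):

```python
def model_status(drifts: list[str]) -> str:
    """Collapse one model's drift flags for a run into a single status."""
    down = {"REGRESSION", "SCORE_DROP"}
    up = {"IMPROVEMENT", "SCORE_RISE"}
    kinds = set(drifts)
    if kinds & down:   # down overrides up
        return "down"
    if kinds & up:
        return "up"
    return "stable"
```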
Schedule
Runs execute 6 times daily via GitHub Actions. Results are committed to the repository and deployed automatically.
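A cron trigger for "6 times daily" might look like the fragment below. The actual run times in weather.yml are an assumption here; only the frequency is documented.

```yaml
# Hypothetical sketch of the trigger in weather.yml.
on:
  schedule:
    - cron: "0 */4 * * *"   # every 4 hours = 6 runs per day (UTC)
  workflow_dispatch: {}      # allow manual runs
```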
Models Tracked
The current contestant and judge models are configured in models.yaml in the project repository. Models without API keys configured are automatically skipped.
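A hypothetical sketch of what models.yaml might contain (keys, model names, and the skip mechanism shown here are all assumptions, not the project's actual configuration):

```yaml
# Hypothetical sketch of models.yaml -- structure is assumed.
contestants:
  - model: openai/gpt-4o
    api_key_env: OPENAI_API_KEY     # model skipped if this env var is unset
judges:
  - model: anthropic/claude-3-5-sonnet-latest
    api_key_env: ANTHROPIC_API_KEY
```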
Data Access
All data is available in multiple formats:
- HTML — human-readable pages at each run URL
- Markdown — `/runs/<id>/report.md` for each run
- JSON — `/runs/<id>/data.json` with full scorecard, drift, status, and previous run data
- llms.txt — `/llms.txt`, a plain-text index of all runs
- Agent skill — `/skill.md`, instructions for AI agents consuming this data
- Source — full run data in the GitHub repository
Source Code
Everything is open source at github.com/2389-research/llm-weather. Key files:
- prompts.yaml — the 7 reasoning prompts
- models.yaml — contestant and judge model configuration
- runner.py — sends prompts to models via LiteLLM (`completion()` call, single system prompt)
- judge.py — evaluates each response for correctness and quality
- report.py — builds scorecard, detects drift, generates headlines
- weather.yml — GitHub Actions workflow (cron schedule)
- runs/ — raw JSON data for every run
Built With
- LiteLLM — unified API for all model providers
- Hugo — static site generation
- GitHub Actions — scheduled execution
- Netlify — hosting and deployment
About
A 2389 Research project.