LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

Latest Report

June 10, 2026 — 1:03 PM CT

gemini-2.5-flash dropped on common-sense-1. gemini-2.5-flash failing causality-1. gpt-5.4-mini recovering.

Drift Alerts

Model Status

Provider Status

Scorecard

Modelambiguity-1causality-1code-1common-sense-1logic-1math-1spatial-1
anthropic/claude-haiku-4-5✓ (4.33)✓ (4.67)✓ (4.67)✓ (3)✓ (5)✓ (5)✓ (5)
anthropic/claude-opus-4-6✓ (5)✓ (4.83)✓ (4.83)✓ (4.33)✓ (5)✓ (5)✓ (5)
anthropic/claude-sonnet-4-6✓ (4.33)✓ (4.83)✓ (4.5)✓ (3.5)✓ (5)✓ (5)✓ (5)
gemini/gemini-2.5-flash✓ (4.33)✗ (2.17)✓ (4.83)✓ (3)was 4.33✓ (4.83)✓ (5)✓ (5)
gemini/gemini-2.5-pro✓ (4.83)✓ (4.83)✓ (4.83)✓ (5)✓ (5)✓ (5)✓ (5)
ollama/llama3
openai/gpt-5.4✓ (4.5)✓ (5)✓ (4.67)✓ (4.33)✓ (4.67)✓ (4.67)✓ (5)
openai/gpt-5.4-mini✓ (5)✓ (4.83)✓ (4.67)✓ (4.33)✓ (5)✓ (5)✓ (5)was ✗ (3.5)

Score History

Past Reports

About This Project

Model providers push updates all the time. Weight tweaks, system prompt changes, quantization experiments, infrastructure swaps. Almost none of it gets announced. You just wake up one morning and GPT handles a logic puzzle differently than it did last week.

We got tired of not knowing when this happened. So we built something simple: six times a day, we send the same seven reasoning prompts to a handful of major models and score what comes back. Same prompts, same grading, every four hours. When something changes, we publish it here.

This isn’t a leaderboard. We don’t care which model is “best.” We care whether each model is the same as it was yesterday. Think of it as a weather report: not a competition, just a reading of current conditions.

We also test the raw endpoint, not an agent stack. One system prompt, one user message, one API call. No tools, no retrieval, no chain-of-thought scaffolding. If a model starts failing a basic syllogism, that’s the model, not some middleware bug.

Everything is open. Raw responses, judge verdicts, scorecard data, source code. If you want to check our work or run your own analysis, it’s all on GitHub.

For Agents

Stay Updated

Get notified when models drift. Join the 2389 mailing list for updates on this project and what we're building. We only use your email for project updates — no spam, unsubscribe anytime.