LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

gpt-5.4-mini is failing spatial-1 and gemini-2.5-flash is failing causality-1. claude-sonnet-4-6 and claude-haiku-4-5 are recovering on common-sense-1. gpt-5.4 scores are rising on math-1.

April 21, 2026 — 12:28 PM CT

Drift Alerts

SCORE_RISE openai/gpt-5.4 on math-1
IMPROVEMENT anthropic/claude-sonnet-4-6 on common-sense-1
IMPROVEMENT anthropic/claude-haiku-4-5 on common-sense-1
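
The alert labels above can be read as a comparison between a model's previous and current result on one prompt. The following is a minimal sketch of that reading, not the report's actual logic: the function name `classify_drift`, the rule that a fail-to-pass flip yields `IMPROVEMENT`, and the `min_delta` threshold for `SCORE_RISE` are all assumptions made for illustration.

```python
from typing import Optional


def classify_drift(
    prev_pass: bool,
    prev_score: float,
    cur_pass: bool,
    cur_score: float,
    min_delta: float = 1.0,  # hypothetical threshold, not taken from the report
) -> Optional[str]:
    """Return an alert label for one model/prompt pair, or None if unchanged.

    Assumed rules (illustrative only):
      - fail -> pass              => "IMPROVEMENT"
      - pass -> pass, score rises
        by at least min_delta     => "SCORE_RISE"
    """
    if not prev_pass and cur_pass:
        return "IMPROVEMENT"
    if prev_pass and cur_pass and (cur_score - prev_score) >= min_delta:
        return "SCORE_RISE"
    return None


# Values taken from the scorecard; the previous pass state for math-1 is assumed.
print(classify_drift(False, 2.75, True, 3.17))  # claude-haiku-4-5, common-sense-1
print(classify_drift(True, 3.8, True, 5.0))     # gpt-5.4, math-1
```

Downward alerts are not shown in this report, so the sketch deliberately leaves them out rather than guessing their names.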

Provider Status

OpenAI: Some users may encounter issues with GPT-5.4-C model in Codex
OpenAI: Some users will see higher error rates on Codex
OpenAI: Users may encounter issue with ChatGPT Business after upgrade or adding new seats for up to an hour
OpenAI: Elevated errors for ChatGPT conversations in Europe
OpenAI: Users unable to load ChatGPT, Codex and API Platform
Anthropic: Elevated errors for uploading files
Anthropic: Claude Sonnet 4.5 error spike
Anthropic: Elevated errors on Opus 4.6

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4.17)	✓ (4.5)	✓ (4.5)	✓ (3.17) was ✗ (2.75)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (4.83)	✓ (4.67)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.5)	✓ (4.83)	✓ (4.33)	✓ (3.67) was ✗ (3)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.5)	✗ (2.33)	✓ (4.83)	✓ (3.83)	✓ (4.83)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.5)	✓ (5)	✓ (4.5)	✓ (4.33)	✓ (5)	✓ (5)	✓ (4.83)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	✓ (4.33)	✓ (4.83)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (5) was 3.8	✓ (5)
openai/gpt-5.4-mini	✓ (4.67)	✓ (5)	✓ (4.67)	✓ (4.33)	✓ (4.83)	✓ (4.67)	✗ (3.67)
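
For programmatic use of the scorecard, each cell can be split into a pass/fail mark, a score, and an optional previous value introduced by `was`. The parser below is a minimal sketch based only on the cell shapes visible in this table; the `Cell` type and `parse_cell` function are hypothetical names, and the structured JSON export may already expose these fields directly.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: Optional[bool]       # None when the model was not evaluated
    score: Optional[float]
    prev_passed: Optional[bool]  # None when no previous mark is shown
    prev_score: Optional[float]


_NUM = r"\d+(?:\.\d+)?"
# Matches "✓ (4.17)", "✓ (3.17) was ✗ (2.75)" and "✓ (5) was 3.8",
# with or without a space before "was".
_CELL = re.compile(
    rf"^(?P<mark>[✓✗])\s*\((?P<score>{_NUM})\)"
    rf"(?:\s*was\s*(?:(?P<pmark>[✓✗])\s*\((?P<pscore>{_NUM})\)|(?P<pbare>{_NUM})))?$"
)


def parse_cell(text: str) -> Cell:
    """Parse one scorecard cell into its current and previous result."""
    text = text.strip()
    if text == "—":  # placeholder used for rows with no data
        return Cell(None, None, None, None)
    m = _CELL.match(text)
    if m is None:
        raise ValueError(f"unrecognised scorecard cell: {text!r}")
    prev_raw = m["pscore"] or m["pbare"]
    prev_passed = None if m["pmark"] is None else m["pmark"] == "✓"
    prev_score = float(prev_raw) if prev_raw is not None else None
    return Cell(m["mark"] == "✓", float(m["score"]), prev_passed, prev_score)


def parse_row(line: str) -> tuple[str, list[Cell]]:
    """Split a tab-separated scorecard row into model name and cells."""
    model, *cells = line.split("\t")
    return model, [parse_cell(c) for c in cells]
```

Keeping the previous mark separate from the previous score matters because the table shows both forms: some cells record a previous failure with its score, while others record only a previous score.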

Model Status

↑ anthropic/claude-haiku-4-5 up
→ anthropic/claude-opus-4-6 stable
↑ anthropic/claude-sonnet-4-6 up
→ gemini/gemini-2.5-flash stable
→ gemini/gemini-2.5-pro stable
↑ openai/gpt-5.4 up
→ openai/gpt-5.4-mini stable
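
In this report, every model marked "up" also has an upward drift alert, and every other listed model is marked "stable". A minimal sketch of that apparent relationship follows; it is an inference from this one report rather than the documented method, and `model_trends` and `UPWARD` are hypothetical names.

```python
# Alert kinds treated as upward movement (assumption based on this report only).
UPWARD = {"SCORE_RISE", "IMPROVEMENT"}


def model_trends(alerts: list[str], models: list[str]) -> dict[str, str]:
    """Map each model to "↑ up" if it has an upward alert, else "→ stable".

    Each alert line is expected in the form "<KIND> <provider/model> on <prompt>".
    """
    rising = set()
    for line in alerts:
        kind, model, _on, _prompt = line.split()
        if kind in UPWARD:
            rising.add(model)
    return {m: ("↑ up" if m in rising else "→ stable") for m in models}


alerts = [
    "SCORE_RISE openai/gpt-5.4 on math-1",
    "IMPROVEMENT anthropic/claude-sonnet-4-6 on common-sense-1",
    "IMPROVEMENT anthropic/claude-haiku-4-5 on common-sense-1",
]
models = ["anthropic/claude-opus-4-6", "openai/gpt-5.4", "openai/gpt-5.4-mini"]
for name, trend in model_trends(alerts, models).items():
    print(trend.split()[0], name, trend.split()[1])
```

Under this reading, a model that is currently failing a prompt can still show as stable if its result did not change since the previous run.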

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work