LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

gpt-5.4 lost logic-1, math-1, spatial-1, causality-1, code-1, ambiguity-1, and common-sense-1. gpt-5.4-mini lost math-1, causality-1, code-1, ambiguity-1, and common-sense-1. gpt-5.4-mini recovering. gemini-2.5-flash scores rising.

May 23, 2026 — 12:27 PM CT

Drift Alerts

REGRESSION openai/gpt-5.4 on logic-1
REGRESSION openai/gpt-5.4 on math-1
REGRESSION openai/gpt-5.4 on spatial-1
REGRESSION openai/gpt-5.4 on causality-1
REGRESSION openai/gpt-5.4 on code-1
REGRESSION openai/gpt-5.4 on ambiguity-1
REGRESSION openai/gpt-5.4 on common-sense-1
REGRESSION openai/gpt-5.4-mini on math-1
IMPROVEMENT openai/gpt-5.4-mini on spatial-1
REGRESSION openai/gpt-5.4-mini on causality-1
REGRESSION openai/gpt-5.4-mini on code-1
REGRESSION openai/gpt-5.4-mini on ambiguity-1
REGRESSION openai/gpt-5.4-mini on common-sense-1
SCORE_RISE gemini/gemini-2.5-flash on causality-1
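
The alert lines above share one shape: an alert kind, a provider/model pair, the word "on", and a test id. A minimal sketch of grouping them per model follows. It assumes that four-token shape holds for every line, and the names ALERTS, parse_alert, and group_alerts are hypothetical, not part of the report's tooling.

```python
from collections import defaultdict

# Alert lines copied from the Drift Alerts list in this report.
ALERTS = """\
REGRESSION openai/gpt-5.4 on logic-1
REGRESSION openai/gpt-5.4 on math-1
REGRESSION openai/gpt-5.4 on spatial-1
REGRESSION openai/gpt-5.4 on causality-1
REGRESSION openai/gpt-5.4 on code-1
REGRESSION openai/gpt-5.4 on ambiguity-1
REGRESSION openai/gpt-5.4 on common-sense-1
REGRESSION openai/gpt-5.4-mini on math-1
IMPROVEMENT openai/gpt-5.4-mini on spatial-1
REGRESSION openai/gpt-5.4-mini on causality-1
REGRESSION openai/gpt-5.4-mini on code-1
REGRESSION openai/gpt-5.4-mini on ambiguity-1
REGRESSION openai/gpt-5.4-mini on common-sense-1
SCORE_RISE gemini/gemini-2.5-flash on causality-1
"""


def parse_alert(line):
    """Split 'KIND provider/model on test-id' into (kind, model, test)."""
    # Assumption: every alert line has exactly these four tokens.
    kind, target, _on, test = line.split()
    return kind, target, test


def group_alerts(text):
    """Return {model: {kind: [test ids]}} for all non-empty lines."""
    grouped = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, target, test = parse_alert(line)
        grouped[target][kind].append(test)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {model: dict(kinds) for model, kinds in grouped.items()}


summary = group_alerts(ALERTS)
for model, kinds in summary.items():
    for kind, tests in kinds.items():
        print(f"{model}: {kind} x{len(tests)} ({', '.join(tests)})")
```

Grouping this way makes it easier to see at a glance which models account for most of the alerts in a run.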

Provider Status

OpenAI: Increase in users hitting Codex rate limits
OpenAI: Elevated latency and error rates for ChatGPT 5.5 Thinking
OpenAI: Elevated error rates on ChatGPT paid plans
Anthropic: Elevated errors on Claude Opus 4.7
Anthropic: Elevated error rate on multiple models
Anthropic: Elevated errors on Claude.ai

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4)	✓ (4.5)	✓ (4.5)	✓ (4)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (5)	✓ (4.75)	✓ (4.5)	✓ (4)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.75)	✓ (4.75)	✓ (4.75)	✓ (4.25)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.25)	✓ (5) was 3.83	✓ (5)	✓ (5)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.5)	✓ (4.75)	✓ (4.75)	✓ (5)	✓ (5)	✓ (5)	✓ (5)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	—	—	—	—	—	—	—
openai/gpt-5.4-mini	—	—	—	—	✓ (5)	—	—
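
Each scorecard cell packs a pass mark, a score in parentheses, and sometimes a "was" value into one string. A minimal sketch of unpacking one tab-separated row follows. It rests on assumptions: "—" is treated as "no score recorded", a "✗" mark for failures is guessed (no failing cell appears in this scorecard), and Cell, parse_cell, and parse_row are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: Optional[bool]     # None when the cell is "—"
    score: Optional[float]     # number inside the parentheses
    previous: Optional[float]  # value after "was", if present


# Matches e.g. "✓ (4.75)", "✓ (5)was 3.83", "✓ (5) was 3.83".
# Assumption: "✗" would mark a failing cell; none appears in this report.
CELL_RE = re.compile(
    r"^(?P<mark>[✓✗])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"\s*(?:was\s+(?P<prev>\d+(?:\.\d+)?))?$"
)


def parse_cell(text):
    """Parse one scorecard cell into a Cell tuple."""
    text = text.strip()
    if text == "—":
        # Assumption: a dash means no score was recorded for this test.
        return Cell(None, None, None)
    match = CELL_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized scorecard cell: {text!r}")
    prev = match.group("prev")
    return Cell(
        passed=match.group("mark") == "✓",
        score=float(match.group("score")),
        previous=float(prev) if prev is not None else None,
    )


def parse_row(line):
    """Split a tab-separated scorecard row into (model, [Cell, ...])."""
    model, *cells = line.split("\t")
    return model, [parse_cell(cell) for cell in cells]


row = ("gemini/gemini-2.5-flash\t✓ (4.25)\t✓ (5)was 3.83\t✓ (5)"
       "\t✓ (5)\t✓ (5)\t✓ (5)\t✓ (5)")
model, cells = parse_row(row)
for cell in cells:
    print(model, cell)
```

Keeping the previous value as its own field lets a reader compute the change for a cell (here, causality-1 for gemini-2.5-flash) without re-parsing the text.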

Model Status

→ anthropic/claude-haiku-4-5 stable
→ anthropic/claude-opus-4-6 stable
→ anthropic/claude-sonnet-4-6 stable
↑ gemini/gemini-2.5-flash up
→ gemini/gemini-2.5-pro stable
↓ openai/gpt-5.4-mini down

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work