LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

gemini-2.5-flash regressed on causality-1. gpt-5.4-mini dropped on spatial-1 and is failing it.

May 29, 2026 — 8:43 PM CT

Drift Alerts

SCORE_DROP openai/gpt-5.4-mini on spatial-1
REGRESSION gemini/gemini-2.5-flash on causality-1

Provider Status

OpenAI: Users May Experience Issues Accessing ChatGPT
OpenAI: Users may encounter issues with conversations
OpenAI: Users may encounter issues logging in or creating an account
OpenAI: Business plan subscription checkout issues on web and mobile web
OpenAI: Codex Context Compaction Latency
OpenAI: Subscription checkout failing
OpenAI: Android ChatGPT Business users having trouble switching workspaces
Anthropic: Elevated errors for Claude Opus 4.8
Anthropic: Elevated errors on Claude Opus 4.8
Anthropic: Billing and subscription management issues
Anthropic: Elevated errors on Claude Opus 4.7

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4.17)	✓ (4.67)	✓ (5)	✓ (3.17)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (5)	✓ (4.67)	✓ (4.83)	✓ (4.5)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.33)	✓ (4.83)	✓ (4.67)	✓ (3.67)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.67)	✗ (3.5) was ✓ (5)	✓ (4.67)	✓ (4.83)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.33)	✓ (4.83)	✓ (4.67)	✓ (4.83)	✓ (5)	✓ (5)	✓ (5)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	✓ (4.5)	✓ (4.83)	✓ (4.67)	✓ (4.33)	✓ (4.67)	✓ (5)	✓ (5)
openai/gpt-5.4-mini	✓ (4.67)	✓ (4.67)	✓ (4.67)	✓ (4.5)	✓ (5)	✓ (5)	✗ (2) was 3.83
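
For readers who want to process the scorecard as text, here is a minimal Python sketch of how one cell might be parsed and mapped to an alert type. It is not the report's own implementation. The names `Cell`, `parse_cell`, and `classify` are hypothetical, and the rule used (REGRESSION when a previously passing prompt now fails, SCORE_DROP when the score fell without a recorded pass-to-fail change) is an assumption inferred from the two alerts shown in this report.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: bool
    score: float
    prev_passed: Optional[bool]   # None when no previous pass/fail mark is shown
    prev_score: Optional[float]   # None when the cell has no "was ..." part


# Accepts "✓ (4.17)", "✗ (3.5)was ✓ (5)", and "✗ (2) was 3.83".
CELL_RE = re.compile(
    r"^(?P<mark>[✓✗])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"(?:\s*was\s*(?P<pmark>[✓✗])?\s*\(?(?P<pscore>\d+(?:\.\d+)?)\)?)?$"
)


def parse_cell(text: str) -> Optional[Cell]:
    """Parse one scorecard cell; return None for an empty cell ("—")."""
    text = text.strip()
    if text in ("", "—"):
        return None
    m = CELL_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized cell: {text!r}")
    prev_passed = (m["pmark"] == "✓") if m["pmark"] else None
    prev_score = float(m["pscore"]) if m["pscore"] else None
    return Cell(m["mark"] == "✓", float(m["score"]), prev_passed, prev_score)


def classify(cell: Optional[Cell]) -> Optional[str]:
    """Assumed rule: pass-to-fail is REGRESSION, any other decrease is SCORE_DROP."""
    if cell is None or cell.prev_score is None:
        return None
    if cell.prev_passed and not cell.passed:
        return "REGRESSION"
    if cell.score < cell.prev_score:
        return "SCORE_DROP"
    return None


if __name__ == "__main__":
    for raw in ["✓ (4.17)", "✗ (3.5)was ✓ (5)", "✗ (2)was 3.83", "—"]:
        print(repr(raw), "->", classify(parse_cell(raw)))
```

Under this assumed rule, the gemini-2.5-flash causality-1 cell classifies as REGRESSION and the gpt-5.4-mini spatial-1 cell as SCORE_DROP, matching the Drift Alerts section. The actual thresholds and pass criteria are described in the linked Methodology, not here.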

Model Status

→ anthropic/claude-haiku-4-5 stable
→ anthropic/claude-opus-4-6 stable
→ anthropic/claude-sonnet-4-6 stable
↓ gemini/gemini-2.5-flash down
→ gemini/gemini-2.5-pro stable
→ openai/gpt-5.4 stable
↓ openai/gpt-5.4-mini down

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work