LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

gpt-5.4-mini dropped on spatial-1 (2.33, was 3.83) and is now failing it. gemini-2.5-flash dropped on causality-1 (3.83, was 5) and rose on common-sense-1 (4.33, was 3).

May 13, 2026 — 8:49 AM CT

Drift Alerts

SCORE_DROP openai/gpt-5.4-mini on spatial-1
SCORE_DROP gemini/gemini-2.5-flash on causality-1
SCORE_RISE gemini/gemini-2.5-flash on common-sense-1
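
The alerts above can be read as a comparison of each score against the previous run for the same model and prompt. A minimal Python sketch of that idea follows. The report does not state the actual alert rule, so the 1.0-point threshold and the names `Score` and `drift_alerts` are assumptions for illustration. The three changed scores are taken from the scorecard; the unchanged row assumes the previous value equals the current one.

```python
from dataclasses import dataclass

# Hypothetical threshold: the report does not state the real alert rule.
DRIFT_THRESHOLD = 1.0


@dataclass(frozen=True)
class Score:
    model: str
    prompt: str
    current: float
    previous: float


def drift_alerts(scores, threshold=DRIFT_THRESHOLD):
    """Return alert strings for scores that moved by at least `threshold`."""
    alerts = []
    for s in scores:
        delta = s.current - s.previous
        if delta <= -threshold:
            alerts.append(f"SCORE_DROP {s.model} on {s.prompt}")
        elif delta >= threshold:
            alerts.append(f"SCORE_RISE {s.model} on {s.prompt}")
    return alerts


scores = [
    Score("openai/gpt-5.4-mini", "spatial-1", 2.33, 3.83),
    Score("gemini/gemini-2.5-flash", "causality-1", 3.83, 5.0),
    Score("gemini/gemini-2.5-flash", "common-sense-1", 4.33, 3.0),
    # Unchanged score; the previous value is assumed equal to the current one.
    Score("openai/gpt-5.4", "math-1", 5.0, 5.0),
]

for alert in drift_alerts(scores):
    print(alert)
```

Under this assumed threshold the sketch yields the same three alerts listed above and stays silent for the unchanged score.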

Provider Status

OpenAI Codex 5.5 engines are experiencing high error rate
OpenAI Realtime API - SIP/WebRTC flow are down
OpenAI Elevated error rates with GPT 5.5
Anthropic Claude.ai is experiencing elevated error rates
Anthropic Elevated errors on Claude Opus 4.7
Anthropic Elevated errors for Claude Sonnet 4.6 and Haiku 4.5
Anthropic Elevated Error Rate for Vaults and Credentials

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4.33)	✓ (4.67)	✓ (4.67)	✓ (3.17)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (5)	✓ (4.83)	✓ (4.67)	✓ (4.33)	✓ (4.83)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.5)	✓ (5)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.83)	✓ (3.83, was 5)	✓ (4.67)	✓ (4.33, was 3)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.33)	✓ (4.83)	✓ (4.67)	✓ (5)	✓ (5)	✓ (5)	✓ (5)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	✓ (4.5)	✓ (4.83)	✓ (4.67)	✓ (4.5)	✓ (4.67)	✓ (5)	✓ (5)
openai/gpt-5.4-mini	✓ (4.5)	✓ (4.5)	✓ (4.83)	✓ (4.5)	✓ (4.67)	✓ (4.67)	✗ (2.33, was 3.83)
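
For readers who want to process the scorecard text directly, here is a rough Python sketch of parsing one cell into a pass flag, a score, and an optional previous score. The names `Cell` and `parse_cell` are hypothetical, and the cell grammar is inferred only from the rows above. The structured JSON listed under Raw Data is likely the better source if its schema is known.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: bool
    score: float
    previous: Optional[float]  # set only when the cell carries a "was N" note


_NUM = r"\d+(?:\.\d+)?"
# Accepts both "(3.83)was 5" and "(3.83, was 5)" spellings of a changed score.
_CELL = re.compile(rf"([✓✗])\s*\(({_NUM})\)?,?\s*(?:was\s+({_NUM}))?")

NO_DATA = "\u2014"  # dash placeholder used for models with no results


def parse_cell(text):
    """Parse one scorecard cell; return None when the cell has no data."""
    text = text.strip()
    if text == NO_DATA:
        return None
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized scorecard cell: {text!r}")
    mark, score, previous = match.groups()
    return Cell(
        passed=(mark == "✓"),
        score=float(score),
        previous=float(previous) if previous else None,
    )


print(parse_cell("✗ (2.33, was 3.83)"))
print(parse_cell("✓ (5)"))
```

Splitting a full row on tab characters and passing each field after the model name to `parse_cell` would give one `Cell` per prompt.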

Model Status

→ anthropic/claude-haiku-4-5 stable
→ anthropic/claude-opus-4-6 stable
→ anthropic/claude-sonnet-4-6 stable
↓ gemini/gemini-2.5-flash down
→ gemini/gemini-2.5-pro stable
→ openai/gpt-5.4 stable
↓ openai/gpt-5.4-mini down

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work