LLM Weather Report

Tracking raw LLM reasoning drift — pure endpoint, no agents

gpt-5.4-mini lost spatial-1; gemini-2.5-pro lost causality-1 and code-1; gemini-2.5-flash dropped on causality-1 and is now failing it.

June 4, 2026 — 1:36 AM CT

Drift Alerts

REGRESSION openai/gpt-5.4-mini on spatial-1
REGRESSION gemini/gemini-2.5-pro on causality-1
REGRESSION gemini/gemini-2.5-pro on code-1
SCORE_DROP gemini/gemini-2.5-flash on causality-1
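For readers who want to consume these alerts programmatically, the lines above follow a regular `KIND provider/model on prompt-id` shape. A minimal sketch of a parser for that shape follows. The names `DriftAlert` and `parse_alert` are hypothetical and not part of the report's tooling, and the sketch assumes only the line format visible above, not any particular meaning for REGRESSION or SCORE_DROP.

```python
import re
from typing import NamedTuple, Optional


class DriftAlert(NamedTuple):
    """One alert line, split into its visible parts (hypothetical type)."""
    kind: str       # e.g. "REGRESSION" or "SCORE_DROP"
    provider: str   # e.g. "openai"
    model: str      # e.g. "gpt-5.4-mini"
    prompt_id: str  # e.g. "spatial-1"


# Assumed line shape: "<KIND> <provider>/<model> on <prompt-id>"
_ALERT_RE = re.compile(
    r"^(?P<kind>[A-Z_]+)\s+(?P<provider>[^/\s]+)/(?P<model>\S+)"
    r"\s+on\s+(?P<prompt>\S+)$"
)


def parse_alert(line: str) -> Optional[DriftAlert]:
    """Return a DriftAlert, or None if the line does not match the shape."""
    m = _ALERT_RE.match(line.strip())
    if m is None:
        return None
    return DriftAlert(m["kind"], m["provider"], m["model"], m["prompt"])


if __name__ == "__main__":
    lines = [
        "REGRESSION openai/gpt-5.4-mini on spatial-1",
        "REGRESSION gemini/gemini-2.5-pro on causality-1",
        "REGRESSION gemini/gemini-2.5-pro on code-1",
        "SCORE_DROP gemini/gemini-2.5-flash on causality-1",
    ]
    for alert in map(parse_alert, lines):
        print(alert)
```

Returning `None` for non-matching lines, rather than raising, lets a caller skip headings or blank lines when scanning the whole report.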

Provider Status

OpenAI: Increased latency for Codex compaction for a subset of users
OpenAI: Elevated error rates on Codex, ChatGPT and Responses API
OpenAI: Elevated errors for ChatGPT Pro
OpenAI: codex-gpt-image-2-does-not-exist-errors
OpenAI: Guest users are experiencing elevated error rate when using ChatGPT conversations
Anthropic: Elevated errors on Opus 4.7
Anthropic: Issue affecting some Claude Code services
Anthropic: Elevated errors on multiple models

Scorecard

Model	ambiguity-1	causality-1	code-1	common-sense-1	logic-1	math-1	spatial-1
anthropic/claude-haiku-4-5	✓ (4.33)	✓ (4.67)	✓ (4.33)	✓ (3.33)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-opus-4-6	✓ (5)	✓ (4.83)	✓ (4.67)	✓ (4.67)	✓ (5)	✓ (5)	✓ (5)
anthropic/claude-sonnet-4-6	✓ (4.83)	✓ (4.83)	✓ (4.5)	✓ (4.33)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-flash	✓ (4.67)	✗ (2) was 3.33	✓ (4.67)	✓ (4)	✓ (5)	✓ (5)	✓ (5)
gemini/gemini-2.5-pro	✓ (4.67)	—	—	✓ (4.83)	✓ (5)	✓ (5)	✓ (5)
ollama/llama3	—	—	—	—	—	—	—
openai/gpt-5.4	✓ (4.5)	✓ (5)	✓ (4.67)	✓ (4.33)	✓ (5)	✓ (5)	✓ (4.33)
openai/gpt-5.4-mini	✓ (4.5)	✓ (4.6)	✓ (4.67)	✓ (4.5)	✓ (4.83)	✓ (5)	✗ (2.67) was ✓ (5)
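
Each scorecard cell packs up to three pieces of information: a pass/fail mark, a score in parentheses, and sometimes a previous value after "was". A rough sketch of how a cell and a tab-separated row might be decoded is shown here. `Cell`, `parse_cell`, and `parse_row` are hypothetical names, and the sketch assumes only the cell shapes visible in the table (including "—" for no result). It makes no claim about how the report itself computes pass or fail.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Cell(NamedTuple):
    """Decoded scorecard cell (hypothetical type)."""
    passed: Optional[bool]           # None when the cell has no result
    score: Optional[float]
    previous_score: Optional[float]  # value after "was", if present


# Assumed shapes: "✓ (4.33)", "✗ (2) was 3.33", "✗ (2.67) was ✓ (5)".
# Whitespace before "was" is optional so both spacings are accepted.
_CELL_RE = re.compile(
    r"^(?P<mark>[✓✗])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"(?:\s*,?\s*was\s+(?:[✓✗]\s*)?\(?(?P<prev>\d+(?:\.\d+)?)\)?)?$"
)


def parse_cell(text: str) -> Cell:
    """Decode one cell; "—" or an empty cell means no result was recorded."""
    text = text.strip()
    if text in ("—", ""):
        return Cell(None, None, None)
    m = _CELL_RE.match(text)
    if m is None:
        raise ValueError(f"unrecognized scorecard cell: {text!r}")
    prev = m["prev"]
    return Cell(
        passed=(m["mark"] == "✓"),
        score=float(m["score"]),
        previous_score=float(prev) if prev is not None else None,
    )


def parse_row(header: List[str], line: str) -> Tuple[str, Dict[str, Cell]]:
    """Split a tab-separated row into (model, {prompt_id: Cell})."""
    model, *cells = line.split("\t")
    return model, {name: parse_cell(c) for name, c in zip(header, cells)}


if __name__ == "__main__":
    header = ["ambiguity-1", "causality-1"]
    print(parse_row(header, "gemini/gemini-2.5-flash\t✓ (4.67)\t✗ (2) was 3.33"))
```

Keeping "no result" as `None` instead of a zero score avoids treating a missing run, such as the ollama/llama3 row, as a failure.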

Model Status

→ anthropic/claude-haiku-4-5 stable
→ anthropic/claude-opus-4-6 stable
→ anthropic/claude-sonnet-4-6 stable
↓ gemini/gemini-2.5-flash down
↓ gemini/gemini-2.5-pro down
→ openai/gpt-5.4 stable
↓ openai/gpt-5.4-mini down

Raw Data

Detail log — full responses and judge verdicts per prompt
JSON — structured data for programmatic access
Markdown — plain text report
responses.json — raw model outputs
judgments.json — raw judge verdicts
run.log — debug log
Agent Skill — how to read and interpret this data
Methodology — how evaluations work