gpt-5.4 lost logic-1, math-1, spatial-1, causality-1, code-1, ambiguity-1, and common-sense-1. gpt-5.4-mini lost math-1, causality-1, code-1, ambiguity-1, and common-sense-1, and is recovering on spatial-1. gemini-2.5-flash scores are rising on causality-1.
May 23, 2026 — 12:27 PM CT
Drift Alerts
- REGRESSION openai/gpt-5.4 on logic-1
- REGRESSION openai/gpt-5.4 on math-1
- REGRESSION openai/gpt-5.4 on spatial-1
- REGRESSION openai/gpt-5.4 on causality-1
- REGRESSION openai/gpt-5.4 on code-1
- REGRESSION openai/gpt-5.4 on ambiguity-1
- REGRESSION openai/gpt-5.4 on common-sense-1
- REGRESSION openai/gpt-5.4-mini on math-1
- IMPROVEMENT openai/gpt-5.4-mini on spatial-1
- REGRESSION openai/gpt-5.4-mini on causality-1
- REGRESSION openai/gpt-5.4-mini on code-1
- REGRESSION openai/gpt-5.4-mini on ambiguity-1
- REGRESSION openai/gpt-5.4-mini on common-sense-1
- SCORE_RISE gemini/gemini-2.5-flash on causality-1
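The alert lines above share one shape: a kind, a `provider/model` name, the word `on`, and a prompt id. As a rough sketch of how they could be tallied per model, assuming that shape holds for every alert, the following may help. The names `parse_alert` and `tally_alerts` are hypothetical helpers and are not part of the report's tooling.

```python
import re
from collections import Counter

# Alert line shape as it appears in this report:
#   "- KIND provider/model on prompt-id"
# This pattern is an assumption inferred from the lines shown above.
ALERT_RE = re.compile(
    r"^-\s+(?P<kind>[A-Z_]+)\s+(?P<model>\S+)\s+on\s+(?P<prompt>\S+)$"
)


def parse_alert(line):
    """Return (kind, model, prompt) for one alert line, or None if it does not match."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    return m.group("kind"), m.group("model"), m.group("prompt")


def tally_alerts(lines):
    """Count alerts per (model, kind) pair, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        parsed = parse_alert(line)
        if parsed is not None:
            kind, model, _prompt = parsed
            counts[(model, kind)] += 1
    return counts
```

Applied to the 14 alerts in this report, the tally would come to 7 regressions for openai/gpt-5.4, 5 regressions and 1 improvement for openai/gpt-5.4-mini, and 1 score rise for gemini/gemini-2.5-flash.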
Provider Status
Scorecard
| Model | ambiguity-1 | causality-1 | code-1 | common-sense-1 | logic-1 | math-1 | spatial-1 |
|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | ✓ (4) | ✓ (4.5) | ✓ (4.5) | ✓ (4) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-opus-4-6 | ✓ (5) | ✓ (4.75) | ✓ (4.5) | ✓ (4) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-sonnet-4-6 | ✓ (4.75) | ✓ (4.75) | ✓ (4.75) | ✓ (4.25) | ✓ (5) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-flash | ✓ (4.25) | ✓ (5), was 3.83 | ✓ (5) | ✓ (5) | ✓ (5) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-pro | ✓ (4.5) | ✓ (4.75) | ✓ (4.75) | ✓ (5) | ✓ (5) | ✓ (5) | ✓ (5) |
| ollama/llama3 | — | — | — | — | — | — | — |
| openai/gpt-5.4 | — | — | — | — | — | — | — |
| openai/gpt-5.4-mini | — | — | — | — | ✓ (5) | — | — |
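For readers who want to work with the scorecard as data, here is a minimal sketch of parsing one table row. It assumes each scored cell starts with a one-character mark followed by a number in parentheses, as in the rows above, and treats any other cell as missing. `parse_cell` and `parse_row` are hypothetical names, and what the mark and the score scale mean is left to the Methodology page.

```python
import re

# A scored cell looks like "<mark> (<number>)", optionally followed by a note
# about the previous score. Anything else is treated as "no result".
CELL_RE = re.compile(r"^(?P<mark>\S)\s*\((?P<score>\d+(?:\.\d+)?)\)")


def parse_cell(cell):
    """Return (mark, score) for a scored cell, or None for an empty one."""
    m = CELL_RE.match(cell.strip())
    if m is None:
        return None
    return m.group("mark"), float(m.group("score"))


def parse_row(row):
    """Split one pipe-delimited scorecard row into (model, [parsed cells])."""
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    return cells[0], [parse_cell(c) for c in cells[1:]]
```

Returning `None` for unscored cells keeps missing results distinct from a low score, which matters for rows such as openai/gpt-5.4 where no cell carries a number.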
Model Status
- → anthropic/claude-haiku-4-5 stable
- → anthropic/claude-opus-4-6 stable
- → anthropic/claude-sonnet-4-6 stable
- ↑ gemini/gemini-2.5-flash up
- → gemini/gemini-2.5-pro stable
- ↓ openai/gpt-5.4-mini down
Raw Data
- Detail log — full responses and judge verdicts per prompt
- JSON — structured data for programmatic access
- Markdown — plain text report
- responses.json — raw model outputs
- judgments.json — raw judge verdicts
- run.log — debug log
- Agent Skill — how to read and interpret this data
- Methodology — how evaluations work