Latest Report
gpt-5.4-mini lost spatial-1. gemini-2.5-flash dropped on causality-1 and common-sense-1, and is now failing causality-1.
Drift Alerts
- REGRESSION openai/gpt-5.4-mini on spatial-1
- SCORE_DROP gemini/gemini-2.5-flash on causality-1
- SCORE_DROP gemini/gemini-2.5-flash on common-sense-1
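The two alert kinds above appear to follow a simple rule, judging only from the scorecard on this page: REGRESSION when a test that previously passed now fails, and SCORE_DROP when the score falls without a pass-to-fail flip. The sketch below illustrates that reading. The `Result` type, the `classify_drift` function, and the `min_drop` threshold of 0.5 are all hypothetical names and values chosen for illustration; the page does not document the actual rule or threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """One model's outcome on one test (hypothetical shape)."""
    passed: bool
    score: float


def classify_drift(prev: Result, curr: Result, min_drop: float = 0.5) -> Optional[str]:
    """Return an alert kind for a pair of consecutive runs, or None.

    Assumption: a pass-to-fail flip takes priority over a plain score drop.
    The min_drop default is an illustrative guess, not a documented value.
    """
    if prev.passed and not curr.passed:
        return "REGRESSION"
    if prev.score - curr.score >= min_drop:
        return "SCORE_DROP"
    return None


# Values taken from the scorecard on this page.
spatial = classify_drift(Result(True, 5.0), Result(False, 3.67))
common_sense = classify_drift(Result(True, 5.0), Result(True, 4.0))
unchanged = classify_drift(Result(True, 5.0), Result(True, 5.0))
```

Under these assumptions, `spatial` is classified as a regression, `common_sense` as a score drop, and `unchanged` produces no alert.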
Model Status
- → anthropic/claude-haiku-4-5 stable
- → anthropic/claude-opus-4-6 stable
- → anthropic/claude-sonnet-4-6 stable
- ↓ gemini/gemini-2.5-flash down
- → gemini/gemini-2.5-pro stable
- → openai/gpt-5.4 stable
- ↓ openai/gpt-5.4-mini down
Scorecard
| Model | ambiguity-1 | causality-1 | code-1 | common-sense-1 | logic-1 | math-1 | spatial-1 |
|---|---|---|---|---|---|---|---|
| anthropic/claude-haiku-4-5 | ✓ (4.33) | ✓ (4.67) | ✓ (4.67) | ✓ (3.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-opus-4-6 | ✓ (5) | ✓ (4.67) | ✓ (4.83) | ✓ (4.33) | ✓ (5) | ✓ (5) | ✓ (5) |
| anthropic/claude-sonnet-4-6 | ✓ (4.5) | ✓ (4.83) | ✓ (4.67) | ✓ (4) | ✓ (5) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-flash | ✓ (4.5) | ✗ (2.33), was 3.33 | ✓ (4.67) | ✓ (4), was 5 | ✓ (4.83) | ✓ (5) | ✓ (5) |
| gemini/gemini-2.5-pro | ✓ (4.83) | ✓ (5) | ✓ (4.83) | ✓ (5) | ✓ (5) | ✓ (5) | ✓ (5) |
| ollama/llama3 | — | — | — | — | — | — | — |
| openai/gpt-5.4 | ✓ (4.33) | ✓ (4.67) | ✓ (4.67) | ✓ (4.33) | ✓ (4.83) | ✓ (4.67) | ✓ (5) |
| openai/gpt-5.4-mini | ✓ (4.67) | ✓ (4.83) | ✓ (4.67) | ✓ (4.33) | ✓ (4.67) | ✓ (5) | ✗ (3.67), was ✓ (5) |
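For readers who want to process the scorecard programmatically, here is a minimal sketch of parsing a single cell. It assumes the cell shapes visible in the table: a pass or fail mark, a score in parentheses, an optional "was" annotation carrying the previous score, and a dash for a model with no results. `Cell` and `parse_cell` are hypothetical names, not part of any published tooling for this page.

```python
import re
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    passed: Optional[bool]          # None when the model has no result
    score: Optional[float]
    previous_score: Optional[float]  # from the "was ..." annotation, if any


EMPTY = "\u2014"  # the dash used for models with no results

# Mark, score in parentheses, then an optional "was" annotation whose
# previous value may or may not carry its own mark and parentheses.
_CELL = re.compile(
    r"^(?P<mark>[\u2713\u2717])\s*\((?P<score>\d+(?:\.\d+)?)\)"
    r"(?:\s*,?\s*was\s*(?:[\u2713\u2717]\s*)?\(?(?P<prev>\d+(?:\.\d+)?)\)?)?$"
)


def parse_cell(text: str) -> Cell:
    """Parse one scorecard cell into a Cell tuple."""
    text = text.strip()
    if text == EMPTY:
        return Cell(None, None, None)
    match = _CELL.match(text)
    if match is None:
        raise ValueError(f"unrecognized scorecard cell: {text!r}")
    prev = match.group("prev")
    return Cell(
        passed=match.group("mark") == "\u2713",
        score=float(match.group("score")),
        previous_score=float(prev) if prev is not None else None,
    )
```

This sketch keeps only the previous score, not the previous pass or fail mark, since the mark is shown in some annotations and omitted in others.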
Past Reports
- Apr 11, 2026 5:12 PM gpt-5.4-mini lost spatial-1. gemini-2.5-flash dropped on causality-1 and common-sense-1, and is now failing causality-1.
- Apr 11, 2026 12:15 PM gemini-2.5-flash lost causality-1. gpt-5.4-mini recovering. gemini-2.5-flash scores rising.
- Apr 11, 2026 12:11 PM gpt-5.4-mini lost spatial-1. gemini-2.5-flash dropped on common-sense-1. gemini-2.5-flash recovering.
For Agents
- llms.txt — plain text index of all runs
- Agent Skill — how to read and interpret this data