# LLM Weather Report

> Tracking raw LLM reasoning drift — pure endpoint, no agents. A 2389 Research project.

Each prompt is a single API call to the model's chat completions endpoint — no tool use, no multi-turn, no agent scaffolding. Tests the model's raw reasoning capability and tracks changes over time.

## Agent Skill

For detailed instructions on how to read and interpret LLM Weather data, fetch the agent skill page:
- /skill.md

## Runs

Each run tests models on 7 reasoning prompts (logic, math, spatial, causality, code, ambiguity, common sense) and evaluates responses individually for correctness (boolean) and reasoning quality (1-5 score).

- [2026-04-11T22-12-53](/runs/2026-04-11t22-12-53/): gpt-5.4-mini lost spatial-1. gemini-2.5-flash dropped on causality-1; gemini-2.5-flash dropped on common-sense-1. gemini-2.5-flash failing causality-1.
  - Markdown: /runs/2026-04-11t22-12-53/report.md
  - JSON: /runs/2026-04-11t22-12-53/data.json
- [2026-04-11T22-12-53 — Detail](/runs/2026-04-11t22-12-53-detail/): 
  - Markdown: /runs/2026-04-11t22-12-53-detail/report.md
  - JSON: /runs/2026-04-11t22-12-53-detail/data.json
- [2026-04-11T17-15-45](/runs/2026-04-11t17-15-45/): gemini-2.5-flash lost causality-1. gpt-5.4-mini recovering. gemini-2.5-flash scores rising.
  - Markdown: /runs/2026-04-11t17-15-45/report.md
  - JSON: /runs/2026-04-11t17-15-45/data.json
- [2026-04-11T17-15-45 — Detail](/runs/2026-04-11t17-15-45-detail/): 
  - Markdown: /runs/2026-04-11t17-15-45-detail/report.md
  - JSON: /runs/2026-04-11t17-15-45-detail/data.json
- [2026-04-11T17-11-10](/runs/2026-04-11t17-11-10/): gpt-5.4-mini lost spatial-1. gemini-2.5-flash dropped on common-sense-1. gemini-2.5-flash recovering.
  - Markdown: /runs/2026-04-11t17-11-10/report.md
  - JSON: /runs/2026-04-11t17-11-10/data.json
- [2026-04-11T17-11-10 — Detail](/runs/2026-04-11t17-11-10-detail/): 
  - Markdown: /runs/2026-04-11t17-11-10-detail/report.md
  - JSON: /runs/2026-04-11t17-11-10-detail/data.json