OpenAI GPT-5
Benchmark breakdown, practical testing notes, and competitive positioning for GPT-5. Where it leads, where it doesn't, and what it means for the frontier.
GPT-5 ships with measurable gains on reasoning-heavy benchmarks and a tighter hold on long-context fidelity. The bigger story is operational: tool-use loops finish more often, structured output is more reliable, and the model recovers from partial failures with less hand-holding.
What shipped
GPT-5 launches in two variants: the flagship targets agentic and reasoning workloads, while gpt-5-mini takes over the high-volume, latency-sensitive traffic that GPT-4o-mini previously served. The context window is 128k tokens, with materially better recall in the back third of the window.
Benchmark breakdown
| Benchmark | GPT-5 | GPT-4o | Claude 4 |
|---|---|---|---|
| MMLU-Pro | 78 | 74 | **79** |
| GPQA Diamond | 64 | 53 | **66** |
| MATH | **86** | 82 | 84 |
| SWE-bench Verified | **63** | 41 | 60 |
Best per row in bold. Raw scores; not normalized for prompt budget.
Source: OpenAI release notes; Anthropic's published evals; reproduced runs.
GPT-5 finishes more multi-step coding tasks without intervention than the prior generation, with the largest jump on tasks requiring tool retries after partial failures.
Higher is better. Tasks include greenfield, refactor, and bug-fix categories.
Source: In-house eval harness, n=120 tasks