Released Aug 12, 2026

OpenAI GPT-5

Benchmark breakdown, practical testing notes, and competitive positioning for GPT-5. Where it leads, where it doesn't, and what it means for the frontier.

GPT-5 ships with measurable gains on reasoning-heavy benchmarks and a tighter hold on long-context fidelity. The bigger story is operational: tool-use loops finish more often, structured output is more reliable, and the model recovers from partial failures with less hand-holding.
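The structured-output claim is easiest to see in the shape of the loop it removes work from. Below is a minimal sketch of that loop, with `call_model` as a hypothetical stand-in for an API call (here stubbed to return malformed JSON on the first try); the retry policy is illustrative, not OpenAI's.

```python
import json

def call_model(prompt, attempt):
    """Hypothetical stand-in for a chat-completion call.
    Stubbed: returns truncated JSON on the first attempt."""
    if attempt == 0:
        return '{"status": "ok", "files": ["a.py",'  # truncated output
    return '{"status": "ok", "files": ["a.py", "b.py"]}'

def get_structured(prompt, max_attempts=3):
    """Ask for JSON; re-prompt on parse failure; give up after max_attempts."""
    for attempt in range(max_attempts):
        raw = call_model(prompt, attempt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # a real loop would feed the parse error back into the prompt
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts")

result = get_structured("List the files you changed as JSON.")
```

A model that emits valid JSON more often simply exits this loop on the first iteration more of the time.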

78% — MMLU-Pro (up from 74% on GPT-4o)
63% — SWE-bench Verified (best among hosted models)
92% — Long-context recall (128k window, needle-in-haystack)
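The recall number comes from a needle-in-haystack probe: bury one distinctive sentence at a known depth in filler text, then ask the model to retrieve it. A minimal sketch of the construction step, with made-up needle and filler text:

```python
def build_haystack(needle, filler_sentence, total_sentences, depth):
    """Place `needle` at a fractional depth (0.0 = start, 1.0 = end)
    inside a long run of filler sentences."""
    position = int(total_sentences * depth)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(position, needle)
    return " ".join(sentences)

# Probe the back third of the window, where recall reportedly improved.
needle = "The passcode is 7402."
context = build_haystack(needle, "The sky is blue.", total_sentences=1000, depth=0.9)
```

The resulting context is prepended to a retrieval question; recall is the fraction of probes, across depths, where the model returns the needle.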

What shipped

GPT-5 launches in two variants. The flagship targets agentic and reasoning workloads; gpt-5-mini takes over the high-volume, latency-sensitive surface that GPT-4o-mini was holding. Context is 128k tokens with materially better recall in the back third of the window.
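In practice the two variants imply a routing decision per request. The sketch below uses the model ids from this release, but the policy and the 500 ms cutoff are purely illustrative assumptions:

```python
def pick_model(needs_tools: bool, latency_budget_ms: int) -> str:
    """Illustrative routing policy: agentic/reasoning work goes to the
    flagship; latency-sensitive, high-volume traffic goes to the mini
    variant. The 500 ms threshold is made up for the example."""
    if needs_tools or latency_budget_ms > 500:
        return "gpt-5"
    return "gpt-5-mini"
```

This mirrors the split the release describes: reasoning depth when you can afford it, the mini variant on the hot path.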

Benchmark breakdown

Reasoning benchmarks
Benchmark            GPT-5   GPT-4o   Claude 4
MMLU-Pro              78      74       79*
GPQA Diamond          64      53       66*
MATH                  86*     82       84
SWE-bench Verified    63*     41       60

Best per row marked with *. Raw scores; not normalized for prompt budget.

Source: OpenAI release notes; Anthropic published evals; reproduced runs

Agent-loop completion rate
50-step coding tasks, 3 attempts each

GPT-5 finishes more multi-step coding tasks without intervention than the prior generation, with the largest jump on tasks requiring tool retries after partial failures.

Higher is better. Tasks include greenfield, refactor, and bug-fix categories.

Source: In-house eval harness, n=120 tasks
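The "3 attempts each" accounting is worth pinning down: a task counts as complete if any attempt finishes. A hedged sketch of that bookkeeping, with `run_task` stubbed in place of the actual in-house harness:

```python
def run_task(task, attempt):
    """Hypothetical task runner: returns True if the agent loop finished.
    Stubbed so a task succeeds once `attempt` reaches its failure count."""
    return attempt >= task["fails_before"]

def completion_rate(tasks, max_attempts=3):
    """A task is complete if any of its attempts finishes."""
    completed = 0
    for task in tasks:
        if any(run_task(task, a) for a in range(max_attempts)):
            completed += 1
    return completed / len(tasks)

# Three toy tasks: one clean, one flaky-but-recoverable, one hopeless.
tasks = [{"fails_before": 0}, {"fails_before": 2}, {"fails_before": 5}]
rate = completion_rate(tasks)  # 2 of 3 tasks finish within 3 attempts
```

Under this metric, a model that recovers from partial failures on retry (the behavior the article highlights) lifts the rate without winning any first attempt it didn't already win.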