OpenAI GPT-5
Benchmark breakdown, practical testing notes, and competitive positioning for GPT-5. Where it leads, where it doesn't, and what it means for the frontier.
GPT-5 ships with measurable gains on reasoning-heavy benchmarks and a tighter hold on long-context fidelity. The bigger story is operational: tool-use loops finish more often, structured output is more reliable, and the model recovers from partial failures with less hand-holding.
What shipped
GPT-5 launches in two variants: the flagship targets agentic and reasoning workloads, while gpt-5-mini takes over the high-volume, latency-sensitive traffic that GPT-4o-mini previously served. The context window is 128k tokens, with materially better recall in the back third of the window.
Benchmark breakdown
| Benchmark | GPT-5 | GPT-4o | Claude 4 |
|---|---|---|---|
| MMLU-Pro | 78 | 74 | **79** |
| GPQA Diamond | 64 | 53 | **66** |
| MATH | **86** | 82 | 84 |
| SWE-bench Verified | **63** | 41 | 60 |
Best per row in bold. Raw scores; not normalized for prompt budget.
Source: OpenAI release notes; Anthropic's published evals; reproduced runs.
GPT-5 finishes more multi-step coding tasks without intervention than the prior generation, with the largest jump on tasks requiring tool retries after partial failures.
Higher is better. Tasks include greenfield, refactor, and bug-fix categories.
Source: In-house eval harness, n=120 tasks