Benchmark Jul 15, 2026

FrontierMath Across Frontier Models

How GPT-5, Claude 4, and Gemini 2.5 perform on advanced mathematical reasoning under controlled conditions.

FrontierMath stress-tests reasoning at the edge of current capability. Problems are graduate-level mathematics across number theory, combinatorics, and algebraic geometry, and they are explicitly designed to resist memorization.

Test setup

Results

FrontierMath public subset
Model           | Pass rate | Avg tokens | Time (s)
GPT-5           | 32%       | 4200       | 38
Claude 4        | 35%       | 5100       | 44
Gemini 2.5 Pro  | 28%       | 3700       | 32

Pass rate is exact-answer match against the reference solution. Token counts and times are per-attempt averages across all problems.

Source: In-house run, n=200 problems
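Exact-answer matching is stricter than it sounds: "0.5" and "1/2" must be treated as the same answer or scores are understated. A minimal sketch of such a grader is below; the `normalize` helper and its rules are hypothetical, not FrontierMath's actual scoring code.

```python
from fractions import Fraction

def normalize(answer: str) -> str:
    """Canonicalize an answer string so trivially different forms compare
    equal. Hypothetical helper -- the real grader is not shown here."""
    s = answer.strip().lower().replace(" ", "")
    try:
        # "0.5" and "1/2" both canonicalize to "1/2"
        return str(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return s

def pass_rate(predictions, references):
    """Exact-answer match: a problem counts as solved only if the
    normalized prediction equals the normalized reference."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

Anything beyond this (symbolic equivalence, tolerance on floats) is a judgment call that should be documented alongside the numbers.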

Caveats

These numbers reflect a single run on the public subset only. Models gain substantially when given chain-of-thought prompting, scratchpads, or self-consistency sampling — none of which are applied here. Production deployments that rely on math reasoning should evaluate with the prompting strategy they actually use.
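Of the techniques listed above, self-consistency sampling is the simplest to picture: sample several independent answers and take a majority vote over the final results. The sketch below is a minimal illustration of that idea, not the setup used in these runs.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over independently sampled final answers.
    Ties resolve to whichever answer Counter encountered first."""
    counts = Counter(a.strip() for a in sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. three samples from the same model at nonzero temperature:
# self_consistency(["42", "41", "42"]) -> "42"
```

In a real evaluation harness each element of `sampled_answers` would come from a separate model call, and the vote would feed into the same exact-match grading as a single-attempt run.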