# FrontierMath Across Frontier Models
How GPT-5, Claude 4, and Gemini 2.5 Pro perform on advanced mathematical reasoning under controlled conditions.
FrontierMath stress-tests reasoning at the edge of current capability. Problems are graduate-level mathematics across number theory, combinatorics, and algebraic geometry, and they are explicitly designed to resist memorization.
## Test setup

Each model was run once over the same 200-problem public subset, receiving the bare problem statement as its prompt: no chain-of-thought prompting, scratchpads, or self-consistency sampling. Grading is exact-answer match against the reference answer.
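For concreteness, here is a minimal sketch of what this single-pass protocol looks like in code. The `model_client.complete` call and the problem dictionary fields are hypothetical stand-ins for harness internals, not a real API:

```python
import time

def run_single_pass(model_client, problems):
    """One pass over the problem set: bare problem statement in,
    final answer out, no CoT prompting or sampling strategies."""
    records = []
    for problem in problems:
        start = time.perf_counter()
        # `complete` is a hypothetical client method returning the
        # model's final answer string and its token usage.
        response = model_client.complete(problem["statement"])
        records.append({
            "correct": response.answer == problem["answer"],  # exact match
            "tokens": response.total_tokens,
            "seconds": time.perf_counter() - start,
        })
    pass_rate = sum(r["correct"] for r in records) / len(records)
    return pass_rate, records
```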
## Results
| Model | Pass rate | Avg tokens/attempt | Avg time/attempt (s) |
|---|---|---|---|
| GPT-5 | 32% | 4200 | 38 |
| Claude 4 | 35% | 5100 | 44 |
| Gemini 2.5 Pro | 28% | 3700 | 32 |
Pass rate is exact-answer match; token and time figures are averaged across all 200 problems.
Source: In-house run, n=200 problems
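Exact-answer matching still needs some normalization so that equivalent forms of the same answer count as equal. A hedged sketch using SymPy canonicalization; this is illustrative, not the grader behind the numbers above:

```python
import sympy

def answers_match(predicted: str, reference: str) -> bool:
    """Treat two answers as equal if their difference simplifies to
    zero under SymPy, e.g. '1/2' vs '0.5' or 'x + 1' vs '1 + x'."""
    try:
        diff = sympy.sympify(predicted) - sympy.sympify(reference)
        return sympy.simplify(diff) == 0
    except (sympy.SympifyError, TypeError):
        # Fall back to whitespace-insensitive string comparison.
        return predicted.strip() == reference.strip()
```

With this, `'2**10'` and `'1024'` grade as equal, which plain string comparison would miss.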
## Caveats
These numbers reflect a single run on the public subset only. Models gain substantially from chain-of-thought prompting, scratchpads, or self-consistency sampling, none of which were applied here. Production deployments that rely on mathematical reasoning should evaluate with the prompting strategy they actually use.
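As an illustration of one such strategy, a minimal self-consistency loop samples several answers and takes the majority vote. The `sample_answer` callable is a hypothetical stand-in for whatever completion call a deployment uses:

```python
from collections import Counter

def self_consistency(sample_answer, problem, k=5):
    """Sample k answers independently and return the most common one."""
    answers = [sample_answer(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```

Note that at temperature 0 all `k` samples collapse to the same answer, so the vote only helps with nonzero sampling temperature.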