Benchmark Jul 15, 2026

FrontierMath Across Frontier Models

How GPT-5, Claude 4, and Gemini 2.5 perform on advanced mathematical reasoning under controlled conditions.

FrontierMath stress-tests reasoning at the edge of current capability. Problems are graduate-level mathematics across number theory, combinatorics, and algebraic geometry, and they are explicitly designed to resist memorization.

Test setup

Results

FrontierMath public subset
Model           | Pass rate | Avg tokens | Time (s)
GPT-5           | 32%       | 4200       | 38
Claude 4        | 35%       | 5100       | 44
Gemini 2.5 Pro  | 28%       | 3700       | 32

Pass rate is exact-answer match against the reference solution. Token counts and times are per-attempt averages across all problems.

Source: In-house run, n=200 problems
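Exact-answer matching is stricter than it sounds: "0.5" and "1/2" must be treated as the same answer or scores are understated. A minimal sketch of such a grader is below; the `normalize` helper and its rules are hypothetical, not FrontierMath's actual scoring code.

```python
from fractions import Fraction

def normalize(answer: str) -> str:
    """Canonicalize an answer string so trivially different forms compare
    equal. Hypothetical helper -- the real grader is not shown here."""
    s = answer.strip().lower().replace(" ", "")
    try:
        # "0.5" and "1/2" both canonicalize to "1/2"
        return str(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return s

def pass_rate(predictions, references):
    """Exact-answer match: a problem counts as solved only if the
    normalized prediction equals the normalized reference."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

Anything beyond this (symbolic equivalence, tolerance on floats) is a judgment call that should be documented alongside the numbers.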

Caveats

These numbers reflect a single run on the public subset only. Models gain substantially when given chain-of-thought prompting, scratchpads, or self-consistency sampling — none of which are applied here. Production deployments that rely on math reasoning should evaluate with the prompting strategy they actually use.
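Of the techniques listed above, self-consistency sampling is the simplest to picture: sample several independent answers and take a majority vote over the final results. The sketch below is a minimal illustration of that idea, not the setup used in these runs.

```python
from collections import Counter

def self_consistency(sampled_answers):
    """Majority vote over independently sampled final answers.
    Ties resolve to whichever answer Counter encountered first."""
    counts = Counter(a.strip() for a in sampled_answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# e.g. three samples from the same model at nonzero temperature:
# self_consistency(["42", "41", "42"]) -> "42"
```

In a real evaluation harness each element of `sampled_answers` would come from a separate model call, and the vote would feed into the same exact-match grading as a single-attempt run.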