Claude 5 Breaks AI Reasoning Ceiling with Record GPQA Diamond Score

Anthropic’s Claude 5 has achieved 87.3% on the GPQA Diamond benchmark, reportedly the first time any AI system has exceeded 85% on what’s considered one of the hardest reasoning tests available. The benchmark results, published March 3, represent an 8.1 percentage point jump over the previous record, equivalent to roughly four years of prior progress compressed into a single model update.

[Figure: Claude 5 benchmark results, the first AI to break 85% on the GPQA Diamond reasoning benchmark.]

What Makes This Different

GPQA Diamond doesn’t test pattern matching. Each question requires genuine scientific reasoning in biology, chemistry, physics, or mathematics, and takes PhD-level experts 2-3 hours to answer correctly. Each question comes with plausible but incorrect “distractor” answers and cannot be solved through memorization.
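For a sense of what scoring a benchmark like this looks like mechanically, here is a minimal sketch of a multiple-choice evaluation loop. The record format and the ask_model stub are illustrative assumptions, not the official GPQA harness.

```python
# Minimal sketch of scoring a multiple-choice benchmark like GPQA Diamond.
# The record format and ask_model() stub are hypothetical stand-ins; this
# is NOT the official GPQA evaluation harness.

def ask_model(question: str, choices: list[str]) -> str:
    """Stub: a real harness would call the model and parse its chosen letter."""
    return "A"  # placeholder so the sketch runs end to end

def accuracy(questions: list[dict]) -> float:
    """Fraction of questions where the model's letter matches the gold answer."""
    correct = sum(
        1 for q in questions
        if ask_model(q["question"], q["choices"]) == q["answer"]
    )
    return correct / len(questions)

if __name__ == "__main__":
    demo = [{"question": "Which gas dominates Earth's atmosphere?",
             "choices": ["Nitrogen", "Oxygen", "Argon", "CO2"],
             "answer": "A"}]
    print(f"accuracy: {accuracy(demo):.1%}")
    # GPQA Diamond has 198 questions, so an 87.3% score means roughly
    # 173 of them answered correctly.
```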

Model               GPQA Diamond Score
Claude 5 Opus       87.3%
GPT-5               81.1%
Previous record     79.2%
Gemini 3 Pro        78.4%
Claude 4.5 Opus     74.8%

Extended Thinking Made the Difference

Standard Claude 5 mode scored 72.1%. Extended Thinking mode, the paid reasoning feature, jumped to 87.3%. That 15.2-point improvement came from inference-time reasoning optimization, not additional training data or a larger model.
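In Anthropic’s current Messages API, extended thinking is a request-time option, and a Claude 5 call would presumably take the same shape. The sketch below follows that existing pattern; the model identifier is hypothetical, so treat this as an assumption rather than Claude 5’s confirmed interface.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The `thinking` block follows the pattern of Anthropic's existing
# extended-thinking API; "claude-5-opus" is a hypothetical model name.
response = client.messages.create(
    model="claude-5-opus",       # hypothetical identifier
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # cap on tokens spent reasoning before answering
    },
    messages=[{"role": "user", "content": "A GPQA-style chemistry question..."}],
)

# With thinking enabled, the response interleaves "thinking" and "text"
# content blocks; only the final text blocks carry the answer.
for block in response.content:
    if block.type == "text":
        print(block.text)
```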

Anthropic’s Chief Scientist commented: “This breakthrough confirms our thesis: reasoning is learnable, and scale alone was never the path forward.”

The Catch

Extended Thinking requires 40-50x more tokens, significantly increasing costs. And with human expert agreement on GPQA at only 87.9%, scores may be approaching the benchmark’s effective ceiling.
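To make that multiplier concrete, here is a back-of-the-envelope cost comparison. The per-token price and per-question token counts are illustrative assumptions, not Anthropic’s published pricing; only the 40-50x range comes from the report above.

```python
# Back-of-the-envelope cost comparison: standard vs. Extended Thinking mode.
# Price and token counts are illustrative assumptions, not published figures.

PRICE_PER_MTOK_OUTPUT = 75.00  # assumed $ per million output tokens
STANDARD_TOKENS = 1_000        # assumed output tokens per question, standard mode
THINKING_MULTIPLIER = 45       # midpoint of the reported 40-50x range

standard_cost = STANDARD_TOKENS / 1e6 * PRICE_PER_MTOK_OUTPUT
thinking_cost = standard_cost * THINKING_MULTIPLIER

print(f"standard: ${standard_cost:.4f} per question")  # $0.0750
print(f"extended: ${thinking_cost:.4f} per question")  # $3.3750
```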

OpenAI quickly published benchmark results for an unreleased GPT-5.1 model, claiming 85.7%, while Google said upcoming Gemini updates would focus on GPQA Diamond.

Source: Anthropic, March 3, 2026