Claude 5 Breaks AI Reasoning Ceiling with Record GPQA Diamond Score
Anthropic’s Claude 5 has achieved 87.3% on the GPQA Diamond benchmark, reportedly the first time any AI system has exceeded 85% on what’s considered one of the hardest reasoning tests available. The results, reported March 3, represent an 8.1-percentage-point jump over the previous record, equivalent to roughly four years of prior progress compressed into a single model update.
What Makes This Different
GPQA Diamond doesn’t test pattern matching. Each question requires genuine scientific reasoning in biology, chemistry, or physics, and typically takes PhD-level experts 2-3 hours to answer correctly. Questions include plausible but incorrect “distractor” answers and cannot be solved through memorization.
| Model | GPQA Diamond Score |
|---|---|
| Claude 5 Opus | 87.3% |
| GPT-5 | 81.1% |
| Previous record | 79.2% |
| Gemini 3 Pro | 78.4% |
| Claude 4.5 Opus | 74.8% |
Extended Thinking Made the Difference
Standard Claude 5 mode scored 72.1%. Extended Thinking mode, the paid reasoning feature, jumped to 87.3%. That 15.2-point improvement came from inference-time reasoning optimization, not additional training data or a larger model.
Anthropic’s Chief Scientist commented: “This breakthrough confirms our thesis: reasoning is learnable, and scale alone was never the path forward.”
The Catch
Extended Thinking requires 40-50x more tokens, significantly increasing costs. And with human expert agreement on GPQA at only 87.9%, scores may be approaching a practical ceiling.
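To see what a 40-50x token multiplier means in practice, here is a minimal back-of-the-envelope sketch. The per-token price and query size below are hypothetical placeholders, not Anthropic's actual rates; only the 40-50x multiplier comes from the reported figures.

```python
# Rough cost comparison for Extended Thinking's reported 40-50x token usage.
# The price and token counts are assumed for illustration only.

HYPOTHETICAL_PRICE_PER_1K_TOKENS = 0.015  # assumed USD rate, not a real price


def query_cost(tokens: int, thinking_multiplier: float = 1.0) -> float:
    """Estimated cost of one query given a token-usage multiplier."""
    total_tokens = tokens * thinking_multiplier
    return total_tokens / 1000 * HYPOTHETICAL_PRICE_PER_1K_TOKENS


standard = query_cost(2000)        # standard mode, assumed 2,000-token query
extended = query_cost(2000, 45)    # mid-range of the reported 40-50x
print(f"standard: ${standard:.2f}, extended: ${extended:.2f}, "
      f"ratio: {extended / standard:.0f}x")
```

Whatever the real per-token price, the cost ratio scales linearly with the token multiplier, so a 45x token increase means roughly a 45x cost increase per query.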
OpenAI quickly responded with benchmark results from an unreleased GPT-5.1 model, claiming 85.7%, while Google committed to focusing on GPQA Diamond in upcoming Gemini updates.
Source: Anthropic, March 3, 2026