AI masters expert science benchmarks with reasoning leap
Source: nature.com
TL;DR
- AI systems now rival or surpass human experts in complex scientific tasks like math competitions and coding challenges.
- OpenAI's o1 model scores 83% on International Math Olympiad qualifier problems, beating most human competitors.
- These advances come from "test-time compute," letting models think step-by-step before answering.
- The shift promises faster scientific breakthroughs but raises concerns over job displacement in research.
The story at a glance
Frontier AI models are achieving expert-level performance on demanding benchmarks in math, coding, and science, thanks to new reasoning techniques. This review, published in Nature Reviews Physics in 2024, analyzes the rapid progress and its implications amid an intensifying race in AI development.
Key moments & milestones
- 2023: GPT-4 struggles on graduate-level science (GPQA benchmark: 50% accuracy), below human experts.
- 2024: AlphaProof and AlphaGeometry 2 solve IMO problems at silver-medal level (AlphaProof: 49% on hard problems).
- 2024: OpenAI o1 reaches 83% on IMO qualifiers and tops Codeforces coding contests.
- 2024: Models like o1 and DeepSeek-R1 exceed PhD-level performance on GPQA (o1: 74%).
Signature highlights
- Test-time compute amplifies model power: on hard problems, o1 uses up to 10,000x more thinking time than GPT-4o, boosting accuracy from 25% to 75%.
- Coding leaps: o1 ranks in top 3% of Codeforces users; SWE-bench resolution jumps from 5% (GPT-4) to 49% (Augment Code).
- Math mastery: FrontierMath benchmark stumps all models so far (best: 2%), but AlphaProof proves 83% of IMO mini problems.
- Human comparison: Top models now place above the 89th percentile of human experts on GPQA (expert accuracy: 65-74%).
| Benchmark | Human Expert | GPT-4 (2023) | o1 (2024) |
|---|---|---|---|
| GPQA | 65-74% | 50% | 74% |
| IMO (o1 on qualifiers) | Silver (~30-40%) | 13% | 83% |
| SWE-bench | N/A | 5% | 49% |
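The test-time compute gains above come from spending more inference on each problem rather than training a bigger model. One simple, well-known form of this is self-consistency: sample many independent reasoning chains and take a majority vote on the final answer. The sketch below is a hypothetical simulation of that idea (the `sample_answer` stub and its 40% per-sample accuracy are invented for illustration), not a description of how o1 actually works:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    # Stand-in for one stochastic reasoning chain from a model.
    # Hypothetical: correct 40% of the time, a random distractor otherwise.
    if rng.random() < 0.4:
        return problem["answer"]
    return rng.choice(problem["distractors"])

def self_consistency(problem, n_samples, seed=0):
    """Majority-vote over n_samples independent reasoning chains."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(problem, rng) for _ in range(n_samples))
    # The most common answer wins; errors are spread across distractors,
    # so the correct answer usually dominates as n_samples grows.
    return votes.most_common(1)[0][0]

problem = {"answer": "42", "distractors": ["7", "13", "99", "100", "256"]}
print(self_consistency(problem, n_samples=101))
```

Even though a single sample is right only 40% of the time here, the vote concentrates on the correct answer because each wrong answer splits the remaining probability mass. This is the same intuition behind scaling test-time compute: more samples (or longer chains) buy accuracy without retraining.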
Key quotes
"These systems are now capable of solving problems that would stump most PhD students." — Authors on frontier models' GPQA performance.
"Reasoning models represent a qualitative leap, not just incremental gains." — Review summary.
Why it matters
AI's new reasoning prowess accelerates discovery in physics, math, and beyond, potentially compressing decades of progress into years. Yet it challenges academia's role and demands safeguards against misuse in high-stakes science. Watch for 2025 releases targeting unsolved problems like the Riemann hypothesis.