AI masters expert science benchmarks with reasoning leap

Source: nature.com

TL;DR


Frontier AI models are achieving expert-level performance on demanding benchmarks in math, coding, and science, thanks to new reasoning techniques. This review, published in Nature Reviews Physics in 2024, analyzes the rapid progress and its implications amid an intensifying AI development race.

Key moments & milestones


| Benchmark | Human Expert     | GPT-4 (2023) | o1 (2024) |
|-----------|------------------|--------------|-----------|
| GPQA      | 65-74%           | 50%          | 74%       |
| IMO       | Silver (~30-40%) | 13%          | 83%       |
| SWE-bench | N/A              | 5%           | 49%       |

Key quotes

"These systems are now capable of solving problems that would stump most PhD students." — Authors on frontier models' GPQA performance.

"Reasoning models represent a qualitative leap, not just incremental gains." — Review summary.

Why it matters

AI's new reasoning prowess accelerates discovery in physics, math, and beyond, potentially compressing decades of progress into years. Yet it challenges academia's role and demands safeguards against misuse in high-stakes science. Watch for 2025 releases targeting open problems like the Riemann hypothesis.