Claude 4.7 Gets Dumber, Open Models Get Smarter
Source: andrewzuo.com
TL;DR
- Claude Opus 4.7 Released: Andrew Zuo argues Anthropic's new Claude Opus 4.7 uses more tokens and shows performance regressions compared to Opus 4.6.[[1]](https://andrewzuo.com/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
- Token Usage Up 33%: Opus 4.7 consumes about a third more tokens than Opus 4.6 on average across hundreds of runs, raising effective costs at the unchanged $5 input/$25 output per million tokens.[[2]](https://tokens.billchambers.me/leaderboard)
- Open Models Rising: A top open model ranks 7th on LLM Arena, outperforming DeepSeek 3.2 and Mistral, as Claude weakens.[[3]](https://medium.com/@impure/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
The story at a glance
Andrew Zuo's opinion piece critiques Anthropic's Claude Opus 4.7 release for higher token consumption and apparent quality drops versus Opus 4.6, while praising open-source models' gains on arenas like LLM Arena. He questions why users stick with Claude given these issues and the rise of open alternatives. The post appeared hours after Opus 4.7's April 2026 launch, amid user reports of regressions on benchmarks like Thematic Generalization (down from 80.6% to 72.8%) and NYT Connections.[[4]](https://www.reddit.com/r/singularity/comments/1snlp29/claude_opus_47_high_unexpectedly_performs)
Key points
- Opus 4.7's new tokenizer produces 1.3-1.47x as many tokens on code and English content, confirmed by real-world samples and leaderboard data showing +37% on average across 688 Opus 4.6 comparisons.[[5]](https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new-tokenizer-here-s-what-it-costs-you)[[2]](https://tokens.billchambers.me/leaderboard)
- More tokens amount to an effective ~33% price hike, plus slower responses at the same tokens/second, since pricing is unchanged from Opus 4.6 at $5/million input and $25/million output (see the worked example after this list).[[1]](https://andrewzuo.com/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
- User benchmarks show Opus 4.7 regressions: Thematic Generalization 72.8% (down from 80.6%), NYT Connections Extended 41% with high refusals (vs 94.7%), MRCR v2 at 1M tokens 32.2% (vs 78.3%).[[4]](https://www.reddit.com/r/singularity/comments/1snlp29/claude_opus_47_high_unexpectedly_performs)[[6]](https://medium.com/vibe-coding/opus-4-7-is-the-worst-release-anthropic-has-ever-shipped-12772c21ca1e)
- Despite official SWE-bench gains (Verified 87.6%, Pro 64.3%), real-world tests like Openmark.ai show Opus 4.6 beating 4.7 on custom SaaS reasoning tasks.[[7]](https://www.reddit.com/r/singularity/comments/1sn52vp/claude_opus_47_benchmarks)
- Open models advancing: one unnamed open model is 7th on LLM Arena, beating DeepSeek 3.2 and Mistral, part of a broader trend of open-source models catching proprietary leaders.[[3]](https://medium.com/@impure/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
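To make the pricing math concrete, here is a minimal sketch of the effective-cost calculation the article implies. Only the $5/$25 per-million rates and the ~1.33x multiplier come from the sources above; the daily workload numbers are illustrative assumptions.

```python
# Effective cost impact of the ~1.33x token multiplier reported for Opus 4.7.
# Prices are per million tokens (unchanged from Opus 4.6); the workload
# numbers below are illustrative, not from the article.

INPUT_PRICE = 5.00       # $ per 1M input tokens
OUTPUT_PRICE = 25.00     # $ per 1M output tokens
TOKEN_MULTIPLIER = 1.33  # ~33% more tokens for the same content

def cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at Opus list pricing."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

# Hypothetical daily workload: 2M input tokens, 500K output tokens.
old = cost(2_000_000, 500_000)
new = cost(int(2_000_000 * TOKEN_MULTIPLIER), int(500_000 * TOKEN_MULTIPLIER))

print(f"Opus 4.6 tokenizer: ${old:.2f}/day")
print(f"Opus 4.7 tokenizer: ${new:.2f}/day (+{(new / old - 1):.0%})")
# Output: $22.50/day vs $29.93/day. Same list price, ~33% higher bill.
```

The list price never moves, but the bill does, which is why Zuo reads the tokenizer change as a stealth price increase.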
Details and context
Zuo spotted the issues via posts on Claude Code Camp (measuring 1.325x weighted tokens on code samples) and Bill Chambers' leaderboard (+36.9% tokens for 4.7).[[5]](https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new-tokenizer-here-s-what-it-costs-you)[[2]](https://tokens.billchambers.me/leaderboard) These inflate costs for heavy users, since billing is token-based, and fill context windows faster, which matters for the long agent tasks Anthropic targets.
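For anyone wanting to reproduce this kind of comparison, here is a minimal sketch using the Anthropic SDK's token-counting endpoint. The model IDs are placeholders (the article gives no API identifiers), and the two samples stand in for the code and English corpora the linked posts measured.

```python
# Sketch: compare token counts for the same text under two model tokenizers
# via the Anthropic SDK's count_tokens endpoint. Model IDs below are
# placeholders, not confirmed API identifiers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

OLD_MODEL = "claude-opus-4-6"  # placeholder ID
NEW_MODEL = "claude-opus-4-7"  # placeholder ID

samples = [
    "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
    "The quick brown fox jumps over the lazy dog.",
]

def count(model: str, text: str) -> int:
    """Return the input token count for `text` under `model`'s tokenizer."""
    resp = client.messages.count_tokens(
        model=model,
        messages=[{"role": "user", "content": text}],
    )
    return resp.input_tokens

for text in samples:
    old_n, new_n = count(OLD_MODEL, text), count(NEW_MODEL, text)
    print(f"{old_n:4d} -> {new_n:4d} tokens ({new_n / old_n:.2f}x)  {text[:40]!r}")
```

Averaging the per-sample ratios over a weighted mix of code and prose is essentially what the Claude Code Camp post describes.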
The regressions appear tied to safety tweaks, adaptive reasoning defaulting to low effort (saving compute), and a training focus on coding over general tasks like long-context retrieval and generalization.[[7]](https://www.reddit.com/r/singularity/comments/1sn52vp/claude_opus_47_benchmarks)[[4]](https://www.reddit.com/r/singularity/comments/1snlp29/claude_opus_47_high_unexpectedly_performs) Official benchmarks highlight coding wins, but third-party evals reveal the drops, echoing past "nerfs" in Claude updates.
Open models are gaining via rapid releases (e.g., Qwen 3.6, DeepSeek V3.x) and arena rankings, with cost-free local runs adding appeal; the article states no direct causal link, but the timing follows the Opus 4.7 complaints.[[3]](https://medium.com/@impure/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
Key quotes
- "Claude Opus 4.7 is out and it is… interesting. I first got wind that something was up when I saw this post on how Opus 4.7’s token usage was significantly higher." — Andrew Zuo[[1]](https://andrewzuo.com/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
Why it matters
Claude's stumbles highlight the risks of relying on proprietary AI, where updates can degrade usability despite the hype. For developers and businesses, this means weighing higher costs and inconsistent performance against free, improving open models for many tasks. Watch independent benchmarks like LMSYS Arena or Artificial Analysis for signs of Opus 4.7 stabilizing and for the next open releases, such as a potential Llama 4, though real-world fit varies by use case.[[8]](https://artificialanalysis.ai/)
What changed
Opus 4.6 used fewer tokens and scored higher on tests like Thematic Generalization (80.6%), NYT Connections (94.7%), and MRCR 1M (78.3%). Opus 4.7, launched April 16, 2026, uses ~37% more tokens and scores lower on those same tests (72.8%, 41%, 32.2%).[[9]](https://medium.com/your-latest-ai-learnings/claude-opus-4-7-is-a-huge-pile-of-trash-6e389d152a83)[[2]](https://tokens.billchambers.me/leaderboard)
FAQ
Q: Why does Claude Opus 4.7 use more tokens than 4.6?
A: Its new tokenizer produces 1.3-1.47x as many tokens on English and code content, per measured samples and leaderboard data from 688 runs showing a +37% average, raising costs without any list-price change. It also fills context windows and slows responses. The author calls it a stealth price hike.[[5]](https://www.claudecodecamp.com/p/i-measured-claude-4-7-s-new-tokenizer-here-s-what-it-costs-you)[[2]](https://tokens.billchambers.me/leaderboard)
Q: What benchmarks show Opus 4.7 worse than 4.6?
A: Thematic Generalization dropped to 72.8% from 80.6%, NYT Connections Extended to 41% from 94.7% (with high refusals), and MRCR v2 at 1M tokens to 32.2% from 78.3%; Opus 4.7 also lost to 4.6 on custom SaaS tasks in Openmark.ai tests. Official coding benchmarks improved.[[4]](https://www.reddit.com/r/singularity/comments/1snlp29/claude_opus_47_high_unexpectedly_performs)
Q: How are open models getting smarter per the article?
A: One open model ranks 7th on the LLM Arena leaderboard, outperforming DeepSeek 3.2 and Mistral; the article frames this, alongside Claude's issues, as a sign that open models are closing the gap cost-effectively. No benchmarks beyond the arena ranking are cited.[[3]](https://medium.com/@impure/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)
Q: Why hasn't the author switched from Claude yet?
A: The article poses the question rhetorically, without giving a personal answer in the visible text; it implies inertia keeps users on Claude despite the costs and regressions versus open-model gains, and challenges readers to examine their own habits.[[1]](https://andrewzuo.com/claude-is-getting-dumber-and-open-models-are-getting-smarter-so-why-havent-you-switched-yet-f0bef0eddb80)