Claude Sonnet 5 Benchmarks: Every Score Explained
Claude Sonnet 5 benchmark results explained: coding, reasoning, knowledge work, and agentic scores versus Sonnet 4.6 and Opus 4.8. All sourced from Anthropic.
Benchmark tables are easy to publish and hard to interpret. A number like "63.2% on SWE-Bench Pro" tells you almost nothing unless you know what the benchmark measures, what the previous model scored, and what the ceiling looks like.
This is a full breakdown of every Claude Sonnet 5 benchmark result Anthropic published on June 30, 2026, with the context that makes each number meaningful. Every figure here comes from Anthropic's Claude Sonnet 5 announcement and the Claude Sonnet 5 System Card. Where the numbers matter, we explain what the benchmark actually tests and why the result is or is not significant.
The short version: Sonnet 5 posts major gains over Sonnet 4.6 across every evaluation Anthropic disclosed, narrows the gap to Opus 4.8 across five major evaluations, and surpasses the flagship on one. Here is the complete picture.
Claude Sonnet 5 Benchmarks: Every Score Explained
Three patterns run through this table:
- Sonnet 5 beats Sonnet 4.6 on every single benchmark. The gains are not marginal. Terminal-Bench jumps more than 13 points. Knowledge work jumps 223 Elo points.
- Sonnet 5 trails Opus 4.8 on raw capability, but the gap is narrow (six points on the hardest coding benchmark, and effectively zero on several others).
- Sonnet 5 beats Opus 4.8 on knowledge work, the first time any Sonnet-class model has outscored the concurrent Opus flagship on any benchmark.
Let's go through what each of these actually means.
Coding Benchmarks
SWE-Bench Pro: 63.2%
This is the headline coding number, and it is important to note it is the Pro variant, not the more commonly quoted SWE-Bench Verified. SWE-Bench Pro is the harder, contamination-resistant version, so scores run lower than on Verified. Do not compare a Pro score to a Verified score from another model.
SWE-Bench Pro measures whether a model can write, run, and fix code across multiple steps, not just produce a single snippet. Sonnet 5's 63.2% is a five-point gain over Sonnet 4.6's 58.1%, bringing it within striking distance of Opus 4.8's 69.2%.
The interpretation: Sonnet 5 is a clear upgrade over the previous mid-tier model without quite matching the flagship on the hardest engineering tasks. For a developer, that six-point gap to Opus 4.8 is the price of the last increment of coding capability.
Terminal-Bench 2.1: 80.4%
This is the biggest coding jump in the lineup. Terminal-Bench 2.1 tests terminal-based engineering, running commands, managing environments, and executing multi-step technical work in a shell. Sonnet 5's leap from Sonnet 4.6's 67.0% to 80.4% is more than 13 points, and it puts Sonnet 5 roughly level with Opus 4.8 on this evaluation.
For agent builders working in terminal environments (which is most of them), this is arguably the most consequential result in the entire benchmark set. Terminal fluency is the backbone of autonomous coding agents.
Agentic and Computer Use
OSWorld-Verified: 81.2%
OSWorld-Verified measures computer-use ability: controlling a desktop to complete real tasks like navigating applications, filling forms, and manipulating files. Sonnet 5 scores 81.2%, a step up from Sonnet 4.6's revised 78.5%, and essentially tied with Opus 4.8's 81.7%.
Note that Anthropic updated its OSWorld-Verified methodology and now reports Sonnet 4.6 at 78.5% on the revised setup. This is why the Sonnet 4.6 number differs from what appeared in its original launch materials.
The takeaway: on computer-use tasks, Sonnet 5 and Opus 4.8 are effectively equivalent. If your workload is agentic computer use, there is little reason to pay for Opus.
BrowseComp
Anthropic also published cost-performance curves for BrowseComp, an agentic web-search evaluation. Rather than a single headline number, the key finding is directional: Sonnet 5 is a strict improvement over Sonnet 4.6 at every effort level, and it covers a much wider range of cost-performance options than Opus 4.8. At higher effort, its performance can match Opus 4.8 on some tasks.
(Note: Anthropic issued a changelog correction on June 30, 2026, updating the BrowseComp chart to reflect their standard agentic-search methodology, which had initially underestimated Sonnet 5's performance. The corrected methodology uses a 10M token budget with compaction and programmatic tool calling.)
Reasoning Benchmarks
Humanity's Last Exam: 57.4% (with tools), 43.2% (no tools)
Humanity's Last Exam is one of the hardest academic reasoning benchmarks in existence, spanning advanced questions across many disciplines. The two numbers reflect performance with and without tool access.
With tools, Sonnet 5 scores 57.4%, which essentially matches Opus 4.8's 57.9%, a dead heat within margin of error. This is a striking result: on hard multidisciplinary reasoning with tools, the mid-tier model performs at flagship level.
Without tools, Sonnet 5 scores 43.2% versus Opus 4.8's 49.8%, a wider six-point gap. This tells you that Opus 4.8's raw reasoning (without tool assistance) is still meaningfully stronger, but Sonnet 5 closes most of that gap once it can use tools.
Important context: Anthropic updated the grader model for Humanity's Last Exam and revised Sonnet 4.6's scores to 34.6% (no tools) and 46.8% (with tools). This is why these differ from the numbers in the original Sonnet 4.6 launch materials. Always compare against the revised figures.
Knowledge Work: The Historic Result
GDPval-AA v2: 1618 Elo
This is the most significant single number in the entire benchmark set, and not because it is the highest.
GDPval-AA v2 measures applied knowledge work: analyzing documents, synthesizing information, producing structured outputs, the kind of tasks knowledge workers do daily. It is scored as an Elo rating rather than a percentage.
Sonnet 5 scores 1618. Opus 4.8 scores 1615. Sonnet 4.6 scored 1395.
Two things make this remarkable:
- The jump over Sonnet 4.6 is 223 Elo points. That is an enormous generational gain on knowledge work.
- It beats Opus 4.8. The margin is slim (+3), but per Anthropic's System Card, this is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark. The System Card notes Sonnet 5 ranks second at Elo 1618, statistically tied with Opus 4.8 at 1615, and trailing only Fable 5 at 1783.
For everyday professional use (document analysis, slide creation, spreadsheet work, research synthesis), this means Sonnet 5 delivers better-than-Opus quality at a fraction of the cost. On knowledge work specifically, there is no capability tradeoff in choosing Sonnet 5.
To see where Opus 4.8 still pulls ahead on harder tasks, read our full Claude Sonnet 5 vs Opus 4.8 breakdown.
The Effort Dimension Behind the Numbers
One thing the benchmark table does not show directly: how effort level affects these results.
Sonnet 5 exposes an effort parameter with levels from low to extra high (xhigh). Higher effort spends more reasoning tokens, raising both quality and cost. The benchmark scores above generally reflect the model's strong-effort performance.
The practical implication for reading these benchmarks: Sonnet 5's headline numbers are achievable, but the cost to achieve them varies with effort. At medium effort, Sonnet 5 offers substantially better cost efficiency than Opus 4.8. At xhigh effort, the cost can exceed Opus 4.8 at a comparable accuracy point. The benchmarks tell you what Sonnet 5 can do; the effort level determines what it costs you to get there.
How to Read These Benchmarks Honestly
A few caveats worth keeping in mind:
SWE-Bench is saturating. At the frontier, SWE-Bench (both Verified and Pro) is becoming a saturated benchmark. Gains of five to six points are meaningful but not structural. The more interesting signals are in independent evaluations run by companies in their own production harnesses.
Independent evals tell a consistent story. Cursor's internal CursorBench, measured in their production harness, showed Sonnet 5 at a meaningful gain over Sonnet 4.6, corroborating Anthropic's published direction of improvement.
Benchmark numbers are self-reported. These are Anthropic's own published results. They are almost certainly accurate, but as with any vendor benchmarks, real-world performance on your specific workload is the only test that fully matters.
The tokenizer affects cost, not capability. Sonnet 5 uses a new tokenizer that produces roughly 30% more tokens for the same text. This does not change the benchmark scores, but it does affect what an equivalent workload costs. See Anthropic's documentation for details.
What the Benchmarks Mean for Your Workload
Translating the numbers into practical guidance:
- If you do knowledge work: Sonnet 5 is at least as good as Opus 4.8 and much cheaper. Use it.
- If you do everyday coding and agentic tasks: Sonnet 5 is close to Opus 4.8 (within six points on the hardest tasks, tied on most others) at 40 to 60 percent of the cost. Use it, and escalate to Opus 4.8 only for the hardest problems.
- If you do terminal-based engineering or computer use: Sonnet 5 essentially matches Opus 4.8. Use it.
- If you do the hardest coding problems: Opus 4.8's six-point SWE-Bench Pro lead may justify the premium. Test both.
- If you do cybersecurity work: neither benchmark set covers this well because Sonnet 5 was deliberately restricted from cyber capability. Use Opus 4.8.
For the full set of evaluations, including many not covered in the launch announcement, Anthropic publishes complete results in the Claude Sonnet 5 System Card and on its Transparency Hub.
Turning Benchmark-Grade Models Into Real Products
Benchmarks tell you what a model can do in a test harness. Shipping a product that uses that model is a different challenge entirely, and it is where most AI projects stall. Between the API and a live product sits a UI, a database, authentication, payments, hosting, observability, deployment, and an iteration loop that usually demands a full engineering team.
Emergent is the platform built to close that gap. It is an AI app builder that takes a plain-language description of what you want to build and ships a real, production-ready full-stack application. Not a prototype, not a mockup. A working product with frontend, backend, database, auth, and deployment all handled in a single coordinated pass.
What makes Emergent meaningfully different from every other AI builder in 2026 is the depth of what it generates. Most no-code tools stop at the UI. Emergent reasons through how the entire system should work before writing it, then produces real code you fully own. The output syncs directly to your GitHub repository, so there is no platform lock-in. You can export it, deploy it elsewhere, or hand it off to an engineering team.
The integration story matters when you are wiring up a model like Sonnet 5. Emergent connects to the Claude API (and any other API you need) by describing what you want to integrate. No glue code, no SDK wrangling. When something breaks in production, Emergent's multi-agent framework analyzes backend logs and resolves issues without human intervention. When requirements change, you iterate by prompt rather than rebuilding.
For teams in regulated industries, Emergent is SOC 2 Type I certified with SSO/SAML, role-based access control, and audit logging built in. That combination of consumer-grade ease and enterprise-grade compliance is what makes it a different category from both traditional no-code tools and AI coding assistants.
The benchmark score tells you the model is capable. Emergent is how you turn that capability into a product real users can use, in hours rather than months.
The Bottom Line
Claude Sonnet 5's benchmarks tell a clear story: it is the biggest generation-over-generation leap in Sonnet history. It beats Sonnet 4.6 on every published evaluation, often by double-digit margins, and it closes most of the gap to Opus 4.8 while actually surpassing the flagship on knowledge work.
The one benchmark where Opus 4.8 still clearly leads is the hardest coding tasks (SWE-Bench Pro), where its six-point advantage marks the boundary of Sonnet 5's reach. Everywhere else, the two are close enough that the price difference makes Sonnet 5 the rational default.
Read the numbers with the effort dial in mind: Sonnet 5 achieves these results, and at low to medium effort it does so far more cheaply than Opus 4.8. That combination of near-flagship capability and lower cost is what makes these benchmarks matter.

Emergent turns your idea into a full-stack web or mobile app, no coding required.
- No coding required
- Web & mobile apps
- Deploys instantly
Frequently Asked Questions
Your Questions, Answered
on emergent today
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Suspendisse varius enim in eros elementum tristique. Duis cursus, mi quis viverra ornare, eros dolor interdum nulla, ut commodo diam libero vitae erat. Aenean faucibus nibh et justo cursus id rutrum lorem imperdiet. Nunc ut sem vitae risus tristique posuere.






