Claude Sonnet 5 Benchmarks: Every Score Explained

Claude Sonnet 5 benchmark results explained: coding, reasoning, knowledge work, and agentic scores versus Sonnet 4.6 and Opus 4.8. All sourced from Anthropic.

Written by

Divit Bhat

Reviewed by

Sakthy

Last updated:

July 1, 2026

min read

Table of Contents

Heading

Benchmark tables are easy to publish and hard to interpret. A number like "63.2% on SWE-Bench Pro" tells you almost nothing unless you know what the benchmark measures, what the previous model scored, and what the ceiling looks like.

This is a full breakdown of every Claude Sonnet 5 benchmark result Anthropic published on June 30, 2026, with the context that makes each number meaningful. Every figure here comes from Anthropic's Claude Sonnet 5 announcement and the Claude Sonnet 5 System Card. Where the numbers matter, we explain what the benchmark actually tests and why the result is or is not significant.

The short version: Sonnet 5 posts major gains over Sonnet 4.6 across every evaluation Anthropic disclosed, narrows the gap to Opus 4.8 across five major evaluations, and surpasses the flagship on one. Here is the complete picture.

Claude Sonnet 5 Benchmarks: Every Score Explained

Benchmark	Sonnet 5	Sonnet 4.6	Opus 4.8	What It Measures
SWE-Bench Pro	63.2%	58.1%	69.2%	Agentic coding, hard real-world tasks
Terminal-Bench 2.1	80.4%	67.0%	~79%	Terminal-based engineering
OSWorld-Verified	81.2%	78.5%	81.7%	Computer use
Humanity's Last Exam (no tools)	43.2%	34.6%	49.8%	Hardest academic questions
Humanity's Last Exam (with tools)	57.4%	46.8%	57.9%	Same, with tool access
GDPval-AA v2 (Elo)	1618	1395	1615	Knowledge work quality

Three patterns run through this table:

Sonnet 5 beats Sonnet 4.6 on every single benchmark. The gains are not marginal. Terminal-Bench jumps more than 13 points. Knowledge work jumps 223 Elo points.
Sonnet 5 trails Opus 4.8 on raw capability, but the gap is narrow (six points on the hardest coding benchmark, and effectively zero on several others).
Sonnet 5 beats Opus 4.8 on knowledge work, the first time any Sonnet-class model has outscored the concurrent Opus flagship on any benchmark.

Let's go through what each of these actually means.

Coding Benchmarks

SWE-Bench Pro: 63.2%

This is the headline coding number, and it is important to note it is the Pro variant, not the more commonly quoted SWE-Bench Verified. SWE-Bench Pro is the harder, contamination-resistant version, so scores run lower than on Verified. Do not compare a Pro score to a Verified score from another model.

SWE-Bench Pro measures whether a model can write, run, and fix code across multiple steps, not just produce a single snippet. Sonnet 5's 63.2% is a five-point gain over Sonnet 4.6's 58.1%, bringing it within striking distance of Opus 4.8's 69.2%.

The interpretation: Sonnet 5 is a clear upgrade over the previous mid-tier model without quite matching the flagship on the hardest engineering tasks. For a developer, that six-point gap to Opus 4.8 is the price of the last increment of coding capability.

Terminal-Bench 2.1: 80.4%

This is the biggest coding jump in the lineup. Terminal-Bench 2.1 tests terminal-based engineering, running commands, managing environments, and executing multi-step technical work in a shell. Sonnet 5's leap from Sonnet 4.6's 67.0% to 80.4% is more than 13 points, and it puts Sonnet 5 roughly level with Opus 4.8 on this evaluation.

For agent builders working in terminal environments (which is most of them), this is arguably the most consequential result in the entire benchmark set. Terminal fluency is the backbone of autonomous coding agents.

Agentic and Computer Use

OSWorld-Verified: 81.2%

OSWorld-Verified measures computer-use ability: controlling a desktop to complete real tasks like navigating applications, filling forms, and manipulating files. Sonnet 5 scores 81.2%, a step up from Sonnet 4.6's revised 78.5%, and essentially tied with Opus 4.8's 81.7%.

Note that Anthropic updated its OSWorld-Verified methodology and now reports Sonnet 4.6 at 78.5% on the revised setup. This is why the Sonnet 4.6 number differs from what appeared in its original launch materials.

The takeaway: on computer-use tasks, Sonnet 5 and Opus 4.8 are effectively equivalent. If your workload is agentic computer use, there is little reason to pay for Opus.

BrowseComp

Anthropic also published cost-performance curves for BrowseComp, an agentic web-search evaluation. Rather than a single headline number, the key finding is directional: Sonnet 5 is a strict improvement over Sonnet 4.6 at every effort level, and it covers a much wider range of cost-performance options than Opus 4.8. At higher effort, its performance can match Opus 4.8 on some tasks.

(Note: Anthropic issued a changelog correction on June 30, 2026, updating the BrowseComp chart to reflect their standard agentic-search methodology, which had initially underestimated Sonnet 5's performance. The corrected methodology uses a 10M token budget with compaction and programmatic tool calling.)

Reasoning Benchmarks

Humanity's Last Exam: 57.4% (with tools), 43.2% (no tools)

Humanity's Last Exam is one of the hardest academic reasoning benchmarks in existence, spanning advanced questions across many disciplines. The two numbers reflect performance with and without tool access.

With tools, Sonnet 5 scores 57.4%, which essentially matches Opus 4.8's 57.9%, a dead heat within margin of error. This is a striking result: on hard multidisciplinary reasoning with tools, the mid-tier model performs at flagship level.

Without tools, Sonnet 5 scores 43.2% versus Opus 4.8's 49.8%, a wider six-point gap. This tells you that Opus 4.8's raw reasoning (without tool assistance) is still meaningfully stronger, but Sonnet 5 closes most of that gap once it can use tools.

Important context: Anthropic updated the grader model for Humanity's Last Exam and revised Sonnet 4.6's scores to 34.6% (no tools) and 46.8% (with tools). This is why these differ from the numbers in the original Sonnet 4.6 launch materials. Always compare against the revised figures.

Knowledge Work: The Historic Result

GDPval-AA v2: 1618 Elo

This is the most significant single number in the entire benchmark set, and not because it is the highest.

GDPval-AA v2 measures applied knowledge work: analyzing documents, synthesizing information, producing structured outputs, the kind of tasks knowledge workers do daily. It is scored as an Elo rating rather than a percentage.

Sonnet 5 scores 1618. Opus 4.8 scores 1615. Sonnet 4.6 scored 1395.

Two things make this remarkable:

The jump over Sonnet 4.6 is 223 Elo points. That is an enormous generational gain on knowledge work.
It beats Opus 4.8. The margin is slim (+3), but per Anthropic's System Card, this is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark. The System Card notes Sonnet 5 ranks second at Elo 1618, statistically tied with Opus 4.8 at 1615, and trailing only Fable 5 at 1783.

For everyday professional use (document analysis, slide creation, spreadsheet work, research synthesis), this means Sonnet 5 delivers better-than-Opus quality at a fraction of the cost. On knowledge work specifically, there is no capability tradeoff in choosing Sonnet 5.

To see where Opus 4.8 still pulls ahead on harder tasks, read our full Claude Sonnet 5 vs Opus 4.8 breakdown.

The Effort Dimension Behind the Numbers

One thing the benchmark table does not show directly: how effort level affects these results.

Sonnet 5 exposes an effort parameter with levels from low to extra high (xhigh). Higher effort spends more reasoning tokens, raising both quality and cost. The benchmark scores above generally reflect the model's strong-effort performance.

The practical implication for reading these benchmarks: Sonnet 5's headline numbers are achievable, but the cost to achieve them varies with effort. At medium effort, Sonnet 5 offers substantially better cost efficiency than Opus 4.8. At xhigh effort, the cost can exceed Opus 4.8 at a comparable accuracy point. The benchmarks tell you what Sonnet 5 can do; the effort level determines what it costs you to get there.

How to Read These Benchmarks Honestly

A few caveats worth keeping in mind:

SWE-Bench is saturating. At the frontier, SWE-Bench (both Verified and Pro) is becoming a saturated benchmark. Gains of five to six points are meaningful but not structural. The more interesting signals are in independent evaluations run by companies in their own production harnesses.

Independent evals tell a consistent story. Cursor's internal CursorBench, measured in their production harness, showed Sonnet 5 at a meaningful gain over Sonnet 4.6, corroborating Anthropic's published direction of improvement.

Benchmark numbers are self-reported. These are Anthropic's own published results. They are almost certainly accurate, but as with any vendor benchmarks, real-world performance on your specific workload is the only test that fully matters.

The tokenizer affects cost, not capability. Sonnet 5 uses a new tokenizer that produces roughly 30% more tokens for the same text. This does not change the benchmark scores, but it does affect what an equivalent workload costs. See Anthropic's documentation for details.

What the Benchmarks Mean for Your Workload

Translating the numbers into practical guidance:

If you do knowledge work: Sonnet 5 is at least as good as Opus 4.8 and much cheaper. Use it.
If you do everyday coding and agentic tasks: Sonnet 5 is close to Opus 4.8 (within six points on the hardest tasks, tied on most others) at 40 to 60 percent of the cost. Use it, and escalate to Opus 4.8 only for the hardest problems.
If you do terminal-based engineering or computer use: Sonnet 5 essentially matches Opus 4.8. Use it.
If you do the hardest coding problems: Opus 4.8's six-point SWE-Bench Pro lead may justify the premium. Test both.
If you do cybersecurity work: neither benchmark set covers this well because Sonnet 5 was deliberately restricted from cyber capability. Use Opus 4.8.

For the full set of evaluations, including many not covered in the launch announcement, Anthropic publishes complete results in the Claude Sonnet 5 System Card and on its Transparency Hub.

Turning Benchmark-Grade Models Into Real Products

Benchmarks tell you what a model can do in a test harness. Shipping a product that uses that model is a different challenge entirely, and it is where most AI projects stall. Between the API and a live product sits a UI, a database, authentication, payments, hosting, observability, deployment, and an iteration loop that usually demands a full engineering team.

Emergent is the platform built to close that gap. It is an AI app builder that takes a plain-language description of what you want to build and ships a real, production-ready full-stack application. Not a prototype, not a mockup. A working product with frontend, backend, database, auth, and deployment all handled in a single coordinated pass.

What makes Emergent meaningfully different from every other AI builder in 2026 is the depth of what it generates. Most no-code tools stop at the UI. Emergent reasons through how the entire system should work before writing it, then produces real code you fully own. The output syncs directly to your GitHub repository, so there is no platform lock-in. You can export it, deploy it elsewhere, or hand it off to an engineering team.

The integration story matters when you are wiring up a model like Sonnet 5. Emergent connects to the Claude API (and any other API you need) by describing what you want to integrate. No glue code, no SDK wrangling. When something breaks in production, Emergent's multi-agent framework analyzes backend logs and resolves issues without human intervention. When requirements change, you iterate by prompt rather than rebuilding.

For teams in regulated industries, Emergent is SOC 2 Type I certified with SSO/SAML, role-based access control, and audit logging built in. That combination of consumer-grade ease and enterprise-grade compliance is what makes it a different category from both traditional no-code tools and AI coding assistants.

The benchmark score tells you the model is capable. Emergent is how you turn that capability into a product real users can use, in hours rather than months.

The Bottom Line

Claude Sonnet 5's benchmarks tell a clear story: it is the biggest generation-over-generation leap in Sonnet history. It beats Sonnet 4.6 on every published evaluation, often by double-digit margins, and it closes most of the gap to Opus 4.8 while actually surpassing the flagship on knowledge work.

The one benchmark where Opus 4.8 still clearly leads is the hardest coding tasks (SWE-Bench Pro), where its six-point advantage marks the boundary of Sonnet 5's reach. Everywhere else, the two are close enough that the price difference makes Sonnet 5 the rational default.

Read the numbers with the effort dial in mind: Sonnet 5 achieves these results, and at low to medium effort it does so far more cheaply than Opus 4.8. That combination of near-flagship capability and lower cost is what makes these benchmarks matter.

Build your app in minutes

Emergent turns your idea into a full-stack web or mobile app, no coding required.

No coding required
Web & mobile apps
Deploys instantly

Frequently Asked Questions

Your Questions, Answered

What are Claude Sonnet 5's benchmark scores?

Key scores from Anthropic's launch: SWE-Bench Pro 63.2%, Terminal-Bench 2.1 80.4%, OSWorld-Verified 81.2%, Humanity's Last Exam 57.4% (with tools) and 43.2% (no tools), and GDPval-AA v2 at 1618 Elo. All figures are from Anthropic's Claude Sonnet 5 announcement and System Card.

Did Claude Sonnet 5 beat Opus 4.8 on any benchmark?

Yes. On GDPval-AA v2, a knowledge-work benchmark, Sonnet 5 scored 1618 Elo versus Opus 4.8's 1615. Anthropic's System Card notes this is the first time a Sonnet-class model has outscored the concurrent Opus flagship on any benchmark. On most other benchmarks, Opus 4.8 still leads.

Is SWE-Bench Pro the same as SWE-Bench Verified?

No. SWE-Bench Pro is the harder, contamination-resistant variant, so scores run lower than on the more commonly quoted SWE-Bench Verified. Sonnet 5's published coding number is 63.2% on the Pro variant. Do not compare a Pro score against a Verified score from another model.

How much did Sonnet 5 improve over Sonnet 4.6?

Sonnet 5 beats Sonnet 4.6 on every published benchmark. Standout gains include Terminal-Bench 2.1 (up more than 13 points, from 67.0% to 80.4%) and GDPval-AA v2 knowledge work (up 223 Elo points, from 1395 to 1618). Anthropic revised some Sonnet 4.6 scores due to updated grading methodology.

Do the benchmarks account for Sonnet 5's new tokenizer?

The benchmark scores measure capability and are not affected by the tokenizer. However, Sonnet 5's new tokenizer produces roughly 30% more tokens for the same text, which affects the cost of an equivalent workload even though per-token pricing is unchanged. The tokenizer matters for budgeting, not for interpreting capability scores.

Start Building
on emergent today

Try Emergent

Build Full-Stack

Web & mobile apps in minutes

Continue with Google

Thank you! Your submission has been received!

Oops! Something went wrong while submitting the form.

By continuing, you agree to our
Terms of Service and Privacy Policy.

Claude Sonnet 5 Benchmarks: Every Score Explained

Claude Sonnet 5 Benchmarks: Every Score Explained