Claude Opus 4.8 Benchmarks: Scores, Rankings, and Pricing (May 2026)

Claude Opus 4.8 is Anthropic’s flagship model, released on May 28, 2026, roughly 41 days after Opus 4.7. The release pushes Anthropic to the top of the independent Artificial Analysis Intelligence Index and retakes the lead on knowledge-work and scientific reasoning evaluations, while keeping the same price and the same 1 million token context window as its predecessor.

Quick answer: Opus 4.8 scores 61.4 on the Artificial Analysis Intelligence Index, enough for the #1 spot, 4.1 points above Opus 4.7 and 1.2 points above GPT-5.5 (xhigh). It also leads on GDPval-AA and Humanity’s Last Exam, but GPT-5.5 still edges it on terminal-agent coding.

Anthropic’s own benchmark numbers are self-reported. Independent index results come from Artificial Analysis. Treat both as a starting point and test on your own workload.

Artificial Analysis Intelligence Index ranking

On the composite Artificial Analysis Intelligence Index, Opus 4.8 reaches 61.4 and takes the overall lead. That is a 4.1-point jump over Opus 4.7 and 1.2 points ahead of GPT-5.5 (xhigh), the previous index leader. The gain comes from improvements in both real-world agentic work and frontier academic reasoning rather than a single category.

The model was measured at its “max” effort setting to test peak performance. Across the full index, it used about the same number of output tokens as Opus 4.7 while scoring higher, so the index gain did not come from simply spending more tokens overall.

Model	Intelligence Index
Claude Opus 4.8	61.4
GPT-5.5 (xhigh)	60.2
Claude Opus 4.7	57.3

GDPval-AA knowledge-work score

GDPval-AA is the primary evaluation for agentic performance on knowledge-work tasks. Opus 4.8 scored 1,890 Elo at launch with its “max” effort setting, which is 137 points above Opus 4.7 and 121 points ahead of the next-best model, GPT-5.5 (xhigh). Head-to-head on the GDPval task set, a 121-point Elo gap implies roughly a 67% win rate against GPT-5.5.
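To see where the 67% figure comes from, here is a minimal sketch of the conventional Elo expected-score formula applied to the 121-point gap. It assumes GDPval-AA uses the standard 400-point logistic Elo scale, which the article does not confirm; the function name is illustrative.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score (roughly, win rate) for the higher-rated side.

    Assumes the conventional Elo logistic curve with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# Gap between Opus 4.8 (1,890) and GPT-5.5 (xhigh) (1,769) as reported above.
gap_vs_gpt = 1890 - 1769  # 121 Elo points

print(f"{elo_expected_score(gap_vs_gpt):.1%}")  # prints 66.7%
```

Under that assumption the result rounds to the 67% quoted above. A gap of zero gives exactly 50%, which is a quick sanity check on the formula.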

Efficiency improved here too. Opus 4.8 reached that score in about 15% fewer turns per task and with 35% fewer output tokens than Opus 4.7. It still uses around 30% more turns than GPT-5.5, the second-ranked model, so it is not the most economical on turn count.

Model	GDPval-AA (Elo)
Claude Opus 4.8	1,890
GPT-5.5 (xhigh)	1,769
Claude Opus 4.7	1,753

Coding benchmarks: SWE-bench, Terminal-Bench

Agentic coding is where Anthropic put the most weight. Opus 4.8 reaches 88.6% on SWE-bench Verified, up from 87.6%, but that benchmark is close to saturation, so the more meaningful jump is SWE-bench Pro, which climbs from 64.3% to 69.2%. On SWE-bench Pro it leads GPT-5.5 (58.6%) and Gemini 3.1 Pro (54.2%).

The one coding category Opus 4.8 does not top is agentic terminal work. On Terminal-Bench 2.1 it scores 74.6%, behind GPT-5.5 at 78.2% but still ahead of Gemini 3.1 Pro (70.3%) and Opus 4.7 (66.1%). On the Artificial Analysis variant, Terminal-Bench Hard, Opus 4.8 gained 6.8 points over Opus 4.7.

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
SWE-bench Verified	88.6%	87.6%	—	—
SWE-bench Pro	69.2%	64.3%	58.6%	54.2%
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
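For readers who want the margins rather than the raw scores, the sketch below recomputes Opus 4.8's lead or deficit against each rival from the table above. The numbers are copied from the table; the `scores` dictionary and the `margins` helper are illustrative names, not part of any benchmark tooling.

```python
# Scores (percent) copied from the table above.
scores = {
    "SWE-bench Pro": {
        "Opus 4.8": 69.2, "Opus 4.7": 64.3, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2,
    },
    "Terminal-Bench 2.1": {
        "Opus 4.8": 74.6, "Opus 4.7": 66.1, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3,
    },
}


def margins(benchmark: str, baseline: str = "Opus 4.8") -> dict:
    """Return baseline-minus-rival gaps in percentage points (positive = baseline leads)."""
    row = scores[benchmark]
    return {
        model: round(row[baseline] - score, 1)
        for model, score in row.items()
        if model != baseline
    }


for name in scores:
    print(name, margins(name))
```

A negative value marks a benchmark where Opus 4.8 trails, which here is only the GPT-5.5 entry on Terminal-Bench 2.1.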

Scientific and academic reasoning

Earlier Claude releases trailed rivals on hard academic reasoning. Opus 4.8 closes that gap. It leads Humanity’s Last Exam by about one point in a tight contest between Anthropic, Google DeepMind, and OpenAI, scoring 49.8% without tools and 57.9% with tools. On CritPt, a frontier physics benchmark developed by Argonne and UIUC, it now scores higher than Gemini 3.1 Pro, though it remains behind GPT-5.4 and GPT-5.5.

On GPQA Diamond, a graduate-level reasoning test, it reaches 93.6%. Scores hold roughly flat versus Opus 4.7 on GPQA, AA-LCR, and SciCode.

Hallucination rate and AA-Omniscience

Opus 4.8 reaches #2 on the AA-Omniscience Index at 27.4, slightly ahead of Opus 4.7 and behind only Gemini 3.1 Pro at 32.9. Accuracy ticked up to 46.6%, and the hallucination rate held roughly flat at 35.9%. Anthropic’s models continue to show substantially lower hallucination rates than comparable models from Google and OpenAI, which matters for tasks where confidently invented facts are costly.

A related theme is honesty about the model’s own output. Opus 4.8 is roughly four times less likely than Opus 4.7 to let a flaw in its own code pass without flagging it, meaning it is more willing to say it is unsure instead of declaring a task finished too early. One side effect shows up on Vending Bench, a long-horizon agentic test, where the more aligned behavior produces lower scores than older releases that were more willing to bluff.

Other benchmark gains over Opus 4.7

Beyond the headline tests, Opus 4.8 makes material gains on several agentic and instruction-following evaluations. Computer use improves on OSWorld-Verified, and agentic financial analysis leads the field on Finance Agent v2.

Benchmark	Opus 4.8	Change vs Opus 4.7
OSWorld-Verified (computer use)	83.4%	+0.6 (82.8%)
τ²-Bench Telecom	—	+5.9 points
IFBench (instruction following)	—	+3.6 points
Terminal-Bench Hard	—	+6.8 points
Finance Agent v2	53.9%	field-leading

Pricing, context window, and effort settings

Pricing is unchanged from Opus 4.7, at $5 per million input tokens and $25 per million output tokens. Cache writes carry a 25% premium ($6.25 per million tokens) with a 5-minute time to live, and cache hits get a 90% discount ($0.50 per million tokens). The context window stays at 1 million tokens.
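As a rough illustration of how these rates combine, the sketch below estimates the cost of a single request from its token counts. The per-million prices are the ones quoted above; the function and parameter names are hypothetical, and real invoices may differ (Fast Mode pricing, for instance, is not modeled).

```python
# USD per 1M tokens, as quoted in this article.
PRICE_PER_MTOK = {
    "input": 5.00,
    "cache_write": 6.25,  # 25% premium over standard input
    "cache_hit": 0.50,    # 90% discount on standard input
    "output": 25.00,
}


def request_cost_usd(input_tokens: int = 0, cache_write_tokens: int = 0,
                     cache_hit_tokens: int = 0, output_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD from its token counts."""
    counts = {
        "input": input_tokens,
        "cache_write": cache_write_tokens,
        "cache_hit": cache_hit_tokens,
        "output": output_tokens,
    }
    return sum(PRICE_PER_MTOK[kind] * n / 1_000_000 for kind, n in counts.items())


# Hypothetical request: a 200k-token prompt served from cache, 2k fresh input, 1.5k output.
cached = request_cost_usd(input_tokens=2_000, cache_hit_tokens=200_000, output_tokens=1_500)
# The same request with no caching at all.
uncached = request_cost_usd(input_tokens=202_000, output_tokens=1_500)

print(f"cached: ${cached:.4f}, uncached: ${uncached:.4f}")
```

In this made-up example the cached request costs about a seventh of the uncached one, which is why the cache-hit rate matters more than the headline input price for long, repeated prompts.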

A new Fast Mode runs the same model at about 2.5x the speed and is priced at roughly one-third the cost of the previous fast tier. Developers can switch it on in Claude Code with the /fast command.

Effort remains the recommended way to balance performance and latency. Anthropic now recommends “high” as the default, with “extra” and “max” available for harder jobs. Note that the level previously called “xhigh” on Opus 4.7 has been renamed to “extra,” so older and newer effort labels are not directly comparable.

Detail	Value
Standard input / output	$5 / $25 per 1M tokens
Cache write / cache hit	$6.25 / $0.50 per 1M tokens
Context window	1M tokens
API model ID	claude-opus-4-8 (or claude-opus-4-8[1m])
Default effort	high

How to switch from Opus 4.7

Set your default model to claude-opus-4-8, or use claude-opus-4-8[1m] if you need the full 1 million token context. For most teams this is the entire migration.

Pick an effort level if your client exposes one. The default “high” uses roughly the same tokens as Opus 4.7 while scoring higher, so you only need “extra” or “max” for tougher tasks.
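The two steps so far can be sketched as a small settings helper. This is a minimal sketch, not an official migration tool: the model IDs and effort labels are taken from this article rather than from an API reference, the list of effort levels may be incomplete, and the returned dictionary shape is hypothetical.

```python
from typing import Optional

# Assumption: label rename as described in this article ("xhigh" on Opus 4.7 -> "extra").
EFFORT_RENAMES = {"xhigh": "extra"}
DEFAULT_EFFORT = "high"  # the default this article says Anthropic recommends


def migrated_settings(effort: Optional[str] = None, long_context: bool = False) -> dict:
    """Build hypothetical client settings for Opus 4.8 from old-style choices."""
    # Model IDs as listed in the pricing table above.
    model = "claude-opus-4-8[1m]" if long_context else "claude-opus-4-8"
    if effort is None:
        new_effort = DEFAULT_EFFORT
    else:
        # Rename only the label; unknown labels pass through unchanged.
        new_effort = EFFORT_RENAMES.get(effort, effort)
    return {"model": model, "effort": new_effort}


print(migrated_settings(effort="xhigh", long_context=True))
```

Because older and newer effort labels are not directly comparable, treat the rename as a label mapping only and re-check quality and token use after switching.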

Measure tokens per task on your own traffic, since higher effort can spend more output tokens. The model is available across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry, and on claude.ai the new model and effort control are available with no setup.
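One way to measure tokens per task is to total output tokens across all turns of each task and then average per model. The sketch below assumes you already log a (model, task ID, output tokens) record for every turn; the record format, model labels, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_output_tokens_per_task(records):
    """Average total output tokens per task, grouped by model.

    records: iterable of (model, task_id, output_tokens), one entry per turn.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for model, task_id, output_tokens in records:
        totals[model][task_id] += output_tokens
    return {model: mean(tasks.values()) for model, tasks in totals.items()}


# Made-up log entries for illustration only.
log = [
    ("opus-4.7", "task-1", 1200),
    ("opus-4.7", "task-1", 800),
    ("opus-4.7", "task-2", 3000),
    ("opus-4.8", "task-1", 1500),
    ("opus-4.8", "task-2", 1700),
]

print(mean_output_tokens_per_task(log))
```

Summing per task before averaging matters because a model that takes fewer turns can still spend more tokens per turn, so per-turn averages alone can mislead.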

Full details and the launch announcement are on Anthropic’s Claude Opus 4.8 page.

The short version is that Opus 4.8 leads the overall intelligence index and most agentic and reasoning benchmarks, with GPT-5.5 holding its ground mainly on terminal-agent coding. With flat pricing, the same 1 million token context, and a cheaper Fast Mode, the upgrade is mostly a question of testing the new effort defaults against your existing Opus 4.7 traffic.