Claude Opus 4.7 launched on April 16, 2026, arriving just over two months after Opus 4.6 hit general availability in early February. Both models sit at the top of Anthropic's Claude lineup, targeting developers and enterprises who need high-end reasoning, agentic coding, and long-context work. The pricing is identical — $5 per million input tokens and $25 per million output tokens — so the real question is whether the newer model delivers enough improvement to justify switching workflows.
Quick answer: Opus 4.7 is a direct upgrade over Opus 4.6 with measurable gains in coding benchmarks, instruction following, vision resolution, and agentic reliability. Pricing stays the same, but token consumption may increase due to a new tokenizer and deeper thinking at higher effort levels.
Benchmark Gains: Opus 4.7 vs. Opus 4.6
Anthropic positions Opus 4.7 as a meaningful step forward in software engineering, and the benchmark numbers back that up across several evaluations. The model outperforms Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on most listed benchmarks, though it still trails Claude Mythos Preview — Anthropic's more powerful but limited-release model.
| Benchmark | Opus 4.6 | Opus 4.7 | Change |
|---|---|---|---|
| SWE-bench Verified | 80.8% | Higher (exact figure shown only in Anthropic's charts) | Improvement |
| Terminal-Bench 2.0 | 65.4% | Higher (charts only) | Improvement |
| CursorBench | 58% | 70% | +12 points |
| Finance Agent | 60.7% | State-of-the-art | Improvement |
| GDPval-AA (office tasks) | 1606 Elo | State-of-the-art | Improvement |
| Rakuten-SWE-Bench | Baseline | 3× more tasks resolved | Major lift |
| XBOW Visual Acuity | 54.5% | 98.5% | +44 points |
The CursorBench jump from 58% to 70% is one of the more concrete third-party numbers. Cursor's CEO called it "a meaningful jump in capabilities, particularly for its autonomy and more creative reasoning." Rakuten's internal SWE-Bench showed Opus 4.7 resolving three times more production tasks than its predecessor, with double-digit gains in both code quality and test quality scores.
Instruction Following Is Tighter — Sometimes Too Tight
One of the most notable behavioral shifts in Opus 4.7 is how literally it interprets instructions. Where Opus 4.6 would loosely interpret or skip parts of a prompt, Opus 4.7 takes directions at face value. Anthropic explicitly warns that prompts written for earlier models can produce unexpected results because the new model no longer glosses over vague or contradictory instructions.
This is a double-edged improvement. For structured agentic workflows — CI/CD pipelines, multi-step code refactors, automated testing — tighter instruction adherence means fewer surprises. For casual or conversational use, it can feel rigid. Several early-access testers flagged this as the single biggest adjustment when upgrading.
Vision Gets a Major Upgrade
Opus 4.7 accepts images up to 2,576 pixels on the long edge, roughly 3.75 megapixels. That's more than three times the resolution supported by previous Claude models. This isn't a parameter you toggle — images sent to the model are simply processed at higher fidelity automatically.
The practical impact shows up in tasks that depend on fine visual detail. XBOW's autonomous penetration testing saw its visual-acuity benchmark jump from 54.5% with Opus 4.6 to 98.5% with Opus 4.7. Solve Intelligence reported major improvements in reading chemical structures and interpreting complex technical diagrams for patent workflows.
The tradeoff is token cost. Higher-resolution images consume more tokens during processing. If your use case doesn't require the extra detail, you can downsample images before sending them to keep costs in check.
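One way to downsample consistently is to cap the longer side at the 2,576-pixel limit the article cites. The helper below is an illustrative sketch (not an official tool); it only computes target dimensions, which you'd then feed to whatever image library you use for the actual resize:

```python
def fit_to_long_edge(width: int, height: int, max_long_edge: int = 2576) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_long_edge.

    Returns the size unchanged if it already fits; aspect ratio is preserved
    up to integer rounding. The 2576 default is the long-edge limit Opus 4.7
    is reported to accept.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)

# A 4000x3000 photo would be scaled to 2576x1932 before sending.
target = fit_to_long_edge(4000, 3000)
```

Images that already fit pass through untouched, so the helper is safe to apply unconditionally in an upload path.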
New Effort Level and Token Consumption Changes
Opus 4.7 introduces a new xhigh ("extra high") effort level that sits between the existing high and max settings. In Claude Code, the default effort level has been raised to xhigh for all plans. Anthropic recommends starting with high or xhigh effort for coding and agentic use cases.
Two changes affect token usage and are worth planning for:
| Change | Impact | Mitigation |
|---|---|---|
| Updated tokenizer | The same input text now maps to roughly 1.0–1.35× as many tokens | Monitor token counts; adjust budgets accordingly |
| Deeper thinking at higher effort | More output tokens, especially on later turns in agentic sessions | Use the effort parameter, task budgets, or prompt for conciseness |
Anthropic says the net effect is favorable on their internal coding evaluation — token usage across all effort levels improved when measured against task completion — but real-world results will vary. Box reported a 56% reduction in model calls, 50% fewer tool calls, 24% faster responses, and 30% fewer AI Units consumed in their evaluations, suggesting the efficiency gains can be substantial in structured enterprise workflows.
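To plan for the tokenizer change, it helps to model worst-case spend. This sketch applies the article's $5/$25 per-million pricing and the reported 1.0–1.35× input-token multiplier; the function name and the decision to apply the multiplier only to input tokens (per the table above) are this example's assumptions:

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      tokenizer_multiplier: float = 1.0) -> float:
    """Estimate one request's cost at Opus pricing ($5/M input, $25/M output).

    tokenizer_multiplier inflates input tokens to account for the new
    tokenizer (reportedly ~1.0-1.35x as many tokens for the same text).
    Output-side growth from deeper thinking is not modeled here.
    """
    INPUT_PER_M = 5.00
    OUTPUT_PER_M = 25.00
    effective_input = input_tokens * tokenizer_multiplier
    return (effective_input / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# 100k input / 10k output at the worst-case 1.35x multiplier.
worst_case = estimate_cost_usd(100_000, 10_000, tokenizer_multiplier=1.35)
```

Running the same volumes at multipliers 1.0 and 1.35 brackets the budget range a migration could land in.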
Agentic Reliability: Fewer Tool Errors, Better Follow-Through
The most consistent theme from early-access testers is that Opus 4.7 finishes what it starts. Notion's AI lead reported a 14% improvement over Opus 4.6 at fewer tokens and a third of the tool errors. Factory's testing showed a 10–15% lift in task success with fewer tool errors and more reliable follow-through on validation steps. Genspark highlighted loop resistance as the most critical production improvement — Opus 4.7 achieves the highest quality-per-tool-call ratio they've measured.
Hex's CTO offered a useful efficiency framing: "low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6." If that holds broadly, teams can get Opus 4.6-level quality at lower compute cost by running Opus 4.7 at a reduced effort setting.
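The comparison in that quote can be made concrete as two request configurations. This is a hypothetical sketch only: the "effort" request field, its value names, and the claude-opus-4-6 model ID are assumptions assembled from the article (which names low, medium, high, xhigh, and max levels), not confirmed API parameters:

```python
def build_request(model: str, effort: str, prompt: str) -> dict:
    """Assemble an illustrative request body with an effort setting.

    The level ordering follows the article: xhigh sits between high and max.
    The "effort" field name itself is an assumption for this sketch.
    """
    levels = ("low", "medium", "high", "xhigh", "max")
    if effort not in levels:
        raise ValueError(f"effort must be one of {levels}")
    return {
        "model": model,
        "effort": effort,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }

# The two configurations Hex's CTO equates: if they produce comparable
# quality, the first one is the cheaper way to get it.
cheap_new = build_request("claude-opus-4-7", "low", "Refactor this module.")
old_equiv = build_request("claude-opus-4-6", "medium", "Refactor this module.")
```

A/B-ing exactly this pair on a representative task set is the cheapest way to test whether the claimed equivalence holds for your workload.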
Memory and Long-Context Work
Both models share a 1 million token context window and support up to 128,000 output tokens. Opus 4.7 doesn't expand these limits, but Anthropic says it's better at using file system-based memory. The model remembers important notes across long, multi-session work and applies them to new tasks with less up-front context needed.
Databricks reported 21% fewer errors than Opus 4.6 when working with source information on their OfficeQA Pro benchmark. For developers running extended agentic sessions — the kind that span hours of autonomous work — this improved document reasoning and memory persistence is where the upgrade is most felt.
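The "file system-based memory" pattern is simple to reason about from the agent side: the model writes durable notes to disk and reads them back in later sessions instead of carrying everything in context. The sketch below is an illustrative agent-side note store under assumed conventions (a JSON-lines file), not Anthropic's implementation:

```python
import json
import tempfile
from pathlib import Path

def append_note(memory_file: Path, note: str) -> None:
    """Persist one memory note as a JSON line; creates the file if needed."""
    with memory_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"note": note}) + "\n")

def load_notes(memory_file: Path) -> list[str]:
    """Read back all notes for a new session; [] if nothing was saved yet."""
    if not memory_file.exists():
        return []
    return [json.loads(line)["note"]
            for line in memory_file.read_text(encoding="utf-8").splitlines()
            if line.strip()]

# Session 1 records facts worth remembering; session 2 reloads them
# instead of re-deriving them from scratch.
memory = Path(tempfile.mkdtemp()) / "MEMORY.jsonl"
append_note(memory, "prefers pytest over unittest")
append_note(memory, "repo targets Python 3.12")
notes = load_notes(memory)
```

The payoff described in the article is exactly this shape: later sessions start with the distilled notes rather than the full transcript, so less up-front context is needed.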
Safety and Cybersecurity Safeguards
Opus 4.7 ships with new safeguards that automatically detect and block requests related to prohibited or high-risk cybersecurity uses. This is part of Anthropic's broader strategy around Project Glasswing, where the company is testing cyber-specific safety measures on less capable models before eventually applying them to Mythos-class releases.
The model's overall safety profile is similar to Opus 4.6, with low rates of deception, sycophancy, and cooperation with misuse. Anthropic notes modest improvements in honesty and resistance to prompt injection attacks, but acknowledges one "modestly weaker" area: the model is more prone to giving overly detailed harm-reduction advice on controlled substances. Mythos Preview remains the best-aligned model Anthropic has trained.
Security professionals who need Opus 4.7 for legitimate purposes like vulnerability research or penetration testing can apply through Anthropic's new Cyber Verification Program.
Claude Code Updates Shipping Alongside Opus 4.7
The model launch comes bundled with several Claude Code improvements:
| Feature | What It Does | Availability |
|---|---|---|
| /ultrareview command | Runs a dedicated review session that reads changes and flags bugs a careful reviewer would catch | Pro and Max users get three free ultrareviews |
| Auto mode | Claude makes decisions on your behalf during long tasks, reducing permission interruptions | Now available to Max users (previously Teams/Enterprise/API only) |
| Task budgets (public beta) | Guides Claude's token spend so it can prioritize work across longer runs | API users |
Opus 4.7 vs. Opus 4.6: Who Should Upgrade
Opus 4.7 is a direct replacement — same API structure, same pricing, same context window. The claude-opus-4-7 model ID works across the Claude API, Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry. Prompts written for Opus 4.6 generally work without modification, though the stricter instruction following means you should test before switching production traffic.
For developers doing complex, long-running agentic work — multi-file refactors, autonomous debugging, CI/CD automation — the upgrade is straightforward. The gains in tool-call accuracy, loop resistance, and follow-through directly reduce the kind of failures that waste compute and developer time. For teams primarily using Claude for conversational tasks or simple completions, the differences are less dramatic, and the potential increase in token consumption from the new tokenizer is worth monitoring.
Anthropic has been shipping Opus upgrades on a roughly two-month cadence — Opus 4.5 in November 2025, Opus 4.6 in February 2026, and now Opus 4.7 in April 2026. If that pattern holds, expect Sonnet 4.8 to follow within a few weeks, and Opus 4.8 sometime around June. The pace is accelerating, which means the window to get comfortable with each version keeps shrinking.