IT-in-Git

Claude Opus 4.8: An Honest Developer's Take (It's Complicated)

3. 6. 2026

Anthropic dropped Claude Opus 4.8 and the reactions split predictably: half the community raving about benchmark supremacy, the other half staring at their API bills. Both camps are right. At $25 per million output tokens — and thinking tokens that count twice — you can burn through serious budget before you realize what's happening. One documented case: 62 million tokens in 24 hours, hitting a $2,500 monthly cap overnight. That's the model doing exactly what you asked. Whether that's a feature or a problem depends entirely on how you're using it.

Anthropic dropped Claude Opus 4.8 on May 28th and the reactions have been predictably split. Half the community is writing breathless posts about benchmark supremacy. The other half is posting screenshots of their API bills and quietly crying. Both camps are right, and that's what makes this model genuinely interesting to talk about.

I've spent some time with 4.8 across different workloads — long-context reasoning, agentic coding, and just regular back-and-forth — and here's what I actually think.

What it genuinely crushes

Let's start with the real stuff because 4.8 earns its flagship title in several areas.

Coding is the clear win. SWE-Bench Pro at 69.2% — that's not a rounding error over GPT-5.5's 58.6% or Gemini 3.1 Pro's 54.2%. On production-adjacent engineering tasks, that 10-point gap matters. And it's not just benchmark performance. Anthropic made an honest engineering decision here: 4.8 is roughly four times less likely than 4.7 to let code defects slide past without flagging them. If you've ever had a model confidently generate buggy code and then tell you it's fine, you know how genuinely annoying that is. This addresses a real pain point.

Mathematical reasoning took a huge jump. USAMO 2026 went from 69.3% to 96.7%. Twenty-seven points in one generation. The long-context retrieval story is similarly dramatic — GraphWalks F1 at 1M tokens went from 40.3% to 68.1%. These aren't incremental improvements; they suggest something architectural changed in how the model handles multi-step reasoning chains.

Dynamic Workflows is the real headline feature, even if Anthropic buried it under the benchmark tables. The model can now spin up and coordinate parallel subagents natively. There are reports of people running it on 750,000-line Rust repositories — the kind of repo-scale refactoring work that would've required serious custom orchestration before. For the right use case, this is legitimately powerful.

Drop-in upgrade. Change claude-opus-4-7 to claude-opus-4-8 in your API call. Done. No breaking changes. The fact that I have to mention this as a positive says something about the industry, but credit where it's due.

Now let's talk about the token problem

Here's where things get uncomfortable.

Opus 4.8 is verbose. Not in the way that Opus 4.7 was verbose (that model had comment-verbosity and tool-calling bloat issues that 4.8 actually fixed), but verbose in a different, more fundamental way. When the model is thinking, it is really thinking — and thinking tokens are expensive.

Here's the thing developers need to internalize: thinking tokens count as output tokens, then count again as input tokens when they come back in history. This compounds fast. One documented case involved 62 million tokens burned in 24 hours, which hit a $2,500 monthly budget overnight. That's not a bug; it's the model doing exactly what you asked it to do at high effort levels.

At $25 per million output tokens on the standard tier, you can build up serious costs before you realize what's happening. Fast Mode drops that to $50/$10 per million — a meaningful cut — but "Fast Mode is now practically essential rather than optional" is a weird place to land for a flagship model.

The other complaint I've seen repeatedly is what one reviewer called the "wicked loop of refactoring" — at high effort, the model becomes hypervigilant about edge cases and micro-optimizations and just... keeps going. It finds more things to fix. Then more. If you're running it in an agentic setup without a hard exit condition, you'll watch it rewrite the same three functions five times while your token counter climbs. The practical fix is to dial the effort down to Medium or Low when you need it to commit to a decision rather than keep polishing.

Honestly, I find the five-position effort slider (low / medium / high / xhigh / max) slightly annoying to reason about. It's a behavioral signal, not a hard cap, which means you have to develop intuition for what each level actually does in practice. That takes time. Meanwhile GPT-5.5 just... runs and stops.

The honesty improvement is real and underrated

I want to come back to something before the "when to use it" section, because I don't think it's getting enough attention.

Previous Claude models in the 4.x line had a habit of being too literal — they'd complete what you asked, avoid filling in obvious gaps, and sometimes let session quality degrade across long agentic runs without surfacing the degradation. 4.8 is noticeably better here. It'll tell you when something doesn't look right even if you didn't ask. For production code and long Claude Code sessions, that's not a small thing. Silent failures in agentic pipelines are a specific kind of miserable that you don't forget.

So when should you actually reach for Opus 4.8?

The benchmark people will tell you "always, it's the best." The cost-optimizers will tell you "never, use Sonnet." Both are wrong.

Use Opus 4.8 when:

  • You're doing repository-scale engineering work where correctness and reliability are the constraint, not cost
  • You need long-context reasoning across genuinely complex documents (1M token window, and it actually uses it well)
  • The math or scientific reasoning in your task is hard enough that lighter models keep failing
  • You're running agentic workflows where silent failures are expensive and you can tolerate the token cost
  • You want the model to catch its own mistakes without you having to ask

Don't bother with Opus 4.8 when:

  • You're doing simple Q&A, summarization, or light code generation — Sonnet does that fine at a fraction of the cost
  • You're calling it in a high-throughput API loop where latency matters
  • You don't have explicit effort-level control or a clear token budget guardrail in place — unconstrained, it'll burn your budget on edge cases you didn't care about
  • Your task doesn't actually require extended thinking (which is most tasks)

The real mental model is this: use Opus 4.8 when the cost of a wrong or incomplete answer exceeds the cost of inference by a wide margin. For everything else, run lighter and faster.

The bottom line

Opus 4.8 is a genuinely better model than 4.7 across almost every meaningful dimension. The coding improvements are real. The honesty calibration is real. The long-context performance jump is real. If you're building systems where these properties matter, it deserves serious consideration.

But the token story is the thing that'll bite you if you treat it like a drop-in upgrade with no other changes. You need budget guardrails. You need to think about effort levels. You need to know when you're paying for extended thinking and when you're just paying for verbose output that a cheaper model would've handled fine.

It's a powerful tool with a real cost profile. Treat it like one.

Sources:

Comments

No comments yet. Be the first to comment.

Leave a comment