Claude Code Is Eating the Codebase

Anthropic held its second "Code with Claude" developer conference in San Francisco a few weeks ago, and if you weren't following along, it's worth going back through the announcements. Not because everything they shipped is revolutionary, but because the whole event had a specific energy: optimistic, a little breathless, and just slightly unnerving if you've been writing code for a living for more than a decade.

Let me explain what I mean.

The Shift Nobody Prepared For

The quote that stuck with me came from Boris Cherny, the head of Claude Code: "The default isn't 'I'm going to prompt Claude' — the default is now 'I'm going to have Claude prompt itself.'"

That's a real thing they said at a developer conference in 2026, and most people in the room apparently nodded along.

So what we're talking about isn't autocomplete or a smarter Copilot. We're talking about a system where you describe a problem, walk away, and come back to find the code written, tested, reviewed, and committed. The new "Routines" feature lets Claude Code run asynchronously in the background: you spawn a session, it handles a ticket, you do something else. They demoed it with a live bug fix that included idempotency logic and audit logging for a refund flow, all without a human in the loop.

That demo was impressive. And a little uncomfortable.

What Actually Shipped

At the Code with Claude event, Anthropic announced five things worth paying attention to:

Dreaming is the strangest one. It's a research preview that lets agents review their own past sessions, find patterns, and generate new memories. Self-improving agents. That's where we are now.

Outcomes lets you define success criteria upfront — basically a spec you hand the agent that it evaluates itself against. This is actually smart architecture for anyone who's tried to use Claude Code on anything non-trivial and hit the "it did the thing but not the right thing" problem.

Multi-agent orchestration is production-level now. You can coordinate fleets of Claude agents on the same task. Mercado Libre is already aiming for 90% autonomous coding by Q3. Shopify too. Not experiments — targets.

CI auto-fix hooks into pull requests and automatically corrects failing checks. Which sounds useful until you think about it for a minute.

Security Reviews landed as a standalone feature. Any engineer can now trigger a security audit of a codebase. Honestly useful — security reviews are one of those things teams always mean to do more rigorously.
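To make the Outcomes idea concrete, here is one way to picture success criteria defined upfront: a list of named checks that a finished result has to pass. This is a hypothetical sketch only. It does not show the actual format or API of the Outcomes feature, and `Criterion`, `evaluate`, and the fields in the example result are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One named success check (hypothetical structure, not a real API)."""
    name: str
    check: Callable[[dict], bool]


def evaluate(result: dict, criteria: list[Criterion]) -> list[str]:
    """Return the names of the criteria the result fails; empty means success."""
    return [c.name for c in criteria if not c.check(result)]


# Criteria written before the agent starts work (all fields are assumptions).
criteria = [
    Criterion("tests pass", lambda r: r.get("tests_failed", 1) == 0),
    Criterion("no new dependencies", lambda r: not r.get("new_dependencies")),
    Criterion("diff stays small", lambda r: r.get("lines_changed", 0) <= 200),
]

# A result that "did the thing" (tests pass) but not the right thing.
result = {"tests_failed": 0, "new_dependencies": ["leftpad"], "lines_changed": 42}
failures = evaluate(result, criteria)
print(failures)
```

The point of the shape is that a passing test suite alone does not count as done: the run above still fails because it pulled in a dependency nobody asked for.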

They also doubled the five-hour rate limit for Pro and Max users and announced a partnership with SpaceX's Colossus cluster for compute. API volume on the Anthropic platform is up 17x year-on-year.

The Numbers Are Hard to Argue With

On the benchmark side, Claude's position is strong: 72.7% on software engineering benchmarks, against 54.6% for GPT-4.1 and 63.8% for Gemini 2.5 Pro. For pure coding work on complex multi-module problems, Claude is genuinely ahead, and the gap isn't small: roughly 9 to 18 percentage points.

The long-context handling is real too. You can throw an entire repository at Claude Opus and ask meaningful questions about it. I've done this with a messy Django monolith we maintain and gotten genuinely useful architecture analysis back — not just "here's a summary" but "here's why your session handling is going to bite you."

The latest Opus 4.8 (released end of May) improves on 4.7 across coding, reasoning, and agentic task execution. The "advisor strategy" they're pushing now — using a small cheap model for most calls, escalating to Opus only when needed — brings frontier-quality results at a fraction of the cost. That's a real architectural pattern if you're building anything on the API.
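The escalation pattern described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `cheap_model`, `strong_model`, the `Answer` type, and the 0.8 confidence threshold are hypothetical stand-ins. A real system would replace the stubs with actual API calls and would probably need a better escalation signal than a self-reported confidence score.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; a hypothetical self-reported score
    model: str


ModelFn = Callable[[str], Answer]


def make_router(cheap: ModelFn, strong: ModelFn, threshold: float = 0.8) -> ModelFn:
    """Build a function that tries the cheap model first and escalates
    to the strong model only when the cheap answer looks uncertain."""

    def route(prompt: str) -> Answer:
        first = cheap(prompt)
        if first.confidence >= threshold:
            return first
        return strong(prompt)

    return route


# Stub models standing in for real API calls (assumptions for illustration).
def cheap_model(prompt: str) -> Answer:
    # Pretend the small model is only confident on short prompts.
    confidence = 0.9 if len(prompt) < 40 else 0.4
    return Answer(f"cheap: {prompt}", confidence, "small")


def strong_model(prompt: str) -> Answer:
    return Answer(f"strong: {prompt}", 0.95, "large")


route = make_router(cheap_model, strong_model)
print(route("rename this variable").model)
print(route("refactor the session handling across the whole Django monolith").model)
```

The cost saving depends entirely on how often the threshold lets the cheap answer through, so the threshold is the number to tune and measure first.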

The Part Where I Have Concerns

Here's where I'll be honest with you.

I genuinely don't understand why so many teams are rushing to maximize autonomous coding percentage before they've solved the review problem. "90% autonomous coding" sounds like a KPI, not an engineering goal. If nobody's reading the generated code because there's too much of it, you've created a different class of problem — not solved the original one.

There's already evidence this is happening. A Hacker News thread around the Code with Claude event had a comment that's hard to shake: "The only people I've heard saying that generated code is fine are those who don't read it." Technical managers at Anthropic itself are apparently overloaded managing the volume of auto-generated code from their own internal tools. That's either ironic or clarifying, depending on how you look at it.

The agentic-PRs research paper from earlier this year found that AI-labeled pull requests get rejected more often than human ones. And developers are self-reporting degraded coding skills from over-relying on these tools. These aren't hypotheticals — they're things happening right now, in the same six-month window as all the bullish adoption numbers.

The rate limit situation also deserves a mention. The $20/month plan hits limits fast in real agentic workflows. Serious daily use lands you on the $100/month Max plan, and power users running multiple agents push $200/month. That's $1,200–2,400/year per developer. For a team of ten, that's $12,000–24,000 a year, a non-trivial infrastructure cost, and not every company has negotiated an enterprise deal.
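For anyone who wants to plug in their own headcount, the arithmetic behind those figures is simple enough to script. The monthly prices are the ones quoted above; the function name and the team size of ten are just illustrative choices.

```python
# Monthly subscription tiers quoted in the text, in US dollars.
MONTHLY_TIERS_USD = [20, 100, 200]


def annual_cost(monthly_usd: int, developers: int = 1) -> int:
    """Yearly subscription cost for a given monthly price and team size."""
    return monthly_usd * 12 * developers


for monthly in MONTHLY_TIERS_USD:
    per_dev = annual_cost(monthly)
    team = annual_cost(monthly, developers=10)
    print(f"${monthly}/month: ${per_dev:,}/year per developer, ${team:,}/year for ten")
```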

Who's Winning Right Now

For developers right now, the honest breakdown is roughly this:

Claude is the clear choice for complex reasoning tasks, large codebases, long-context analysis, and serious agentic workflows. The benchmark lead is real and you feel it on hard problems.

GPT-4.1 is cheaper for high-volume API work and has great library knowledge for standard framework tasks. OpenAI's tooling ecosystem is still wider.

Gemini 2.5 Pro has the largest context window (2M tokens) and wins on multimodal and data-heavy tasks. Cheapest per token for input.

None of them is the right answer for everything. The teams doing the most interesting work are using all three contextually, not betting on one model for every task.

What's Actually Interesting Here

The "Code with Claude" event, and the broader arc of where Claude has gone in 2025–2026, represents something real: the tooling around LLMs for code is now mature enough that the bottleneck isn't model quality, it's process and culture. Can your team review AI-generated code fast enough to maintain quality? Do you have the discipline to push back when the agent takes a shortcut? Are your engineers getting better at their craft or are they becoming prompt editors?

The technology is largely solved. The human side isn't.

That's the thing nobody wants to talk about at developer conferences.
