Shopify Rewrote Their GraphQL Engine and Got 15x Faster. The Fix Was Embarrassingly Obvious in Hindsight.

Most GraphQL performance work ends up being about the same stuff: N+1 queries, missing dataloaders, over-fetching. Shopify just published the story of a different problem — one that sits lower in the stack and affects every query you run, not just the poorly written ones.

Their new execution engine, GraphQL Cardinal, replaces depth-first traversal with a breadth-first model. The numbers from production: 15x faster field-level execution, 90% less memory, 4+ seconds off P50 for their largest queries. That's not a micro-optimization. That's "we were doing this fundamentally wrong the whole time."

What depth-first actually costs you

Here's the thing about recursive depth-first execution that nobody talks about: it doesn't amortize anything.

Imagine a query for 100 products, each with 100 variants, each with 5 fields. With depth-first traversal, the variant fields alone account for 100 × 100 × 5 = 50,000 field calls. Each one is its own resolver invocation. Each one spawns its own promise. You end up with ~50,000 promise allocations and ~100,000 promise callbacks just for the traversal plumbing, before any actual I/O happens.

Shopify's napkin math for a simpler version: 5 fields across 1,000 objects.

Depth-first: 5,000 field resolver calls, 5,000 promise allocations
Breadth-first: 5 field resolver calls, 5 promise allocations

That's not an optimization. That's fixing the algorithm.
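
The napkin math is easy to reproduce with a toy counter. This is a rough sketch, not Cardinal's code: the field names, the dict-based objects, and both `execute_*` functions are made up for illustration, and a "resolver call" here is just a Python function call.

```python
from collections import Counter

FIELDS = ["id", "title", "handle", "vendor", "status"]  # 5 hypothetical fields
objects = [{f: f"{f}-{i}" for f in FIELDS} for i in range(1000)]

calls = Counter()

def resolve_one(obj, field):
    """Per-object resolver: invoked once for every (object, field) pair."""
    calls["depth_first"] += 1
    return obj[field]

def resolve_many(objs, field):
    """Batch resolver: invoked once per field with the full object set."""
    calls["breadth_first"] += 1
    return [obj[field] for obj in objs]

def execute_depth_first(objs, fields):
    # Finish every field of one object before touching the next object.
    return [{f: resolve_one(o, f) for f in fields} for o in objs]

def execute_breadth_first(objs, fields):
    # Finish one field for all objects before touching the next field.
    results = [{} for _ in objs]
    for f in fields:
        for result, value in zip(results, resolve_many(objs, f)):
            result[f] = value
    return results

depth_result = execute_depth_first(objects, FIELDS)
breadth_result = execute_breadth_first(objects, FIELDS)
print(calls["depth_first"], calls["breadth_first"])  # 5000 5
```

Both strategies produce the same output. Only the number of resolver invocations changes.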

The memory picture is just as bad. Every object's subtree gets processed independently. There's no sharing, no batching, no awareness that you're about to do the same thing 999 more times. CPU cache locality? Forget it. Each recursive call jumps around memory chasing different object graphs.

They found another gem too: empty field-level tracing hooks — just the wrappers, not even doing anything useful — caused ~10% degradation across 1K fields purely from accumulated wrapper overhead. And lazy fields going through a graphql-batch dataloader workflow ran 2.5x slower than equivalent non-lazy fields, even with no I/O at all. The overhead was entirely in the promise machinery.

What breadth-first changes

Instead of "resolve everything under object A, then everything under object B," Cardinal does "resolve field X for all objects at depth N, then move to depth N+1."

Resolvers get called once per field with a complete set of objects. Not once per object. You get a batch of 1,000 product objects, resolve their title field for all of them at once, get the 1,000 titles back, move on. Promise allocations drop from O(objects × fields) to O(fields).

The engine is driven by an enqueue model rather than recursion, which also flattens stack traces and reduces the memory footprint of the call stack itself.

Execution happens in three steps: build an execution tree from the static AST, run a bottom-up planning pass where fields register preloads and hints to influence parent resolver strategies, then execute layer by layer. Result hashes are keyed in place and passed by reference, so no large data structures get copied around.
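
To make the layer-by-layer part concrete, here is a minimal sketch of a queue-driven executor. It is a guess at the general shape, not Shopify's implementation: the nested-dict selection format and the `execute` function are hypothetical, every non-leaf field is assumed to be a list, and the tree-building and planning steps are skipped entirely.

```python
from collections import deque

def execute(selection, roots):
    """Run a selection breadth-first using a work queue instead of recursion."""
    results = [{} for _ in roots]
    # Each entry: (selection for this layer, source objects, result dicts to fill)
    queue = deque([(selection, roots, results)])
    batches = 0
    while queue:
        layer_selection, sources, targets = queue.popleft()
        for field, child_selection in layer_selection.items():
            batches += 1
            # One batched "resolver" call per field for the whole layer.
            values = [source[field] for source in sources]
            if child_selection is None:  # leaf field
                for target, value in zip(targets, values):
                    target[field] = value
                continue
            # List field: flatten children across all parents into one layer.
            child_sources, child_targets = [], []
            for target, children in zip(targets, values):
                target[field] = [{} for _ in children]
                child_sources.extend(children)
                # Same dict objects, so the next layer fills them in place.
                child_targets.extend(target[field])
            queue.append((child_selection, child_sources, child_targets))
    return results, batches

products = [
    {"title": f"P{i}",
     "variants": [{"sku": f"P{i}-V{j}", "price": j} for j in range(3)]}
    for i in range(4)
]
selection = {"title": None, "variants": {"sku": None, "price": None}}
results, batches = execute(selection, products)
print(batches)  # 4 batched calls: title, variants, sku, price
```

Four products and twelve variants still cost four batched calls, and the call stack never gets deeper than the loop itself.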

It's worth flagging the error handling trade-off. Depth-first execution can bubble exceptions surgically through a subtree. Breadth-first never executes a subtree in isolation: it runs the whole level to completion, inlines errors into the response tree, then does a depth traversal pass at the end to handle nullability propagation. Less precise, but Shopify noted that under 1% of their API traffic results in non-validation errors, so the trade-off is acceptable. Your mileage may vary depending on how error-heavy your schema is.
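
Here is one way the "inline errors, then propagate nulls" idea could look. Treat it as a sketch under loud assumptions: `FieldError`, `propagate`, and keying non-null fields by bare field name are all invented for illustration, and list items are assumed nullable. The only rule borrowed from the GraphQL spec is that a null in a non-null field nulls out its parent.

```python
class FieldError:
    """Marker left in the response tree where a resolver failed."""
    def __init__(self, message):
        self.message = message

def propagate(value, non_null, errors, path=()):
    """Final depth pass: swap inlined errors for None and bubble nulls
    upward through non-null fields. Stops at the first nulled field
    of an object, so later sibling errors are not collected."""
    if isinstance(value, FieldError):
        errors.append({"message": value.message, "path": list(path)})
        return None
    if isinstance(value, list):
        return [propagate(item, non_null, errors, path + (i,))
                for i, item in enumerate(value)]
    if isinstance(value, dict):
        out = {}
        for key, child in value.items():
            resolved = propagate(child, non_null, errors, path + (key,))
            if resolved is None and key in non_null:
                return None  # null in a non-null field nulls the parent object
            out[key] = resolved
        return out
    return value

# Response tree as the layer-by-layer run left it, errors inlined.
tree = {"products": [
    {"title": "A", "price": 10},
    {"title": FieldError("title failed"), "price": 20},
    {"title": "C", "price": FieldError("price failed")},
]}
non_null = {"title"}  # hypothetical schema: title is non-null, price is nullable
errors = []
data = propagate(tree, non_null, errors)
print(data)
```

The second product disappears entirely because its non-null `title` failed, while the third keeps its `title` and just gets a null `price`.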

The migration was three phases, not a big bang

This is the part that's actually useful if you're thinking "okay cool, but I can't rewrite 10,000 resolvers this quarter."

Shopify ran three distinct migration phases:

Phase 1: They built a GraphQL Ruby interpreter layer that let Cardinal puppet the legacy resolver interface. Existing resolvers kept working unchanged. You opt in at the execution engine level, not at the field level.
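
The post doesn't spell out how the interpreter layer works, so the following is only a guess at the general idea: wrap a per-object resolver so a batch-oriented engine can call it like any other batch resolver. `adapt_legacy` and both price resolvers are hypothetical names.

```python
def adapt_legacy(per_object_resolver):
    """Wrap a legacy (object, args) resolver in a batch (objects, args) interface."""
    def batch_resolver(objects, args):
        # Still one legacy call per object: compatible, but no batching win yet.
        return [per_object_resolver(obj, args) for obj in objects]
    return batch_resolver

def legacy_display_price(variant, args):
    """Legacy style: receives a single object."""
    return f"{variant['price']} {args['currency']}"

def native_display_price(variants, args):
    """Native style: receives the whole object set at once."""
    currency = args["currency"]  # shared work is done once per batch
    return [f"{v['price']} {currency}" for v in variants]

# The engine only ever sees the batch interface.
resolvers = {
    "legacyPrice": adapt_legacy(legacy_display_price),
    "nativePrice": native_display_price,
}
variants = [{"price": 10}, {"price": 25}]
output = {name: resolver(variants, {"currency": "CAD"})
          for name, resolver in resolvers.items()}
print(output)
```

An adapter like this buys compatibility, not speed. The per-object calls are still there, which is why converting resolvers to native implementations remains a separate phase.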

Phase 2: Tracer migration. Field-level tracers previously fired per object — now they fire once per field selection across all objects. This alone killed a massive chunk of the instrumentation overhead.

Phase 3: Incrementally converting tens of thousands of legacy resolvers to native breadth-first implementations. They built a shadow verifier that ran migrated fields in parallel with legacy ones to confirm output equivalence, plus a benchmark suite to track before/after performance. They also used AI, specifically Claude skills from Anthropic, to help accelerate the translation work.
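
A shadow verifier is simple enough to sketch. This is not Shopify's tool: `shadow_verify` and the title resolvers are hypothetical, and the sketch runs the two implementations one after the other rather than truly in parallel. The key property is that the legacy result is always what gets served, and the migrated result is only compared.

```python
def shadow_verify(field_name, legacy, migrated, objects, args):
    """Serve the legacy result; run the migrated resolver only to compare."""
    expected = [legacy(obj, args) for obj in objects]
    try:
        actual = migrated(objects, args)
    except Exception as exc:  # a broken migration must never break the response
        return expected, [{"field": field_name, "error": repr(exc)}]
    if len(actual) != len(expected):
        return expected, [{"field": field_name, "error": "length mismatch"}]
    mismatches = [
        {"field": field_name, "index": i, "expected": e, "actual": a}
        for i, (e, a) in enumerate(zip(expected, actual))
        if e != a
    ]
    return expected, mismatches

def legacy_title(product, args):
    return product["title"].strip().upper()

def migrated_title_buggy(products, args):
    return [p["title"].upper() for p in products]  # translation mistake: no strip()

def migrated_title_fixed(products, args):
    return [p["title"].strip().upper() for p in products]

products = [{"title": " Shirt "}, {"title": "Hat"}]
served, mismatches = shadow_verify("title", legacy_title, migrated_title_buggy, products, {})
_, clean = shadow_verify("title", legacy_title, migrated_title_fixed, products, {})
print(mismatches)
```

The buggy translation is caught on the first product, and users never see it because the legacy output is what ships.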

The notable quote from the post: "All regressions can be attributed to mistakes in translation — we have yet to find a non-error scenario where breadth-based execution is fundamentally worse off." Strong signal that the model is sound, not just fast on benchmarks.

Why nobody fixed this sooner

Honestly, I don't fully understand why the GraphQL spec community didn't push on this harder. The spec explicitly says that conformance requirements expressed as algorithms "can be fulfilled by an implementation of this specification in any way as long as the perceived result is equivalent." Depth-first isn't mandated. It's just what the reference implementations did, and everyone copied it.

Airbnb experimented with batched resolvers. WunderGraph played with breadth batching at the federated subgraph level. There's even a graphql-breadth-exec project floating around. But nobody with Shopify's scale sat down and built it as a native engine behavior with a production migration story until now.

The graphql-js community apparently got a look at Shopify's benchmarks. Whether anything changes in the JavaScript ecosystem remains to be seen — but the data is hard to ignore.

What this means if you're running GraphQL at scale

If you're running a Ruby GraphQL backend, watch the graphql-ruby gem — Shopify collaborated with the maintainer on the new execution module and it's presumably heading upstream.

For everyone else, the takeaway is the mental model shift: stop thinking of GraphQL execution as "resolve a tree" and start thinking of it as "resolve layers of a graph in batches." Even if you can't rewrite your engine, this changes how you structure resolvers. Group operations by depth. Avoid resolvers that do per-object work when the same operation could be batched across the full object set. That's what dataloaders were always supposed to give you — Cardinal just makes it the default instead of an opt-in afterthought.

The other lesson is boring but important: measure your middleware overhead. Empty tracing hooks eating 10% across 1K fields isn't something that shows up in an APM dashboard trace. It shows up when someone sits down and profiles at the micro level. If you haven't done that with your GraphQL setup, you probably should.
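
If you want a starting point for that kind of micro-profiling, something like this works. It is a toy: the resolver, the `with_empty_hook` wrapper, and the 1,000 dict objects are stand-ins, and the numbers you get from Python will not match Shopify's Ruby measurements. The point is the method: time the same work with and without the wrapper.

```python
import timeit

def resolve_title(obj):
    return obj["title"]

def with_empty_hook(resolver):
    """Wrap a resolver the way a tracing hook would, but do nothing."""
    def wrapped(obj):
        # A real tracer would start and stop a span here; left empty on purpose.
        return resolver(obj)
    return wrapped

objects = [{"title": f"t{i}"} for i in range(1000)]
traced = with_empty_hook(resolve_title)

def run(resolver):
    return [resolver(o) for o in objects]

# Take the best of several repeats to reduce noise from the rest of the machine.
bare_s = min(timeit.repeat(lambda: run(resolve_title), number=200, repeat=5))
traced_s = min(timeit.repeat(lambda: run(traced), number=200, repeat=5))
print(f"bare: {bare_s:.4f}s  wrapped: {traced_s:.4f}s  "
      f"overhead: {traced_s / bare_s - 1:.0%}")
```

Run the same comparison against your real middleware stack, one layer at a time, and you'll know which wrappers are worth keeping.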

Sources: