IT-in-Git

turbovec: The Vector Index That Skips the Boring Parts and Still Beats FAISS

10. 6. 2026

turbovec is a Rust-backed vector index with Python bindings that implements Google's TurboQuant algorithm — compressing 31 GB of float32 embeddings down to 4 GB while beating FAISS on search speed. No training, no warmup, no drama. Here's why it matters for anyone building RAG pipelines.

FAISS is fine. It's been fine for years. But if you've ever had to rebuild a PQ index from scratch because your embedding distribution drifted, or you've watched your RAG pipeline eat 30+ GB of RAM for a 10M document corpus, you know "fine" isn't the same as "good."

turbovec is a different take on the problem. It's a vector index built on top of Google Research's TurboQuant algorithm — published at ICLR 2026 — written in Rust with Python bindings, open source, and already sitting at 10k stars on GitHub after a few weeks. That's not hype. That's engineers actually solving a real pain point.

What TurboQuant Actually Does

The core algorithm is from a paper: arXiv:2504.19874, "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." The big idea is surprisingly elegant.

Most quantization approaches optimize for mean-squared error (MSE), which is fine for reconstruction. But when you're doing vector search, you care about inner products, not reconstruction, and MSE-optimal quantizers introduce systematic bias into your inner product estimates. TurboQuant addresses both: per the paper, an MSE-optimal quantizer is followed by a 1-bit Quantized JL (QJL) transform on the residual, which makes the inner product estimate unbiased, and none of it needs training data.
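To make that bias concrete, here's a toy sketch, not TurboQuant itself. It assumes coordinates that look roughly like N(0, 1/d), which random unit vectors do, and applies the textbook 1-bit MSE-optimal quantizer: keep the sign, replace the magnitude with its mean. All names are mine. If the estimate were unbiased, the fitted slope would be about 1; for this quantizer it lands near 2/pi.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 20_000

def random_unit_vectors(count, dim):
    """Rows drawn uniformly from the unit sphere."""
    v = rng.standard_normal((count, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

x = random_unit_vectors(n, d)   # "database" vectors
q = random_unit_vectors(n, d)   # one query per database vector

# 1-bit MSE-optimal reconstruction for a coordinate ~ N(0, 1/d):
# keep the sign, use the mean magnitude sqrt(2 / (pi * d)) as the level.
x_hat = np.sign(x) * np.sqrt(2.0 / (np.pi * d))

true_ip = np.einsum("ij,ij->i", q, x)      # exact inner products
est_ip = np.einsum("ij,ij->i", q, x_hat)   # inner products against the quantized vectors

# Least-squares slope of estimate vs. truth; ~1.0 would mean "no systematic shrinkage".
slope = float(true_ip @ est_ip / (true_ip @ true_ip))
print(f"slope = {slope:.3f}, 2/pi = {2 / np.pi:.3f}")
```

The reconstruction is as good as one bit allows in the MSE sense, yet every inner product comes out shrunk by a constant factor. That shrinkage is the bias the paragraph above is talking about.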

Here's the trick: before quantizing, you apply a random rotation to your vectors. That rotation mixes up all the coordinates so that each one follows a known Beta distribution, and in high dimensions the coordinates are nearly independent of each other. Once you're in that space you can apply optimal scalar quantization per coordinate. The math works out cleanly because the coordinates behave predictably. Then you store the norm separately and renormalize at query time.

The result is a codebook that's completely data-oblivious. It doesn't matter what your vectors look like; after the rotation, their distribution is what TurboQuant expects. That's the part that makes online ingestion possible — there's no training phase because the codebook doesn't depend on your data.
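A minimal NumPy sketch of the rotate-then-quantize idea, under loud assumptions: the 2-bit levels are the classical Lloyd-Max values for a standard normal, scaled by 1/sqrt(d), as a Gaussian stand-in for the Beta marginal. They are not the codebook from the paper or from turbovec. The toy data and helper names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 2000

# Random rotation: QR of a Gaussian matrix, column signs fixed via R's diagonal.
qmat, rmat = np.linalg.qr(rng.standard_normal((d, d)))
rotation = qmat * np.sign(np.diag(rmat))

# Fixed 2-bit codebook, chosen before any data is seen (Gaussian approximation).
levels = np.array([-1.510, -0.4528, 0.4528, 1.510]) / np.sqrt(d)

def encode(unit_vectors):
    """Index (0..3) of the nearest level for every coordinate."""
    return np.abs(unit_vectors[..., None] - levels).argmin(axis=-1).astype(np.uint8)

def decode(codes):
    """Map level indices back to level values."""
    return levels[codes]

def mean_sq_error(vectors):
    """Average squared reconstruction error per vector."""
    recon = decode(encode(vectors))
    return float(np.mean(np.sum((vectors - recon) ** 2, axis=1)))

# Deliberately skewed toy data: all the energy sits in 4 of the 64 coordinates.
raw = np.zeros((n, d))
raw[:, :4] = rng.standard_normal((n, 4)) * np.array([10.0, 5.0, 2.0, 1.0])
norms = np.linalg.norm(raw, axis=1, keepdims=True)  # stored separately from the codes
unit = raw / norms

mse_raw = mean_sq_error(unit)                   # same codebook, no rotation
mse_rotated = mean_sq_error(unit @ rotation.T)  # same codebook, after rotation

# Approximate reconstruction: decode, undo the rotation, restore the norm.
approx = norms * (decode(encode(unit @ rotation.T)) @ rotation)
print(f"per-vector MSE without rotation: {mse_raw:.3f}, with rotation: {mse_rotated:.3f}")
```

The codebook never looks at the data. The rotation is what makes a fixed set of levels a reasonable fit even for badly skewed input, which is the property that removes the training step.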

They claim that, with the Lloyd-Max codebook they use, distortion stays within a factor of about 2.7 of the information-theoretic (Shannon) lower bound. For a lossy quantizer, that's genuinely impressive.

The Numbers

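Let's talk compression first. A 1536-dimensional float32 vector takes 6,144 bytes. With 2-bit TurboQuant encoding, that's 384 bytes, a 16x reduction. For a 10M document corpus at that dimension, you go from ~61 GB down to ~3.8 GB. (The ~31 GB to ~4 GB headline figure is what the same math gives for 768-dimensional vectors at 4 bits per coordinate, an 8x reduction.) On a machine with 16 GB of RAM, that's the difference between possible and impossible.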
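The arithmetic is easy to check for your own dimension and bit width. This sketch counts only the packed codes: it ignores the per-vector norm and any index overhead, and uses 1 GB = 10^9 bytes. The function names are mine, not part of turbovec.

```python
def packed_bytes(dim: int, bits_per_coord: int) -> int:
    """Bytes needed to store one vector at the given bits per coordinate."""
    return (dim * bits_per_coord + 7) // 8  # round up to whole bytes

def corpus_gb(n_vectors: int, dim: int, bits_per_coord: int) -> float:
    """Total size of the packed codes for a corpus, in GB (10**9 bytes)."""
    return n_vectors * packed_bytes(dim, bits_per_coord) / 1e9

n = 10_000_000
for dim, bits in [(1536, 32), (1536, 2), (768, 32), (768, 4)]:
    print(f"dim={dim:5d} bits={bits:2d} "
          f"bytes/vector={packed_bytes(dim, bits):5d} "
          f"corpus={corpus_gb(n, dim, bits):6.2f} GB")
```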

Search speed is where things get interesting. On ARM (Apple M3 Max), turbovec's hand-written NEON kernels beat FAISS IndexPQFastScan by 12–20% across configurations. On x86 with AVX-512BW support (Sapphire Rapids and newer Xeons), it matches or edges ahead. The implementation uses runtime CPU feature detection — it probes what your processor actually supports at startup and picks the best code path.

Recall is competitive. At d=3072 with 2-bit quantization, TurboQuant actually beats FAISS PQ (0.912 vs 0.903 R@1). At d=1536 2-bit, FAISS is slightly ahead. Both converge to essentially perfect recall by k=4–8, so for most RAG use cases the difference is academic.
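If you want to reproduce recall numbers on your own data, here's a small sketch of the metric. I'm assuming the figures above use the common "1-recall@k" definition: the fraction of queries whose true nearest neighbor shows up somewhere in the top k results. The function name and the toy arrays are mine.

```python
import numpy as np

def recall_1_at_k(true_top1, retrieved, k):
    """Fraction of queries whose exact nearest neighbor is in the first k results.

    true_top1: (n_queries,) ids of the exact nearest neighbors
    retrieved: (n_queries, >= k) ids returned by the index, best first
    """
    hits = (retrieved[:, :k] == true_top1[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 queries, 3 results each.
true_top1 = np.array([7, 2, 9, 4])
retrieved = np.array([
    [7, 1, 3],   # true neighbor at rank 1
    [5, 2, 8],   # true neighbor at rank 2
    [0, 1, 9],   # true neighbor at rank 3
    [6, 3, 5],   # true neighbor missing
])

for k in (1, 2, 3):
    print(k, recall_1_at_k(true_top1, retrieved, k))
```

In practice you'd get `true_top1` from a brute-force float32 search over the same corpus and `retrieved` from the index under test.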

Using It in Python

The API is dead simple. No surprise there — the whole point is to remove friction.

from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=1536, bit_width=4)  # 4 bits per coordinate
index.add(vectors)  # vectors: numpy array, shape (n, dim), float32
scores, indices = index.search(query, k=10)  # top 10 matches, as positional indices
index.write("my_index.tv")  # save the index to disk

That's it. No training call. No index.train(training_vectors) before you can use it. You just add and search.

If you need to track your own document IDs instead of positional indices, there's IdMapIndex:

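import numpy as np
from turbovec import IdMapIndex

index = IdMapIndex(dim=1536, bit_width=4)
# vectors: float32 array with one row per ID below, so shape (3, 1536) here
index.add_with_ids(vectors, np.array([1001, 1002, 1003], dtype=np.uint64))
scores, ids = index.search(query, k=10)  # ids are your own IDs, not positions
index.remove(1002)  # O(1) deletion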

O(1) deletion is something FAISS flat indexes don't give you cleanly. With turbovec you can remove vectors at will, which matters a lot if you're building a system where documents get updated or pulled.
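How do you get O(1) deletion with external IDs at all? One common pattern is swap-remove: move the last slot into the hole and patch a dict that maps IDs to slots. I don't know that turbovec does it this way. The class below is a hypothetical illustration of the technique, not its implementation.

```python
class SwapRemoveStore:
    """Toy id -> slot store with constant-time add and remove."""

    def __init__(self):
        self.ids = []       # slot -> external id
        self.codes = []     # slot -> payload (e.g. a quantized vector)
        self.slot_of = {}   # external id -> slot

    def add(self, ext_id, code):
        if ext_id in self.slot_of:
            raise KeyError(f"id {ext_id} already present")
        self.slot_of[ext_id] = len(self.ids)
        self.ids.append(ext_id)
        self.codes.append(code)

    def remove(self, ext_id):
        slot = self.slot_of.pop(ext_id)  # raises KeyError for unknown ids
        last_id, last_code = self.ids.pop(), self.codes.pop()
        if slot < len(self.ids):         # the removed slot was not the last one
            self.ids[slot], self.codes[slot] = last_id, last_code
            self.slot_of[last_id] = slot

store = SwapRemoveStore()
for ext_id in (1001, 1002, 1003):
    store.add(ext_id, code=f"code-{ext_id}")
store.remove(1002)
print(store.ids, store.slot_of)
```

The trade-off is that slot order stops matching insertion order, which is exactly why an ID map is needed on top.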

Filtered search is also built in — you can pass an allowlist or bitmask to restrict results at query time, and the implementation is smart about it: blocks where no allowed slots exist skip the lookup entirely.
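The block-skipping idea can be sketched in plain NumPy. This is a brute-force stand-in under my own assumptions: a block size of 32 slots, a boolean mask as the allowlist, and inner-product scoring. It says nothing about turbovec's real kernels or its filter API.

```python
import numpy as np

BLOCK = 32  # hypothetical block size, chosen only for illustration

def filtered_search(vectors, query, allowed, k):
    """Top-k inner-product search restricted to slots where allowed[i] is True.

    Returns (scores, ids, blocks_scored); blocks with no allowed slot are never scored.
    """
    kept_scores, kept_ids, blocks_scored = [], [], 0
    for start in range(0, vectors.shape[0], BLOCK):
        block_mask = allowed[start:start + BLOCK]
        if not block_mask.any():
            continue  # nothing allowed in this block: skip the scoring work
        blocks_scored += 1
        scores = vectors[start:start + BLOCK] @ query
        kept_scores.append(scores[block_mask])
        kept_ids.append(np.nonzero(block_mask)[0] + start)
    if not kept_ids:
        return np.empty(0), np.empty(0, dtype=np.int64), 0
    scores, ids = np.concatenate(kept_scores), np.concatenate(kept_ids)
    order = np.argsort(-scores)[:k]  # best scores first
    return scores[order], ids[order], blocks_scored

rng = np.random.default_rng(0)
vectors = rng.standard_normal((128, 8))
query = rng.standard_normal(8)
allowed = np.zeros(128, dtype=bool)
allowed[[3, 70, 71]] = True  # only three slots may be returned

scores, ids, blocks_scored = filtered_search(vectors, query, allowed, k=2)
print(ids, blocks_scored)
```

With a selective filter, most blocks are skipped outright, so query cost tracks the number of blocks that contain allowed slots instead of the full index size.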

LangChain / LlamaIndex Drop-In

For the integration story, turbovec ships optional extras:

pip install turbovec[langchain]
pip install turbovec[llama-index]
pip install turbovec[haystack]
pip install turbovec[agno]

I genuinely don't understand why more libraries don't do this — shipping integrations as optional extras in the same package, properly namespaced, is exactly the right call. You don't have to hunt for a third-party adapter that might be three months out of date.

Who Should Care

If you're running a RAG pipeline at any scale and you're not on hosted vector search, this is worth evaluating. The zero-training requirement is the really compelling part for me — not because training FAISS PQ is hard, but because it's the thing that bites you when your embedding model changes or your data distribution shifts. With turbovec, that's just not a problem you have anymore.

Honestly, the memory story alone would be enough for most homelabs. Running a 10M document knowledge base on a single beefy machine used to mean either a lot of RAM or a lot of compromises. At 4 GB for that entire corpus, you can fit this alongside a locally-running LLM without fighting the memory allocator.

The Rust core means you're not going to hit Python GIL issues under concurrent load, and the Python bindings via PyO3 are clean enough that it feels like a native Python library. The repo is MIT licensed. The paper is public. Everything checks out.

It's still early — the project appeared in early 2026 — so I'd run it against your specific workload before committing. But the benchmarks are solid and the algorithm has a real paper behind it, not just vibes and marketing copy.

Worth 30 minutes of your time to benchmark against what you're already running.

GitHub: https://github.com/RyanCodrai/turbovec
