TurboQuant: Reading the KV Cache Compression Breakthrough
Reading time: 10 min | How Google's 3-bit compression makes long-context LLMs cheaper, and what it tells us about the next 18 months of AI inference
There is a quiet problem inside every long conversation you have with a large language model, and it is the reason those conversations get expensive. It is called the KV cache, and at long context lengths it can consume more memory than the model itself. On March 24th, a team at Google Research published TurboQuant, which compresses that cache to three bits per value with no measurable accuracy loss and no fine-tuning. Six times less memory. Up to eight times faster attention on an H100. It is worth understanding properly, because KV cache compression is one of the highest-leverage problems in deployed AI right now, and TurboQuant is the clearest public signal yet that the field has turned a corner.
I run Kentino. Part of what that involves is reading papers like this one carefully so our customers — miners, builders, curious Europeans following the AI and crypto stack — do not have to. This piece is my attempt to explain what TurboQuant actually does, how it sits inside the broader 2025-2026 wave of KV cache compression research, and what a reasonable person should expect from the next eighteen months.
The KV cache problem, stated honestly
When a transformer generates text, each new token attends to every previous token. To avoid recomputing the key and value tensors for those earlier tokens on every single step, the model stores them. That store is the KV cache.
The cache grows linearly with context length. Double the conversation, double the cache. For a mid-size 8B model running a 128k-token context in FP16, the KV cache can reach roughly sixteen gigabytes for a single session, and tens of gigabytes once a few sessions run concurrently. The weights of that model are themselves about sixteen gigabytes. The cache dwarfs them.
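The arithmetic is simple enough to sketch. The layer, head, and dimension counts below are illustrative assumptions in the style of an 8B model with grouped-query attention, not figures from the paper:

```python
# Back-of-envelope KV cache sizing. Architecture numbers (layers,
# KV heads, head dim) are illustrative assumptions, not paper figures.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # 2x for keys and values, stored per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                      seq_len=128_000, bytes_per_value=2)
print(f"FP16, one 128k session:  {fp16 / 2**30:.1f} GiB")      # 15.6 GiB
print(f"FP16, four sessions:     {4 * fp16 / 2**30:.1f} GiB")  # 62.5 GiB
print(f"3-bit, one 128k session: {fp16 * 3 / 16 / 2**30:.1f} GiB")
```

Note that the formula is linear in `seq_len` and in concurrent sessions, which is exactly why long contexts and batching are where the pain concentrates.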
Three practical consequences follow.
First, long-context inference is memory-bound before it is compute-bound. You run out of VRAM long before you run out of FLOPs.
Second, serving cost scales badly. Every concurrent user needs their own cache. A GPU that could otherwise batch fifty short conversations might handle five long ones.
Third, on-device and edge inference stays out of reach for the models that would actually be useful there, because the cache, not the weights, is what refuses to fit.
Compressing the KV cache well — meaning aggressively, cheaply, and without hurting output quality — is therefore not a minor optimization. It changes which workloads are viable and which are not. That is the problem TurboQuant addresses.
What TurboQuant actually does
TurboQuant is a two-stage algorithm. Both stages are training-free and data-oblivious, which means no fine-tuning, no calibration dataset, no per-model tuning. You apply it and it works. That matters more than the compression ratio, honestly, because it is what lets the method drop into an existing inference stack without friction.
Stage one: PolarQuant
The first stage is PolarQuant, a separate paper by the same group (Zandieh, Mirrokni et al., AISTATS 2026). The idea is structural rather than statistical.
Quantizing high-dimensional vectors in Cartesian coordinates is awkward. The natural move — normalize to the unit sphere, then quantize the direction — turns out to be expensive, because computing the norm of every vector is the bottleneck you were trying to escape. Earlier methods paid that cost and still lost accuracy at low bit widths.
PolarQuant does two things to avoid the trap. It applies a random rotation first, which, somewhat counterintuitively, makes the geometry of the vector distribution more predictable and tractable. Then it converts to polar coordinates — a radius for magnitude, an angle for direction — and maps those onto a circular grid that can be quantized without the normalization step. The result is a clean, low-bit representation of each vector that preserves its essential geometry.
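As a toy illustration of the rotate-then-go-polar idea, here is a minimal sketch. It is not the paper's actual quantizer: the bit widths, the pairing of coordinates into 2D points, and the clipping radius are arbitrary choices of mine, made only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # QR decomposition of a Gaussian matrix yields a random orthogonal
    # matrix; rotating by it makes coordinates behave isotropically.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

def polar_quantize(x, r_bits=3, theta_bits=3, r_max=4.0):
    # Treat consecutive coordinate pairs as 2D points and quantize each
    # in polar form: a few bits for the radius, a few for the angle.
    pts = x.reshape(-1, 2)
    r = np.hypot(pts[:, 0], pts[:, 1])
    theta = np.arctan2(pts[:, 1], pts[:, 0])  # in [-pi, pi]
    r_q = np.clip(np.round(r / r_max * (2**r_bits - 1)), 0, 2**r_bits - 1)
    t_q = np.round((theta + np.pi) / (2 * np.pi) * (2**theta_bits - 1))
    return r_q.astype(np.uint8), t_q.astype(np.uint8)

def polar_dequantize(r_q, t_q, r_bits=3, theta_bits=3, r_max=4.0):
    r = r_q / (2**r_bits - 1) * r_max
    theta = t_q / (2**theta_bits - 1) * 2 * np.pi - np.pi
    pts = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return pts.reshape(-1)

d = 128
R = random_rotation(d)
x = rng.standard_normal(d)
x_hat = polar_dequantize(*polar_quantize(R @ x))
err = np.linalg.norm(R @ x - x_hat) / np.linalg.norm(x)
```

At these bit widths the sketch leaves a visible residual error, which is precisely what stage two exists to clean up.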
Stage two: QJL
PolarQuant alone leaves residual error. Stage two, Quantized Johnson-Lindenstrauss (QJL), fixes it with one extra bit per value.
The Johnson-Lindenstrauss lemma is a classical result: a random linear map can project high-dimensional vectors into a much lower-dimensional space while approximately preserving pairwise distances. QJL takes the corresponding transform further by keeping only the sign bit of each projected coordinate — plus one, minus one, nothing else. No storage overhead beyond the bit itself.
What QJL delivers, mathematically, is an unbiased estimator of attention scores. It corrects the residual from PolarQuant without reintroducing the bias that naive low-bit schemes suffer from. That is the trick. One bit of sign, carefully chosen, is enough to clean up the stage-one error.
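To see why one sign bit can give an unbiased score, here is a small numerical sketch with names and construction of my own. For Gaussian projections, E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||, so rescaling by the key norm recovers the inner product in expectation. Note the actual QJL applies this machinery to the stage-one residual, not to raw key vectors as below.

```python
import numpy as np

rng = np.random.default_rng(1)

def qjl_encode(k, S):
    # Store only the sign of each Gaussian projection of the key:
    # one bit per projected coordinate.
    return np.sign(S @ k).astype(np.int8)

def qjl_score(q, k_bits, k_norm, S):
    # Unbiased estimator of <q, k>: for jointly Gaussian projections,
    # E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||, so scaling
    # by ||k|| * sqrt(pi/2) recovers the inner product in expectation.
    m = S.shape[0]
    return k_norm * np.sqrt(np.pi / 2) / m * float(k_bits @ (S @ q))

d, m = 64, 20_000  # m is large only to show the estimator concentrates
S = rng.standard_normal((m, d))
q, k = rng.standard_normal(d), rng.standard_normal(d)

est = qjl_score(q, qjl_encode(k, S), np.linalg.norm(k), S)
true = float(q @ k)
```

In production the projection dimension is tiny compared with this demo; the point here is only that the estimate converges to the true attention score with no systematic bias.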
The numbers
Combined, the two stages land at three bits per value, six times smaller than the FP16 baseline. On an NVIDIA H100, attention logit computation runs up to eight times faster at 4-bit versus 32-bit. Google tested Gemma, Mistral, and Llama-3.1-8B-Instruct across LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. Accuracy was preserved across all five long-context benchmarks. A vector-search side-test on GloVe-200 showed superior 1@k recall against PQ and RaBitQ baselines as well, which suggests the method generalizes beyond KV caches specifically.
| Metric | Value |
|---|---|
| Bits per cached value | 3 bits |
| KV cache memory reduction | 6× |
| H100 attention speedup (4-bit vs 32-bit) | up to 8× |
| Fine-tuning required | None |
| Calibration data required | None |
| Measured accuracy loss | Zero across LongBench, NIAH, ZeroSCROLLS, RULER, L-Eval |
| Models tested | Gemma, Mistral, Llama-3.1-8B-Instruct |
The full writeup is on the Google Research blog. TurboQuant will be presented at ICLR 2026 in Rio de Janeiro.
The broader wave
TurboQuant is not alone. It is the most prominent recent entry in a fast-moving research area, and reading it without context overstates its novelty. Several other methods from late 2025 and early 2026 attack the same bottleneck from different angles.
| Method | Venue | Approach | Headline result |
|---|---|---|---|
| TurboQuant | ICLR 2026 | PolarQuant + QJL, online inference | 3 bits, 6× memory, up to 8× attention speedup, zero accuracy loss |
| KVTC (NVIDIA) | ICLR 2026 | Transform coding — PCA + adaptive quantization + entropy coding | Up to 20× compression for offline cache storage and reuse |
| ChunkKV | OpenReview, Sept 2025 | Semantic-chunk compression unit | Up to +8.7% precision at the same compression ratio |
| PM-KVQ | 2025 | Progressive mixed-precision for reasoning models | 2.73–5.18× throughput vs FP16, +8% on reasoning benchmarks |
| KVPress (NVIDIA) | Open framework | Benchmarking and deployment harness | Lets practitioners test these methods at scale |
Each targets a different niche. KVTC is for offline reuse — storing a cache from one conversation and loading it into another, where you can afford heavier encoding work in exchange for much higher compression. ChunkKV is for cases where you need to compress aggressively but preserve semantic meaning, which matters for tasks where losing a token hurts more than losing a digit of precision. PM-KVQ is tuned for the long chain-of-thought workloads that reasoning models produce. KVPress is the plumbing that lets the rest of us compare all of them honestly.
TurboQuant's distinctive contribution is the combination of training-free operation, online inference suitability, and a provably unbiased estimator. It is the one most likely to land in production frameworks first, precisely because it asks for nothing from the model operator.
What this unlocks
Stepping back from the paper and thinking about where this goes: the practical effects are easier to name than to size.
Long-context inference gets materially cheaper. If your KV cache is six times smaller, you can batch more users on the same GPU, or serve longer contexts on the same budget, or both. Anyone running an inference service feels this in their margins within a quarter of integrating it.
Edge deployment becomes viable for classes of models that were previously out of reach. An 8B model with long context on a workstation GPU, or a 3B model on a laptop, shifts from "barely possible" to "routine" when the cache shrinks by this factor. On-prem deployment for companies that cannot send data to cloud APIs — legal, medical, industrial telemetry — gets a similar lift.
The hardware story follows directly, and this is where it stops being abstract. Compression like TurboQuant does not change which GPUs exist; it changes which workloads fit — and right now the workloads people actually want to run on-prem are the Chinese open-weights frontier models that have quietly taken the SOTA seat through Q1 2026.
The current lineup is worth naming explicitly, because this is what customers ask us about. Kimi K2.5 from Moonshot AI — 1T total parameters, 32B active, MoE, 256K context, MIT license — released January 27th and leads code and math benchmarks among open weights. GLM-5 from Z.ai — 744B total / 40B active, 204K context, MIT-licensed — currently top of open-weights Intelligence Index and SWE-bench Verified. MiniMax M2.5 — 229B total / 10B active, 200K context — released February 12th, aggressively priced, 80%+ SWE-bench. Qwen3-Coder-Next from Alibaba — 80B total / 3B active, 256K context native, extendable to 1M with YaRN — plus the broader Qwen3 family from dense 0.8B–27B through the 397B-A17B MoE. All open weights. All shippable today.
We build machines at Kentino specifically for this workload, so let me be concrete about the math. Our flagship inference server is a 4× NVIDIA RTX 4090 build — 96 GB of pooled VRAM, AMD EPYC 7542 on an ASRock Rack ROMED8-2T, 256 GB of DDR4-2666 ECC RDIMM, 2 TB NVMe, dual 2 kW PSUs, in a 24U rack. Above that we build 4× RTX 5090 and 8× RTX 5090 configurations (128 GB and 256 GB pooled VRAM) and datacenter-grade 4× L40 / L40S (192 GB pooled ECC) for enterprise-class sustained load and 24/7 production serving.
What TurboQuant changes in this picture is the KV cache term. Modern MoE models already use compressed attention (MLA-style latent attention in Kimi, GQA in Qwen3), so their KV cache per token is smaller than older Llama-class numbers to start with. Apply TurboQuant on top and you get another ~6×. The practical effect is that the context window a given box can actually serve — as opposed to advertise — jumps meaningfully. The weights did not move. The bottleneck did.
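To make the "bottleneck moved" point concrete, here is a toy capacity calculation. Every number in it is a hypothetical assumption of mine (a ~40 GB quantized weight footprint on a 96 GB box, a GQA-style cache layout), not a measurement of the builds above:

```python
# Rough capacity planning: how many tokens of KV cache fit in the VRAM
# left after loading weights. All figures are illustrative assumptions.

def max_context_tokens(vram_gib, weights_gib, kv_bytes_per_token,
                       kv_compression=1.0):
    free_bytes = (vram_gib - weights_gib) * 2**30
    return int(free_bytes * kv_compression / kv_bytes_per_token)

# Assumed GQA-style cache: 2 (K and V) * 48 layers * 8 KV heads
# * 128 head dim * 2 bytes (FP16) per token.
KV_FP16 = 2 * 48 * 8 * 128 * 2

baseline = max_context_tokens(96, 40, KV_FP16)
turbo = max_context_tokens(96, 40, KV_FP16, kv_compression=6.0)
print(baseline, turbo)  # ~306k vs ~1.8M tokens of servable cache
```

The weights term is untouched; only the per-token cache cost changes, and the servable context scales by the full compression factor.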
| Kentino server build | Pooled VRAM | Model that fits comfortably | With TurboQuant KV compression |
|---|---|---|---|
| 4× RTX 4090 (AMD EPYC 7542, 256 GB ECC) | 96 GB | Qwen3-Coder-Next 80B total (FP8), Qwen3 dense 27B (FP16) | Qwen3-Coder-Next @ 256K context native single user, or 80B @ 128K for ~3-4 concurrent users |
| 4× RTX 5090 | 128 GB | Qwen3-Coder-Next with headroom, Qwen3 32B (FP16), MoE 100B-class (INT4) | Qwen3-Coder-Next @ 1M context via YaRN, or 80B @ 256K concurrent |
| 8× RTX 5090 | 256 GB | MiniMax M2.5 (FP8, ~230 GB), Qwen3 397B-A17B (INT4), GLM-5 (INT4) | MiniMax M2.5 @ full 200K context production serving, or Qwen3 397B @ 128K concurrent |
| 4× L40 / L40S | 192 GB ECC | MiniMax M2.5 (INT4), Qwen3-Coder-Next production 24/7 | Enterprise-grade serving with ECC at long context, sustained load |
Two honest caveats. First, Kimi K2.5 and GLM-5 in full FP8 (1T and 744B total weights respectively) still exceed what these boxes hold — for those you are looking at a cluster or accepting aggressive INT4 quantization. Second, exact token limits depend on batch size, the model's specific attention configuration, and framework (vLLM, SGLang, TensorRT-LLM all implement low-bit KV differently). But the direction is the one that matters: a 4× RTX 4090 box that a year ago made sense for 13B dense models is now the right answer for Qwen3-Coder-Next at its full 256K context. A 4× RTX 5090 handles the 80B active-class coding model comfortably with room for concurrent users. An 8× RTX 5090 or 4× L40S opens up MiniMax M2.5 and the larger Qwen3 MoE variants at production scale. The hardware did not get bigger; the workload got smaller.
And any inference workload that runs continuously on operational telemetry benefits proportionally. Mining-fleet optimization is one real example: operators like OneMiners run AI-driven efficiency systems across thousands of ASICs, and the inference layer underneath those systems scales directly with how much context each model can hold cheaply. This research class does not transform such workloads overnight, but it shifts the curve of what is affordable.
The honest forecast is incremental. A 6× memory reduction on one bottleneck does not produce a new world. It produces a slightly cheaper, slightly longer-context, slightly more deployable version of the world we already have. That is still a large amount of money and engineering saved, aggregated across the industry.
What to watch in 2026-2027
A few specific things, in rough order of likelihood.
Framework integration. vLLM, TensorRT-LLM, and SGLang will pick up TurboQuant-style methods within months, probably via KVPress as the benchmarking layer. The open-source Triton implementation the Google team published makes this almost mechanical.
Hardware-level support. NVIDIA has signaled interest in low-bit attention primitives through both KVTC and KVPress. Expect Blackwell-generation tooling to treat 3-4 bit KV formats as first-class citizens rather than experimental ones.
Consolidation of methods. The five approaches above solve overlapping problems. A unified stack — PolarQuant-style geometric compression for online attention, KVTC-style entropy coding for offline storage, ChunkKV-style semantic grouping as a front-end — is the likely endpoint. No single paper gets there; the stack forms over a year of integration work.
Real cost reductions in serving. By late 2026, serving costs for long-context inference should be visibly lower than they are today, with most of the gain coming from compression rather than new silicon. That is the cleanest way to predict this line of work will have succeeded.
Close
TurboQuant is a real advance on a real bottleneck, and it arrived inside a research wave that is solving the problem from several angles at once. The headline numbers are impressive on their own terms — three bits, six times, eight times — but the more important property is that it requires nothing of the model operator. Training-free, data-oblivious methods are what get deployed.
If you run long-context inference at any scale, it is worth tracking. If you do not, it is still worth understanding, because the economics of the models you will eventually use are being set, quietly, by papers like this one.