TurboQuant: 3-Bit KV Caches with Zero Accuracy Loss
Every token your LLM generates forces it to reread its entire conversational history. That history -- the Key-Value (KV) cache -- is the single largest memory bottleneck during inference. Llama-3.1-70B serving a 128K-token context in FP16 burns through ~40 GB of VRAM on KV cache alone, half of a single 80 GB H100 before any weights are loaded. The standard remedies -- eviction (SnapKV, PyramidKV) and sparse attention -- trade accuracy for memory. They throw tokens away.
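The ~40 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the published Llama-3.1-70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and counts only the raw cached values; the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Raw bytes needed to cache keys and values for one sequence."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * seq_len * bits_per_value / 8

# Assumed Llama-3.1-70B shape: 80 layers, 8 KV heads (GQA), head dimension 128.
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
SEQ_LEN = 128 * 1024  # 128K tokens

per_token_kib = kv_cache_bytes(**LLAMA_70B, seq_len=1, bits_per_value=16) / 2**10
fp16_gib = kv_cache_bytes(**LLAMA_70B, seq_len=SEQ_LEN, bits_per_value=16) / 2**30

print(f"{per_token_kib:.0f} KiB per token, {fp16_gib:.1f} GiB at 128K tokens")
# prints "320 KiB per token, 40.0 GiB at 128K tokens"
```

Every cached token costs 320 KiB in FP16, which is why the cache grows linearly with context length and why lowering `bits_per_value` is such a direct lever.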
TurboQuant, published at ICLR 2026 by Zandieh, Daliri, Hadian, and Mirrokni from Google Research, takes the opposite approach: keep every token, compress every value. At 3 bits per coordinate it delivers 6x memory reduction. At 4 bits it delivers up to 8x speedup in computing attention logits on H100 GPUs. The headline result: on LongBench with Llama-3.1-8B-Instruct, the 3.5-bit configuration scores 50.06 -- identical to the 16-bit baseline. No retraining. No fine-tuning. No calibration data.
The Research Trilogy
TurboQuant is the capstone of three papers by overlapping author groups at Google Research: QJL, PolarQuant, and TurboQuant itself.
Each paper solves a distinct failure mode of naive quantization. QJL provides unbiased inner-product estimation via a 1-bit Johnson-Lindenstrauss transform. PolarQuant eliminates per-block normalization overhead through polar-coordinate decomposition. TurboQuant unifies both into a two-stage pipeline with formal distortion-rate guarantees.
How It Works: The Two-Stage Pipeline
Stage 1: Random Rotation + Lloyd-Max Scalar Quantization (b-1 bits)
The foundational insight is that applying a random orthogonal rotation $\Pi$ to any unit-norm vector $x \in \mathbb{R}^d$ causes each coordinate of $\Pi x$ to follow a known, data-independent Beta distribution:
$$f(t) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d-1)/2)}\,(1 - t^2)^{(d-3)/2}, \qquad t \in [-1, 1]$$
In high dimensions this converges to $\mathcal{N}(0, 1/d)$. Crucially, distinct coordinates become nearly independent after rotation. This collapses the intractable $d$-dimensional vector quantization problem into $d$ independent scalar quantization problems -- each with a known, universal distribution.
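The data-independence claim is easy to check numerically. The sketch below is an illustration under stated assumptions, not the paper's implementation: it draws rotations via QR decomposition of a Gaussian matrix (one common way to sample a uniformly random orthogonal matrix), applies them to a deliberately "spiky" unit vector, and checks that the pooled coordinates behave like $\mathcal{N}(0, 1/d)$ samples. The helper name `random_rotation` is hypothetical.

```python
import numpy as np

def random_rotation(d, rng):
    """Sample a d x d orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))  # sign fix so the result is uniformly distributed

rng = np.random.default_rng(0)
d = 256

# Worst case for a scalar quantizer: all mass on a single coordinate.
x = np.zeros(d)
x[0] = 1.0

# Rotate the same vector many times and pool the resulting coordinates.
coords = np.concatenate([random_rotation(d, rng) @ x for _ in range(200)])

scaled_variance = coords.var() * d                                # ~1 if variance is 1/d
within_one_sigma = np.mean(np.abs(coords) < 1 / np.sqrt(d))      # ~0.683 for a Gaussian

print(f"d * variance = {scaled_variance:.3f}")
print(f"fraction within one sigma = {within_one_sigma:.3f}")
```

Swapping in any other unit vector for `x` should leave both statistics essentially unchanged, which is the property that lets one codebook serve every input.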
Because the distribution is known analytically and is the same regardless of input data, optimal Lloyd-Max codebooks can be precomputed once offline:
| Bits | Centroids | MSE Distortion |
|---|---|---|
| 1 | 2 centroids: $\pm\sqrt{2/(\pi d)}$ | 0.36 |
| 2 | 4 centroids: $\pm 0.453/\sqrt{d}$, $\pm 1.51/\sqrt{d}$ | 0.117 |
| 3 | 8 precomputed centroids | 0.03 |
| 4 | 16 precomputed centroids | 0.009 |
This is the PolarQuant contribution: by working in polar coordinates after random preconditioning, it eliminates the per-block zero-point and scale factors that other methods must store in full precision -- saving 1-2 bits of overhead per coordinate.
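The precomputed codebooks can be approximated in a few lines. The sketch below runs a sample-based Lloyd-Max iteration on standard Gaussian samples, which assumes the high-dimensional $\mathcal{N}(0, 1/d)$ limit rather than the exact Beta density and is not necessarily how the authors computed their tables. The function name `lloyd_max` and the sample count are illustrative choices.

```python
import numpy as np

def lloyd_max(samples, bits, iters=500):
    """Sample-based 1-D Lloyd-Max: returns (centroids, mean squared error)."""
    s = np.sort(samples)
    prefix = np.concatenate([[0.0], np.cumsum(s)])        # prefix sums for fast cell means
    k = 2 ** bits
    centroids = np.quantile(s, (np.arange(k) + 0.5) / k)  # quantile initialization
    for _ in range(iters):
        thresholds = (centroids[:-1] + centroids[1:]) / 2  # nearest-centroid boundaries
        edges = np.concatenate([[0], np.searchsorted(s, thresholds), [len(s)]])
        centroids = (prefix[edges[1:]] - prefix[edges[:-1]]) / np.diff(edges)
    thresholds = (centroids[:-1] + centroids[1:]) / 2
    nearest = centroids[np.searchsorted(thresholds, s)]
    return centroids, float(np.mean((s - nearest) ** 2))

rng = np.random.default_rng(0)
samples = rng.standard_normal(200_000)  # stand-in for sqrt(d) times a rotated coordinate

results = {bits: lloyd_max(samples, bits) for bits in (1, 2, 3, 4)}
for bits, (centroids, mse) in results.items():
    print(f"{bits}-bit: {len(centroids)} centroids, MSE ~ {mse:.4f}")
```

The centroids here are for unit-variance coordinates; dividing them by $\sqrt{d}$ gives the codebook for $\mathcal{N}(0, 1/d)$. Because a unit vector has $d$ coordinates each with variance $1/d$, the per-coordinate MSE printed above is also the total distortion for the whole vector, so the values should land close to the distortion column of the table.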
Stage 2: QJL Residual Correction (1 bit)
MSE-optimal scalar quantizers introduce bias in inner-product estimation. Attention scores are inner products ($q^\top k$), so biased quantization means biased softmax weights -- which compounds across layers.
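The bias is easiest to see at 1 bit. In the sketch below (an illustration, assuming the Gaussian-limit 1-bit codebook $\pm\sqrt{2/(\pi d)}$ and a QR-based random rotation), a unit vector is quantized and then scored against itself. The true inner product is 1, but the estimate shrinks to roughly $2/\pi \approx 0.64$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Random rotation via QR of a Gaussian matrix (an assumed construction).
q_mat, r_mat = np.linalg.qr(rng.standard_normal((d, d)))
rotation = q_mat * np.sign(np.diag(r_mat))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)

# 1-bit MSE-optimal quantization: keep only the sign of each rotated coordinate.
z = rotation @ x
z_hat = np.sign(z) * np.sqrt(2 / (np.pi * d))  # 1-bit Lloyd-Max centroids for N(0, 1/d)
x_hat = rotation.T @ z_hat                     # rotate back to the original basis

true_ip = x @ x      # exactly 1 for a unit vector
est_ip = x @ x_hat   # systematically too small

print(f"true = {true_ip:.3f}, estimated = {est_ip:.3f}, 2/pi = {2 / np.pi:.3f}")
```

The shrinkage is systematic rather than random noise, so averaging over more coordinates or more tokens does not remove it.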
QJL corrects this with a 1-bit residual:
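- Compute residual: $r = x - \hat{x}_{\text{mse}}$, where $\hat{x}_{\text{mse}}$ is the Stage 1 reconstruction
- Apply random projection: $\operatorname{sign}(S r)$, where $S \in \mathbb{R}^{d \times d}$ has i.i.d. $\mathcal{N}(0, 1)$ entries
- Store: $\operatorname{sign}(S r)$ and $\gamma = \|r\|_2$
The reconstruction combines both stages:
$$\hat{x} = \hat{x}_{\text{mse}} + \frac{\sqrt{\pi/2}}{d}\,\gamma\,S^\top \operatorname{sign}(S r)$$
The critical property: the inner-product estimate is unbiased for any query $q$:
$$\mathbb{E}\left[\langle q, \hat{x} \rangle\right] = \langle q, x \rangle$$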
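A small simulation makes the residual correction concrete. This is a sketch under stated assumptions rather than the authors' implementation: Stage 1 is a 1-bit sign quantizer with the Gaussian-limit codebook, Stage 2 projects the residual with a Gaussian matrix `S` and rescales by $\sqrt{\pi/2}/d$ times the residual norm (the standard scaling for a sign-based Johnson-Lindenstrauss estimator), and the estimate is averaged over many independent draws of `S` to expose its mean. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

def unit(v):
    return v / np.linalg.norm(v)

# Stage 1: random rotation + 1-bit Lloyd-Max quantizer for N(0, 1/d) coordinates.
q_mat, r_mat = np.linalg.qr(rng.standard_normal((d, d)))
rotation = q_mat * np.sign(np.diag(r_mat))
centroid = np.sqrt(2 / (np.pi * d))

def quantize_mse(x):
    return np.sign(rotation @ x)              # one sign bit per coordinate

def dequantize_mse(bits):
    return rotation.T @ (centroid * bits)

# Stage 2: 1-bit QJL on the residual (store signs plus one scalar norm).
def quantize_qjl(residual, S):
    return np.sign(S @ residual), np.linalg.norm(residual)

def dequantize_qjl(signs, gamma, S):
    return np.sqrt(np.pi / 2) / d * gamma * (S.T @ signs)

key = unit(rng.standard_normal(d))
query = unit(key + unit(rng.standard_normal(d)))  # a query correlated with the key

key_mse = dequantize_mse(quantize_mse(key))
residual = key - key_mse

estimates = []
for _ in range(2000):
    S = rng.standard_normal((d, d))           # fresh projection each trial
    signs, gamma = quantize_qjl(residual, S)
    estimates.append(query @ (key_mse + dequantize_qjl(signs, gamma, S)))

true_ip = query @ key
mse_only_ip = query @ key_mse
two_stage_ip = float(np.mean(estimates))

print(f"true {true_ip:.3f} | stage 1 only {mse_only_ip:.3f} | two-stage mean {two_stage_ip:.3f}")
```

The Stage 1 estimate alone sits well below the true inner product, while the two-stage mean should match it closely. Any single two-stage estimate is still noisy; the correction removes the bias, not the variance.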
