
TurboQuant: 3-Bit KV Caches with Zero Accuracy Loss

· 16 min read
Vadim Nicolai
Senior Software Engineer

Every token your LLM generates forces it to reread its entire conversational history. That history -- the Key-Value cache -- is the single largest memory bottleneck during inference. A Llama-3.1-70B serving a 128K-token context in FP16 burns through ~40 GB of VRAM on KV cache alone, leaving almost nothing for weights on a single 80 GB H100. The standard remedies -- eviction (SnapKV, PyramidKV) and sparse attention -- trade accuracy for memory. They throw tokens away.
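The ~40 GB figure is easy to check. A minimal sketch of the arithmetic, assuming Llama-3.1-70B's published GQA geometry (80 layers, 8 KV heads, head dimension 128):

```python
# KV-cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes/element.
# Geometry assumes Llama-3.1-70B's GQA configuration.
layers, kv_heads, head_dim = 80, 8, 128
bytes_fp16 = 2

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # 327,680 B/token
cache_gib = 128 * 1024 * bytes_per_token / 2**30                 # 128K-token context

print(f"{bytes_per_token} bytes/token, {cache_gib:.0f} GiB at 128K tokens")
# → 327680 bytes/token, 40 GiB at 128K tokens
```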

TurboQuant, published at ICLR 2026 by Zandieh, Daliri, Hadian, and Mirrokni from Google Research, takes the opposite approach: keep every token, compress every value. At 3 bits per coordinate it delivers 6x memory reduction. At 4 bits it delivers up to 8x speedup in computing attention logits on H100 GPUs. The headline result: on LongBench with Llama-3.1-8B-Instruct, the 3.5-bit configuration scores 50.06 -- identical to the 16-bit baseline. No retraining. No fine-tuning. No calibration data.

The Research Trilogy

TurboQuant is the capstone of three papers by overlapping author groups at Google Research:


Each paper solves a distinct failure mode of naive quantization. QJL provides unbiased inner-product estimation via a 1-bit Johnson-Lindenstrauss transform. PolarQuant eliminates per-block normalization overhead through polar-coordinate decomposition. TurboQuant unifies both into a two-stage pipeline with formal distortion-rate guarantees.

How It Works: The Two-Stage Pipeline


Stage 1: Random Rotation + Lloyd-Max Scalar Quantization (b-1 bits)

The foundational insight is that applying a random orthogonal rotation to any vector causes each coordinate to follow a known, data-independent Beta distribution:

$$f_X(x) = \frac{\Gamma(d/2)}{\sqrt{\pi}\,\Gamma((d{-}1)/2)}\,(1 - x^2)^{(d-3)/2}$$

In high dimensions this converges to $\mathcal{N}(0, 1/d)$. Crucially, distinct coordinates become nearly independent after rotation. This collapses the intractable $d$-dimensional vector quantization problem into $d$ independent scalar quantization problems -- each with a known, universal distribution.
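This claim is easy to verify numerically. A minimal sketch (not the paper's kernel) that rotates a worst-case spiky vector with random orthogonal matrices and checks the coordinate spread:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
x = np.zeros(d)
x[0] = 1.0                      # worst case: all energy in one coordinate

coords = []
for _ in range(50):
    # Random orthogonal rotation via QR decomposition of a Gaussian matrix
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    coords.append(Q @ x)
coords = np.concatenate(coords)

# After rotation, every coordinate behaves like N(0, 1/d): std ~ 1/sqrt(d)
print(coords.std() * np.sqrt(d))   # close to 1.0
```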

Because the distribution is known analytically and is the same regardless of input data, optimal Lloyd-Max codebooks can be precomputed once offline:

| Bits | Centroids | MSE Distortion |
|---|---|---|
| 1 | $\pm\sqrt{2/(\pi d)}$ | 0.36 |
| 2 | $\pm 0.453/\sqrt{d},\ \pm 1.51/\sqrt{d}$ | 0.117 |
| 3 | 8 precomputed centroids | 0.03 |
| 4 | 16 precomputed centroids | 0.009 |
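The 1-bit row can be verified by hand: for coordinates distributed as $\mathcal{N}(0, 1/d)$, the two-centroid codebook $\pm\sqrt{2/(\pi d)}$ gives a relative MSE of $1 - 2/\pi \approx 0.36$. A quick numerical check (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 256, 50_000
# Post-rotation coordinates are approximately N(0, 1/d); sample them directly
x = rng.normal(scale=1 / np.sqrt(d), size=n)

# 1-bit Lloyd-Max codebook from the table above: centroids at +/- sqrt(2/(pi*d))
x_hat = np.sign(x) * np.sqrt(2 / (np.pi * d))

rel_mse = np.mean((x - x_hat) ** 2) / np.mean(x ** 2)
print(rel_mse)   # close to 1 - 2/pi ~ 0.363
```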

This is the PolarQuant contribution: by working in polar coordinates after random preconditioning, it eliminates the per-block zero-point and scale factors that other methods must store in full precision -- saving 1-2 bits of overhead per coordinate.

Stage 2: QJL Residual Correction (1 bit)

MSE-optimal scalar quantizers introduce bias in inner-product estimation. Attention scores are inner products ($q \cdot k$), so biased quantization means biased softmax weights -- and the bias compounds across layers.

QJL corrects this with a 1-bit residual:

  1. Compute the residual: $r = x - \hat{x}_{\text{MSE}}$
  2. Apply a random projection: $S \sim \mathcal{N}(0,1)^{m \times d}$
  3. Store $\text{sign}(S \cdot r)$ and $\gamma = \|r\|_2$

The reconstruction combines both stages:

$$\tilde{x} = \hat{x}_{\text{MSE}} + \gamma \cdot \frac{\sqrt{\pi/2}}{d} \cdot S^\top \cdot \text{sign}(S \cdot r)$$

The critical property: $\mathbb{E}[\langle y, \tilde{x}\rangle] = \langle y, x\rangle$ -- the inner-product estimator is unbiased. Combined budget: $(b{-}1)$ bits for the MSE quantizer + 1 bit for QJL = $b$ total bits per coordinate.
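The unbiasedness is straightforward to check empirically. A sketch under stated assumptions (the sizes and the crude 1-bit stage-1 quantizer are illustrative, and the correction is normalized by the number of projections $m$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 1024                 # head dimension, projection count (illustrative)

x = rng.normal(size=d)
x /= np.linalg.norm(x)           # a unit vector, as after random rotation
y = rng.normal(size=d)           # a query

# Stage 1 stand-in: crude 1-bit MSE quantizer for N(0, 1/d) coordinates
x_hat = np.sign(x) * np.sqrt(2 / (np.pi * d))
r = x - x_hat                    # residual handed to QJL

true_ip = y @ x
estimates = []
for _ in range(200):             # average over fresh projections S
    S = rng.normal(size=(m, d))
    signs = np.sign(S @ r)       # stored: m sign bits
    gamma = np.linalg.norm(r)    # stored: one scalar
    x_tilde = x_hat + gamma * np.sqrt(np.pi / 2) / m * (S.T @ signs)
    estimates.append(y @ x_tilde)

print(abs(np.mean(estimates) - true_ip))   # small: the estimator is unbiased
```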

Benchmark Results

LongBench (Llama-3.1-8B-Instruct)

| Method | Bits | SingleQA | MultiQA | Summarization | FewShot | Synthetic | Code | Average |
|---|---|---|---|---|---|---|---|---|
| Full Cache | 16 | 45.29 | 45.16 | 26.55 | 68.38 | 59.54 | 46.28 | 50.06 |
| TurboQuant | 3.5 | 45.01 | 45.31 | 26.00 | 68.63 | 59.95 | 46.17 | 50.06 |
| TurboQuant | 2.5 | 44.16 | 44.96 | 24.80 | 68.01 | 59.65 | 45.76 | 49.44 |
| PolarQuant | 3.9 | 45.18 | 44.48 | 26.23 | 68.25 | 60.07 | 45.24 | 49.78 |
| KIVI | 3 | 43.38 | 37.99 | 27.16 | 68.38 | 59.50 | 44.68 | 48.50 |

At 3.5 bits TurboQuant exactly matches full precision. At 2.5 bits the degradation is marginal (0.62 points). KIVI at 3 bits loses 1.56 points, with MultiQA dropping sharply (-7.17).

Needle-in-a-Haystack: Quantization vs Eviction


TurboQuant achieves perfect retrieval across all test lengths up to 104K context. The eviction-based methods (PyramidKV, SnapKV) drop catastrophically because they may evict exactly the token containing the needle.

Quantization Speed (Vector Search Application)

| Method | d=200 | d=1536 | d=3072 |
|---|---|---|---|
| Product Quantization | 37.04s | 239.75s | 494.42s |
| RaBitQ | 597.25s | 2267.59s | 3957.19s |
| TurboQuant | 0.0007s | 0.0013s | 0.0021s |

Five orders of magnitude faster. Because the codebooks are data-oblivious and precomputed, quantization reduces to a single matrix multiply + nearest-centroid lookup.
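That speed follows directly from the structure: with a data-oblivious codebook there is nothing to fit at quantization time. A sketch of the online path using the 2-bit codebook from the centroid table (shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 10_000

# Offline, once: a random rotation and the precomputed 2-bit codebook
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
centroids = np.array([-1.51, -0.453, 0.453, 1.51]) / np.sqrt(d)
boundaries = (centroids[:-1] + centroids[1:]) / 2    # midpoint decision thresholds

# Online: quantization is one matmul plus a vectorized nearest-centroid lookup
X = rng.normal(scale=1 / np.sqrt(d), size=(n, d))    # stand-in key/value vectors
Xr = X @ Q.T                                         # rotate
codes = np.searchsorted(boundaries, Xr)              # 2-bit codes in {0,1,2,3}
X_hat = centroids[codes]                             # dequantize

rel_mse = np.mean((Xr - X_hat) ** 2) / np.mean(Xr ** 2)
print(rel_mse)   # close to the 0.117 from the codebook table
```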

Theoretical Guarantees

TurboQuant's MSE distortion is within a factor of $\sqrt{3}\pi/2 \approx 2.7$ of the information-theoretic lower bound: no algorithm at the same bit budget can beat it by more than ~2.7x. At 1 bit, the gap tightens to ~1.45x.

The inner-product distortion bound:

$$D_{\text{prod}} \leq \frac{\sqrt{3}\,\pi^2\,\|y\|^2}{d} \cdot \frac{1}{4^b}$$

This exponential decay in $b$ is why jumping from 2 to 3 bits cuts distortion by 4x, making the 3-bit regime viable.

Caveats and Critical Analysis

The "zero loss" claim is strong but scoped:

  • GSM8K at 3-bit shows 1.4-point degradation (84.3% vs 85.7%), suggesting math/reasoning tasks are more sensitive to quantization noise in attention scores
  • Prefill runs in full BF16 -- only the decode phase uses quantized cache, which favors generation-light benchmarks
  • The "8x speedup" is measured against FP32 baselines, not production FP16
  • Evaluation is limited to models up to ~8B parameters
  • The ArXiv-to-camera-ready version changed some LongBench scores without explanation

The community implementation effort in llama.cpp (discussion #20969, 32+ contributors) has also surfaced practical findings:

  • Keys need more bits than values. Modern LLMs display 4-182x norm disparities between K and V. Optimal strategy: q8_0 for keys + turbo3 for values
  • MSE-only often outperforms QJL in practice. Multiple implementations report the QJL bias-correction stage actually degrades softmax ranking quality
  • On 70B models with 34 GB VRAM, FP16 supports ~109K tokens; TQ3 supports ~536K tokens -- a 5x capacity gain
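Those capacity numbers are consistent with back-of-envelope math. A sketch assuming Llama-3-70B-class GQA geometry (80 layers, 8 KV heads, head dimension 128 -- an assumption) and ignoring per-vector scalars and runtime overhead:

```python
# Tokens of KV cache that fit in a memory budget, at a given bits-per-coordinate.
# Geometry is an assumption (Llama-3-70B-class GQA); overheads are ignored.
layers, kv_heads, head_dim = 80, 8, 128

def kv_tokens(budget_gib, bits):
    bytes_per_token = 2 * layers * kv_heads * head_dim * bits / 8   # K and V
    return int(budget_gib * 2**30 / bytes_per_token)

print(kv_tokens(34, 16))   # FP16: ~111K tokens (community reports ~109K w/ overhead)
print(kv_tokens(34, 3))    # 3-bit: ~594K tokens (community reports ~536K)
```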

Practical Deployment


4-bit is the sweet spot for 3B+ models -- indistinguishable from FP16. 3-bit works for 8B+ but degrades on smaller architectures. Most production deployments keep the most recent 128-256 tokens in full FP16 and compress only older cache entries.
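A toy version of that recent-window policy might look like the following (purely illustrative: it uses a naive sign+scale codec in place of TurboQuant's two-stage pipeline, and `HybridKVCache` is a hypothetical name):

```python
import numpy as np

class HybridKVCache:
    """Keep the newest `window` entries in fp16; compress older ones.
    Illustrative only -- the codec here is a naive sign+scale stand-in."""

    def __init__(self, window=128):
        self.window = window
        self.recent = []                    # fp16 vectors, newest last
        self.codes, self.scales = [], []    # compressed older entries

    def append(self, v):
        self.recent.append(v.astype(np.float16))
        if len(self.recent) > self.window:
            old = self.recent.pop(0).astype(np.float32)
            self.codes.append(old < 0)               # 1 bit per coordinate
            self.scales.append(np.abs(old).mean())   # one fp32 scalar per vector

    def get_old(self, i):
        # Dequantize compressed entry i: sign * per-vector scale
        return np.where(self.codes[i], -self.scales[i], self.scales[i])

rng = np.random.default_rng(0)
cache = HybridKVCache(window=128)
for _ in range(200):
    cache.append(rng.normal(size=64))

print(len(cache.recent), len(cache.codes))  # → 128 72
```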

The compression benefit scales with context length. At fewer than 1K tokens the overhead of rotation matrices dominates. At 8K+ tokens you save 2+ GB. At 128K tokens you save tens of gigabytes -- enough to move from multi-GPU to single-GPU serving.

TurboQuant pairs cleanly with weight quantization (GPTQ, AWQ). A 70B model with 4-bit weights and 3-bit KV cache fits in 34 GB VRAM with 500K+ token capacity. That's a single A100 or a Mac Studio with 64 GB unified memory.

Industry Implications

TrendForce and Morgan Stanley note that TurboQuant does not reduce absolute memory demand -- it enables 4-8x longer context windows or larger batch sizes on existing hardware. The commercial effect is utilization optimization, not HBM demand reduction. Memory supply constraints persist.

The deeper impact is architectural. Data-oblivious quantization means the same CUDA kernel works across all models, all layers, without calibration. That makes it a systems-level primitive -- something that belongs in the serving stack (vLLM, TensorRT-LLM, llama.cpp) rather than requiring per-model tuning. The llama.cpp community already has working implementations across Metal, CUDA, and Vulkan backends.

This is probably the strongest "zero loss" claim in the current KV cache compression landscape. The formal distortion bounds, the data-oblivious design, and the consistent benchmark results across QA, code, and summarization make TurboQuant the method to beat at the 3-4 bit operating point.
