TurboQuant: 3-Bit KV Caches with Zero Accuracy Loss

Vadim Nicolai · Senior Software Engineer · 16 min read

Every token your LLM generates forces it to re-read its entire conversational history. That history -- the Key-Value (KV) cache -- is the single largest memory bottleneck during inference. A Llama-3.1-70B serving a 128K-token context in FP16 burns through roughly 40 GB of VRAM on the KV cache alone, leaving little headroom for weights on a single 80 GB H100. The standard remedies -- eviction (SnapKV, PyramidKV) and sparse attention -- trade accuracy for memory. They throw tokens away.
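
To see where that ~40 GB figure comes from, here is a back-of-the-envelope sketch. It assumes Llama-3.1-70B's published configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128); the `kv_cache_bytes` helper is illustrative, not code from the paper.

```python
# Back-of-the-envelope KV cache sizing.
# Assumed config (Llama-3.1-70B): 80 layers, 8 KV heads (GQA), head dim 128.
def kv_cache_bytes(n_tokens: int, bits_per_value: int,
                   n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128) -> int:
    """Total cache size: keys + values, across all layers and KV heads."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bits_per_value / 8
    return int(n_tokens * per_token)

ctx = 128 * 1024  # 128K-token context
print(f"FP16 : {kv_cache_bytes(ctx, 16) / 2**30:.1f} GiB")  # ~40.0 GiB
print(f"3-bit: {kv_cache_bytes(ctx, 3)  / 2**30:.1f} GiB")  # ~7.5 GiB payload
```

The 3-bit line counts the raw payload only; whatever per-block scales or metadata a concrete quantized format stores would come on top of it.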

TurboQuant, published at ICLR 2026 by Zandieh, Daliri, Hadian, and Mirrokni from Google Research, takes the opposite approach: keep every token, compress every value. At 3 bits per coordinate it delivers a 6x memory reduction. At 4 bits it delivers up to an 8x speedup in computing attention logits on H100 GPUs. The headline result: on LongBench with Llama-3.1-8B-Instruct, the 3.5-bit configuration scores 50.06 -- identical to the 16-bit baseline. No retraining. No fine-tuning. No calibration data.