Google's TurboQuant: 6x Memory Compression Without Accuracy Loss

Google Research has introduced TurboQuant, a compression algorithm that reduces LLM key-value (KV) cache memory by up to 6x and delivers up to 8x faster inference with near-zero accuracy loss. The result could help make million-token context windows economically practical.

The Memory Wall Problem

Large language models face a fundamental constraint: the key-value (KV) cache that stores intermediate attention state scales with both model size and context length. For long contexts—1M tokens and beyond—this cache can consume gigabytes of GPU memory, making inference prohibitively expensive.
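To make the scale concrete, here is a back-of-the-envelope KV cache calculation. The model dimensions below (layer count, KV heads, head dimension, fp16 storage) are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope KV cache size for a hypothetical transformer.
# All model dimensions here are illustrative assumptions.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, head, position, and channel.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Example: a 32-layer model with 8 KV heads of dim 128, fp16, 1M-token context.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 2**30:.1f} GiB")  # ~122 GiB for a single sequence
```

At this scale a single long-context request exceeds the memory of any current single GPU, which is why even a 6x reduction changes what hardware can serve the workload.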

Traditional quantization approaches require offline calibration on representative datasets. They're slow to set up and fragile when input distributions change. Google's approach is different: TurboQuant is "data-oblivious," requiring no dataset-specific tuning.

How TurboQuant Works

The algorithm applies a random rotation to input vectors, inducing a concentrated Beta distribution on each coordinate regardless of the original data. In high dimensions, these coordinates become nearly independent and identically distributed.
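The rotation step can be sketched as follows. The paper uses fast structured rotations; this sketch substitutes a plain random orthogonal matrix (from a QR decomposition) for clarity, and the dimension is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

# Sample a random orthogonal matrix via QR decomposition of a Gaussian matrix.
# (The paper uses fast structured rotations; a dense one illustrates the idea.)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

x = rng.exponential(size=d)   # heavily skewed input vector
x /= np.linalg.norm(x)        # work on the unit sphere
y = Q @ x                     # rotated vector

# After rotation, the coordinates of a unit vector concentrate near zero
# (std ~ 1/sqrt(d)) regardless of the input distribution.
print(np.linalg.norm(y))      # norm is preserved (~1.0)
print(y.std())                # ~1/32 for d = 1024
```

Rotating a skewed, spiky input like the exponential vector above yields coordinates that all look alike statistically, which is what lets a single scalar quantizer serve every coordinate.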

This near-independence allows TurboQuant to solve a continuous one-dimensional k-means problem per coordinate, finding the optimal scalar quantizer for a given bit-width. The resulting distortion is provably within a small constant factor (≈ 2.7) of Shannon's distortion-rate lower bound.
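The per-coordinate optimization can be sketched with Lloyd's algorithm for a scalar quantizer. This is a simplified stand-in for the continuous k-means the paper analyzes; the Gaussian samples and 2-bit width are illustrative:

```python
import numpy as np

def lloyd_scalar_quantizer(samples, bits, iters=50):
    """Fit 2**bits codewords minimizing mean squared error on 1-D samples."""
    # Initialize codewords at evenly spaced quantiles of the data.
    levels = np.quantile(samples, np.linspace(0, 1, 2**bits + 2)[1:-1])
    for _ in range(iters):
        # Assign each sample to its nearest codeword ...
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        # ... then move each codeword to the mean of its cell.
        for k in range(len(levels)):
            if np.any(idx == k):
                levels[k] = samples[idx == k].mean()
    return np.sort(levels)

rng = np.random.default_rng(1)
coords = rng.normal(size=100_000)   # stand-in for rotated coordinates
levels = lloyd_scalar_quantizer(coords, bits=2)
quantized = levels[np.abs(coords[:, None] - levels[None, :]).argmin(axis=1)]
mse = np.mean((coords - quantized) ** 2)
print(levels, mse)  # 4 symmetric levels; MSE near the 2-bit Lloyd-Max optimum
```

Because the rotated coordinates share one known distribution, this fit can be done once analytically rather than per dataset, which is the source of the "no calibration" property.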

The Numbers

The headline figures: up to 6x KV cache compression, up to 8x speedup, and distortion within roughly 2.7x of the information-theoretic optimum, all achieved with no calibration data.

Unbiased Inner Products

A key innovation: TurboQuant's two-stage approach eliminates the bias that plagues MSE-optimized quantizers in transformer attention. By applying a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual, it produces provably unbiased inner product estimates (here Q denotes quantization and Q⁻¹ dequantization):

𝔼[⟨y, Q⁻¹(Q(x))⟩] = ⟨y, x⟩

This matters because transformer attention is fundamentally about computing inner products between query and key vectors. Biased estimates accumulate errors through the attention mechanism, degrading output quality.
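The unbiasedness property itself is easy to demonstrate. The paper achieves it with the QJL residual stage; the simpler mechanism below, stochastic rounding, exhibits the same guarantee, that the dequantized vector is unbiased in expectation, so inner products against it are too. Grid spacing and vector sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_round(x, step=0.25):
    """Round x to a grid of spacing `step`, unbiased in expectation."""
    scaled = x / step
    lo = np.floor(scaled)
    # Round up with probability equal to the fractional part.
    return step * (lo + (rng.random(x.shape) < (scaled - lo)))

x = rng.normal(size=256)
y = rng.normal(size=256)
exact = x @ y

# Average the quantized inner product over many independent roundings:
# the estimate converges to the exact value, i.e. E[<y, deq(Q(x))>] = <y, x>.
estimates = [y @ stochastic_round(x) for _ in range(2000)]
print(exact, np.mean(estimates))  # the two values agree closely
```

A deterministic nearest-level quantizer would instead leave a systematic offset in the average, and it is exactly that kind of offset that compounds across attention layers.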

Why This Matters

Context window economics are brutal. Anthropic's 1M-token context tier commands premium pricing in part because the KV cache doesn't fit in GPU memory for typical inference setups. Compression that is both fast and near-lossless changes that equation.

TurboQuant's data-oblivious design means it works instantly—no k-means training on your dataset, no calibration step, no sensitivity to input distribution shifts. The indexing time drops from hundreds of seconds to milliseconds.

For vector databases, the implications are similarly stark. Product Quantization requires expensive codebook training. TurboQuant eliminates that step while improving recall.

Availability

Google has published the research paper (ICLR 2026), and an open-source PyTorch implementation, turboquant-kv, is available on PyPI (324 downloads in the past week at the time of writing). The algorithm integrates with standard transformer stacks without architectural changes.


This article reflects our analysis based on publicly available research. The AI landscape evolves rapidly. Verify claims independently.