Zing Forum

Reading

TurboQuant+: Production-Ready LLM KV Cache and Weight Quantization Technology

An extension implementation for llama.cpp based on Google's TurboQuant paper, achieving a 4.6x KV cache compression ratio via Walsh-Hadamard rotation and polar codebook quantization technology, while supporting cross-platform backends (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan).

LLM量化KV缓存TurboQuantllama.cpp推理优化Flash Attention跨平台开源项目
Published 2026-05-20 02:14Recent activity 2026-05-20 02:20Estimated read 7 min
TurboQuant+: Production-Ready LLM KV Cache and Weight Quantization Technology
1

Section 01

TurboQuant+ Overview: Production-Grade LLM KV Cache & Weight Quantization

TurboQuant+ is a production-level implementation of Google's TurboQuant paper as an extension to llama.cpp. It uses Walsh-Hadamard rotation and polar codebook quantization to achieve up to 4.6x KV cache compression while maintaining model quality. Key features include cross-platform backend support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and an additive design that preserves existing llama.cpp functionality.

2

Section 02

Background: LLM Inference Memory Bottleneck & Traditional Quantization Limitations

LLM inference faces memory bottlenecks due to linearly expanding KV cache with sequence length. Traditional MSE-based quantization fails for KV cache because:

  • Key (K): Extremely sensitive to errors (amplified by softmax, shifting attention distribution).
  • Value (V): More tolerant (error smoothed by attention weights). TurboQuant+ addresses this with asymmetric K/V compression strategies, as detailed in its companion paper Asymmetric K/V Cache Compression: Why V is Free and K is Everything.
3

Section 03

Core Technology: Walsh-Hadamard Rotation & Polar Codebook Quantization

TurboQuant+'s core algorithm involves two steps:

  1. Walsh-Hadamard Transform (WHT): Applied to 128-element blocks to flatten energy distribution, reducing outlier sensitivity and improving codebook utilization.
  2. Polar Codebook Quantization: Divides space into regions of varying reliability, assigning higher bit precision to more important regions (unlike uniform or k-means clustering).
4

Section 04

Quantization Format System: Weight & KV Cache Options

Weight Quantization:

  • TQ3_1S (~3.5 bits/weight): For resource-constrained scenarios.
  • TQ4_1S (~4.5 bits/weight): 3.5x speedup on NVIDIA (240 token/s vs baseline 68 token/s) via Metal fusion kernels and CUDA dp4a.

KV Cache Quantization:

  • Turbo2 (~2.0 bits): Radical compression (use with Boundary V protection).
  • Turbo3 (~3.5 bits): Core result (4.6x compression, <1.5% PPL loss).
  • Turbo4 (~4.5 bits): Surpasses q4_0 fidelity after quality fixes.
5

Section 05

Cross-Platform Backend Support

Apple Silicon (Metal):

  • TurboFlash (Flash Attention optimized for unified memory).
  • Sparse V decompression (skip low-weight positions).
  • Gemma4 support (dk=512 Flash Attention, MoE routing).
  • TurboFlash disabled on Apple10 (data corruption investigation).

NVIDIA CUDA:

  • dp4a instruction optimization for TQ4_1S.
  • Warp collaborative decompression (16x less per-block computation).
  • Multi-token/multi-GPU support; VEC Flash Attention (9% speedup for turbo formats).

AMD HIP/ROCm:

  • Portable dp4a (RDNA3/4, CDNA3/4).
  • Scalar half path for TQ4_1S fallback.
  • Forced vector Flash Attention for quantized KV.

Vulkan:

  • Compute shader path (nix-buildable).
  • Coopmat Flash Attention (supports turbo3).
6

Section 06

Key Technical Innovations

  1. Auto Asymmetric K/V Compression: Defaults to conservative K compression and radical V compression for balance.
  2. Boundary V (Layer-Aware Protection): Experimental feature for turbo2-V—protects layers where V quantization harms quality.
  3. Attention-Gated Sparse V Decompression: Skips low-weight V positions (saves compute on long sequences).
7

Section 07

Deployment Recommendations & Production Integration

Deployment Principle: "Start light, compress gradually" (start with lightweight asymmetric config, verify quality, incrementally tighten V compression). Avoid maximal compression first (irreversible quality loss possible).

Production Users: LocalAI (OpenAI-compatible API), Chronara (quantum-safe fintech), AtomicChat (end-side chat).

Llama.cpp Relation: Additive design—existing features work; new formats enabled via --cache-type-k/--cache-type-v and llama-quantize. Syncs with upstream master.

8

Section 08

Performance Benchmarks & Conclusion

Benchmarks: Turbo3 achieves ~4.6x KV compression with <1% PPL loss (matches Google's original paper).

Conclusion: TurboQuant+ balances quality and efficiency by leveraging attention mechanism insights. Its cross-platform support and production stability make it ideal for resource-constrained LLM deployment—no binary choice between model capability and efficiency.