Pure C Implementation Design Philosophy
The design prioritizes readability, modifiability, and embeddability, so the forward pass is concentrated in a single file. The modular structure makes it possible to add custom quantization types, swap out attention kernels, and so on. The code has zero framework dependencies and supports multiple platforms (Linux, macOS, Windows, iOS, Android, WASM).
Delta KV Cache Compression Technology
A traditional KV cache stores every key vector in full. In Delta mode, quant.cpp instead stores the difference between adjacent key vectors, much like P-frames in video coding. Because neighboring keys are similar, this difference is small (only about 30% relative to the full key), so it can be quantized with fewer bits. In the reported experiments, 3-bit quantization without Delta increases perplexity (PPL) by 62%, whereas with Delta the increase is only 1.3%.
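The idea can be illustrated with a minimal sketch in C. This is not quant.cpp's actual code: the names `qvec_t`, `quantize`, `delta_encode`, and `delta_decode`, the toy `HEAD_DIM`, and the per-vector symmetric scale are all hypothetical choices made for illustration. The sketch also assumes the delta is taken against the previously reconstructed key (rather than the original one), so that the encoder and decoder stay in sync.

```c
#include <assert.h>
#include <stdint.h>

#define HEAD_DIM 8 /* hypothetical toy size; real head dimensions are larger */

/* One quantized vector: a per-vector scale plus signed integer codes. */
typedef struct {
    float  scale;
    int8_t q[HEAD_DIM];
} qvec_t;

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Symmetric uniform quantization of x to `bits` bits per element. */
static void quantize(const float *x, int bits, qvec_t *out) {
    assert(bits >= 2 && bits <= 8);
    const int qmax = (1 << (bits - 1)) - 1; /* 3 for 3-bit codes */
    float amax = 0.0f;
    for (int i = 0; i < HEAD_DIM; i++)
        if (absf(x[i]) > amax) amax = absf(x[i]);
    out->scale = amax > 0.0f ? amax / (float)qmax : 1.0f;
    for (int i = 0; i < HEAD_DIM; i++) {
        float v = x[i] / out->scale;
        int r = (int)(v >= 0.0f ? v + 0.5f : v - 0.5f); /* round to nearest */
        if (r > qmax) r = qmax;
        if (r < -qmax) r = -qmax;
        out->q[i] = (int8_t)r;
    }
}

static void dequantize(const qvec_t *in, float *x) {
    for (int i = 0; i < HEAD_DIM; i++) x[i] = in->scale * (float)in->q[i];
}

/* Encode `key` as a quantized delta against the previously reconstructed
 * key, then advance `recon` to the reconstruction of `key`. */
static void delta_encode(const float *key, float *recon, int bits, qvec_t *out) {
    float delta[HEAD_DIM], dq[HEAD_DIM];
    for (int i = 0; i < HEAD_DIM; i++) delta[i] = key[i] - recon[i];
    quantize(delta, bits, out);
    dequantize(out, dq);
    for (int i = 0; i < HEAD_DIM; i++) recon[i] += dq[i];
}

/* Decoder side: add the dequantized delta onto the running reconstruction. */
static void delta_decode(const qvec_t *in, float *recon) {
    float dq[HEAD_DIM];
    dequantize(in, dq);
    for (int i = 0; i < HEAD_DIM; i++) recon[i] += dq[i];
}

/* Largest element-wise absolute difference between two vectors. */
static float max_abs_diff(const float *a, const float *b) {
    float m = 0.0f;
    for (int i = 0; i < HEAD_DIM; i++)
        if (absf(a[i] - b[i]) > m) m = absf(a[i] - b[i]);
    return m;
}
```

Because the quantization step is proportional to the largest value in the vector being quantized, a delta with a small range gets a much finer step than the full key would at the same bit width. Both `delta_encode` and `delta_decode` expect `recon` to start from a key that is known exactly on both sides.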
Multi-level Quantization Configuration
quant.cpp offers several configurations:
- Delta + 3-bit K + Q4 V: ~4.3x compression, PPL +1.3% (max context scenario)
- Delta + 4-bit K + Q4 V: ~3.8x compression, almost no PPL loss (balanced first choice)
- Uniform 4-bit K + Q4 V: 3.8x compression, PPL decreases by 7.8% (no Delta overhead)
An FP32 anchor is stored every 64 tokens so that quantization error cannot accumulate indefinitely across a chain of deltas.
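The cost and reach of the anchors can be estimated with a small sketch. It assumes, purely for illustration, that anchors sit at token positions that are multiples of 64 and that every other key is reached by adding quantized deltas on top of the nearest preceding anchor. The helper names are hypothetical, and the bit estimate ignores per-vector scale metadata and the value cache, so it is not expected to reproduce the compression ratios listed above.

```c
#include <assert.h>
#include <stddef.h>

#define ANCHOR_INTERVAL 64 /* from the text: one FP32 anchor every 64 tokens */

/* Position of the FP32 anchor that token t is reconstructed from
 * (assumes anchors at positions 0, 64, 128, ...). */
static size_t anchor_of(size_t t) { return t - (t % ANCHOR_INTERVAL); }

/* Number of quantized deltas added on top of that anchor to reach token t. */
static size_t deltas_to_apply(size_t t) { return t % ANCHOR_INTERVAL; }

/* Average bits per key element for one group of `interval` tokens:
 * one 32-bit anchor plus (interval - 1) deltas of `delta_bits` each.
 * Ignores scale metadata and the value cache. */
static double avg_key_bits(int delta_bits, int interval) {
    assert(delta_bits > 0 && interval > 0);
    return (32.0 + (double)(interval - 1) * (double)delta_bits) / (double)interval;
}
```

Under these assumptions, 3-bit deltas with a 64-token interval average 221/64, or about 3.45 bits per key element, so the anchors cost less than half a bit per element. Error can build up over at most 63 consecutive deltas before the next anchor resets it.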