The KV cache is the memory bottleneck for long-context inference. hipfire's TurboQuant technology achieves aggressive compression via the FWHT (Fast Walsh-Hadamard Transform):
| Configuration | Compression Ratio | Generation Speed | Output Quality |
|---------------|-------------------|------------------|----------------|
| Q8 (default) | 3.88x | 59.9 tok/s | Good |
| turbo4 (4-bit) | 7.5x | 54.5 tok/s | Good |
| turbo3 (3-bit) | 9.85x | 52.0 tok/s | Good |
| turbo2 (2-bit) | 14.2x | 55.1 tok/s | Good |
The core innovation of TurboQuant is norm-corrected quantization:
- Normalize each KV vector to unit L2 norm
- Perform the FWHT rotation via register-level `__shfl_xor` operations (no shared-memory barriers)
- Quantize to optimal centroids using the Lloyd-Max algorithm
- Store the ratio of original norm to reconstructed norm for correction
This design preserves each vector's L2 norm exactly and decorrelates quantization errors, allowing even 2-bit compression to maintain semantic coherence.
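The steps above can be sketched in NumPy. This is an illustrative re-implementation, not hipfire's actual CUDA kernel: the FWHT here is a naive butterfly loop rather than warp shuffles, and simple uniform quantization stands in for the Lloyd-Max centroids. The norm correction (step 4) is the key point: because the FWHT is orthogonal, rescaling the reconstruction by the stored ratio restores the original L2 norm exactly.

```python
# Illustrative sketch of norm-corrected quantization (assumptions: naive
# FWHT instead of __shfl_xor butterflies, uniform levels instead of
# Lloyd-Max centroids; function names are hypothetical).
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard Transform with orthonormal scaling (self-inverse)."""
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly pair
        h *= 2
    return x / np.sqrt(n)

def quantize_norm_corrected(v, bits=2):
    # 1. Normalize to unit L2 norm.
    norm = np.linalg.norm(v)
    u = v / norm
    # 2. Rotate with the FWHT to spread energy and decorrelate errors.
    r = fwht(u)
    # 3. Quantize (uniform levels as a stand-in for Lloyd-Max centroids).
    levels = 2 ** bits
    lo, step = r.min(), (r.max() - r.min()) / (levels - 1)
    codes = np.round((r - lo) / step).astype(np.int64)
    # 4. Store the ratio of original norm to reconstructed norm.
    rq = lo + codes * step
    scale = norm / np.linalg.norm(rq)
    return codes, lo, step, scale

def dequantize(codes, lo, step, scale):
    rq = lo + codes * step
    # The orthonormal FWHT is its own inverse; scale restores the norm.
    return fwht(rq) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(64)
codes, lo, step, scale = quantize_norm_corrected(v, bits=2)
v_hat = dequantize(codes, lo, step, scale)
print(np.isclose(np.linalg.norm(v_hat), np.linalg.norm(v)))  # norm preserved
```

Note that the error correction is purely a scalar per vector: only `scale` (one float) is stored alongside the 2-bit codes, so the overhead against the compression ratio is negligible.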