The benchmarking framework implements three KV cache modes, which together define the axis along which all results are compared:
Standard Mode: Stores keys and values in FP16. This is the default configuration of MLX-LM and serves as the reference for both output quality and performance. No compression is applied, so this mode keeps the highest numerical precision but also has the largest memory footprint.
MLX-Quantized Mode: Uses MLX-LM's built-in QuantizedKVCache to quantize both keys and values to 4 bits. As the native quantization option in the MLX / Apple Silicon ecosystem, it is the main competing baseline for TurboQuant.
TurboQuant Mode: Compresses keys to 3 bits and values to 2 bits. This is the core subject of the project and combines several techniques: random rotation, a Lloyd-Max codebook, and a QJL (quantized Johnson-Lindenstrauss) sign sketch.
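To make the memory trade-off between the three modes concrete, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions, not a measurement: the model dimensions (32 layers, 8 KV heads, head dimension 128, 8192-token context) are hypothetical; the 4-bit figure assumes MLX-style affine group quantization with a group size of 64 and a 16-bit scale plus a 16-bit bias per group; and the TurboQuant figure uses only the nominal 3-bit/2-bit widths, ignoring any per-vector norms or other side information the real implementation may store. The helper names are illustrative.

```python
# Back-of-the-envelope KV cache size for the three modes.
# All model dimensions are hypothetical and chosen only for illustration.


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, key_bits: float, value_bits: float) -> float:
    """Bytes needed to store keys and values for one sequence.

    key_bits / value_bits are *effective* bits per element, i.e. they
    already include any per-group metadata such as scales and biases.
    """
    elements_per_tensor = num_layers * num_kv_heads * head_dim * seq_len
    return elements_per_tensor * (key_bits + value_bits) / 8


def group_quant_bits(bits: int, group_size: int, meta_bits: int = 32) -> float:
    """Effective bits per element for group quantization, assuming one
    16-bit scale and one 16-bit bias (32 bits total) per group."""
    return bits + meta_bits / group_size


dims = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

standard = kv_cache_bytes(**dims, key_bits=16, value_bits=16)
mlx_q4 = kv_cache_bytes(**dims,
                        key_bits=group_quant_bits(4, 64),
                        value_bits=group_quant_bits(4, 64))
# Nominal TurboQuant widths only; side information is deliberately ignored.
turboquant = kv_cache_bytes(**dims, key_bits=3, value_bits=2)

GIB = 2 ** 30
for name, size in [("Standard (FP16)", standard),
                   ("MLX-Quantized (4-bit)", mlx_q4),
                   ("TurboQuant (3-bit K / 2-bit V)", turboquant)]:
    print(f"{name:32s} {size / GIB:.4f} GiB "
          f"({standard / size:.2f}x smaller than FP16)")
```

Under these assumptions the script reports 1.0 GiB for FP16, about 0.28 GiB (3.56x smaller) for 4-bit, and about 0.16 GiB (6.4x smaller) for TurboQuant. Actual savings depend on the model and on the metadata each method really keeps.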
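A Lloyd-Max codebook is the scalar quantizer that minimizes mean squared error for a known input distribution. A common motivation for pairing it with a random rotation is that, after a vector is rotated by a random orthogonal matrix, each coordinate is approximately Gaussian, so one fixed codebook can be precomputed offline and reused. The sketch below computes such a codebook for a standard normal input by alternating the two Lloyd-Max conditions: decision thresholds at midpoints between levels, and levels at the conditional mean of their cell. It assumes a standard normal input, uses hypothetical function names, and is not the project's own code.

```python
import bisect
import math


def _normal_pdf(x: float) -> float:
    """Standard normal density; evaluates to 0.0 at +/-inf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _midpoints(centroids):
    """Decision thresholds halfway between neighbouring centroids."""
    return [(a + b) / 2.0 for a, b in zip(centroids, centroids[1:])]


def lloyd_max_gaussian(bits: int, tol: float = 1e-12, max_iters: int = 10_000):
    """Hypothetical helper: MSE-optimal scalar codebook for N(0, 1).

    Returns (centroids, thresholds), both sorted ascending.
    """
    levels = 2 ** bits
    # Start from evenly spaced centroids on [-2, 2].
    centroids = [-2.0 + 4.0 * (i + 0.5) / levels for i in range(levels)]
    for _ in range(max_iters):
        edges = [-math.inf] + _midpoints(centroids) + [math.inf]
        # Centroid condition: each level is the conditional mean of its cell,
        # E[X | lo < X < hi] = (pdf(lo) - pdf(hi)) / (cdf(hi) - cdf(lo)).
        updated = [
            (_normal_pdf(lo) - _normal_pdf(hi)) / (_normal_cdf(hi) - _normal_cdf(lo))
            for lo, hi in zip(edges, edges[1:])
        ]
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift < tol:
            break
    return centroids, _midpoints(centroids)


def quantize(x: float, thresholds) -> int:
    """Index of the codebook cell containing x (the value that gets stored)."""
    return bisect.bisect_left(thresholds, x)


c2, t2 = lloyd_max_gaussian(bits=2)
print("2-bit levels:", [round(c, 4) for c in c2])
print("index for 0.1:", quantize(0.1, t2))
```

For 1 bit the iteration lands on the levels plus and minus sqrt(2/pi), about 0.798, and for 2 bits on roughly plus and minus 0.453 and 1.510. Coordinates with variance 1/d, as for a rotated unit vector in d dimensions, would use this codebook scaled by 1/sqrt(d).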
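The idea behind a QJL sign sketch can be illustrated with the estimator from the QJL (quantized Johnson-Lindenstrauss) work: project a key k with a random Gaussian matrix S of m rows, keep only the signs of the projection plus the key's L2 norm, and estimate its inner product with an unquantized query q as sqrt(pi/2) / m * ||k|| * <Sq, sign(Sk)>, which is an unbiased estimate of <q, k>. This pure-Python sketch uses hypothetical helper names and small illustrative sizes; it is not the project's implementation, and it does not show how TurboQuant combines QJL with its other stages.

```python
import math
import random


def gaussian_matrix(m: int, d: int, rng: random.Random):
    """Random projection S with i.i.d. N(0, 1) entries (m rows, d columns)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]


def matvec(S, x):
    """Plain matrix-vector product S @ x."""
    return [sum(s * v for s, v in zip(row, x)) for row in S]


def qjl_encode(S, key):
    """Keep one sign bit per projected coordinate plus the key's L2 norm."""
    signs = [1 if p >= 0 else -1 for p in matvec(S, key)]
    norm = math.sqrt(sum(v * v for v in key))
    return signs, norm


def qjl_inner_product(S, query, signs, key_norm):
    """Unbiased estimate of <query, key>; the query itself is not quantized."""
    m = len(S)
    projected = matvec(S, query)
    correlation = sum(p * s for p, s in zip(projected, signs))
    return math.sqrt(math.pi / 2.0) / m * key_norm * correlation


rng = random.Random(0)
d, m = 16, 4096  # illustrative sizes, not taken from the project
S = gaussian_matrix(m, d, rng)
key = [rng.gauss(0.0, 1.0) for _ in range(d)]
query = [rng.gauss(0.0, 1.0) for _ in range(d)]

signs, key_norm = qjl_encode(S, key)
estimate = qjl_inner_product(S, query, signs, key_norm)
exact = sum(q * k for q, k in zip(query, key))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The estimate's standard deviation is at most sqrt(pi/2) * ||k|| * ||q|| / sqrt(m), so the sketch width m trades memory (one bit per row) for accuracy.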