modelai-llama.cpp provides nine compression methods to suit different application scenarios and performance requirements:
The select method is the fastest and is the default, making it suitable for latency-sensitive scenarios. It decides which KV vectors to keep using simple heuristic rules, adding minimal computational overhead.
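To make the idea concrete, here is a minimal sketch of one common heuristic for choosing which KV entries to keep: retain the earliest "attention sink" tokens plus a recent window and drop the middle. This is an illustrative assumption, not modelai-llama.cpp's actual selection rule, and all names (`select_kv_indices`, `n_sink`, `n_recent`) are hypothetical.

```python
# Hypothetical heuristic KV selection (illustrative only, not the
# project's implementation): keep the first n_sink tokens and the
# last n_recent tokens, evicting everything in between.

def select_kv_indices(seq_len: int, n_sink: int = 4, n_recent: int = 60) -> list[int]:
    """Return the indices of KV-cache entries to keep."""
    if seq_len <= n_sink + n_recent:
        return list(range(seq_len))                        # nothing to evict yet
    keep = list(range(n_sink))                             # attention-sink tokens
    keep += list(range(seq_len - n_recent, seq_len))       # recent window
    return keep

print(select_kv_indices(10))          # short sequence: keep everything
print(len(select_kv_indices(4096)))   # long sequence: only 4 + 60 = 64 kept
```

The appeal of such rules is that they cost O(1) per token and require no extra model evaluation, which is why selection-style methods tend to be the low-latency default.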
The solver method trades speed for higher compression quality: it solves an optimization problem to find the best compressed representation. It supports running on Apple Silicon GPUs, taking full advantage of Metal acceleration.
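The core idea of a solver-style compressor can be illustrated with the simplest possible case: approximating a vector v by a multiple of a fixed basis vector b, choosing the coefficient c that minimizes the squared reconstruction error ||v - c*b||². This toy example is an assumption for illustration, not the project's actual objective or algorithm.

```python
# Illustrative least-squares sketch of the "solve an optimization
# problem for the best compressed representation" idea (hypothetical,
# not modelai-llama.cpp's solver). The closed-form minimizer of
# ||v - c*b||^2 is c = (v . b) / (b . b).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def best_coefficient(v, b):
    """Least-squares coefficient for approximating v by c*b."""
    return dot(v, b) / dot(b, b)

v = [2.0, 4.0, 6.0]
b = [1.0, 2.0, 3.0]
print(best_coefficient(v, b))   # v is exactly 2*b here, so c = 2.0
```

A useful property of the least-squares solution is that the residual v - c*b is orthogonal to b; real solver-based methods generalize this to higher-rank representations, which is why they cost more compute but reconstruct the cache more faithfully.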
Other methods include omp (Orthogonal Matching Pursuit), self_study (Self-Learning Compression), chunked (Chunked Compression), on_policy (Policy Gradient Optimization), nonuniform (Non-Uniform Compression), sequential_on_policy (Sequential Policy Optimization), and context_prefill (Context Prefill). Each method has its own applicable scenarios and trade-offs.
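Of these, Orthogonal Matching Pursuit is a well-known general technique: greedily pick the dictionary atom most correlated with the current residual, subtract its contribution, and repeat until the sparsity budget is reached. The sketch below shows that general algorithm, simplified to an orthonormal dictionary so the projection step is trivial; it is not the project's omp implementation, and all names in it are illustrative.

```python
# Minimal Orthogonal Matching Pursuit sketch (general technique, not
# modelai-llama.cpp's code), restricted to an orthonormal dictionary
# so each projection reduces to a single dot product.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def omp_orthonormal(signal, atoms, n_nonzero):
    """Greedily select n_nonzero atoms that best explain the signal."""
    residual = list(signal)
    chosen = {}                                   # atom index -> coefficient
    for _ in range(n_nonzero):
        # pick the atom most correlated with the current residual
        best = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        coeff = dot(residual, atoms[best])        # exact for orthonormal atoms
        chosen[best] = chosen.get(best, 0.0) + coeff
        residual = [r - coeff * a for r, a in zip(residual, atoms[best])]
    return chosen

atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(omp_orthonormal([3.0, 0.0, 5.0], atoms, 2))   # {2: 5.0, 0: 3.0}
```

With a non-orthonormal dictionary, the full algorithm re-solves a least-squares problem over all selected atoms at every step, which is where the "orthogonal" in the name comes from.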