Model Quantization: AWQ INT4
Adopts Activation-aware Weight Quantization (AWQ) to compress the 7B model's weights from FP16 (≈14 GB) to INT4 (≈3.5 GB), freeing memory for the KV cache and longer contexts.
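To make the numbers concrete, here is a minimal sketch of the group-wise INT4 storage scheme that AWQ builds on; the activation-aware scale search that gives AWQ its name is omitted, and the group size of 128 is an assumption, not a confirmed configuration.

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Symmetric group-wise INT4 quantization of a weight matrix.

    AWQ layers an activation-aware per-channel scale search on top of a
    scheme like this; that search is omitted in this sketch.
    """
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    # One scale per group so the largest magnitude maps to 7 (INT4 range -8..7).
    scales = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(g / scales), -8, 7).to(torch.int8)
    return q, scales  # real kernels pack two 4-bit values into each byte

def dequantize_int4_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    return (q.float() * scales).reshape(q.shape[0], -1)

# 7B params: 2 bytes each in FP16 ≈ 14 GB; 0.5 bytes each in INT4 ≈ 3.5 GB
# (plus a small overhead for the per-group scales).
w = torch.randn(4096, 4096)
q, s = quantize_int4_groupwise(w)
print((w - dequantize_int4_groupwise(q, s)).abs().mean())  # small error
```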
KV Cache Compression: nuq4
Compresses the KV cache with a non-uniform 4-bit quantization scheme that allocates more quantization levels to densely populated value ranges, preserving the information that matters most to attention.
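A minimal sketch of what non-uniform 4-bit quantization can look like, using plain 1-D k-means to fit a 16-entry codebook; this is an illustrative stand-in, since the actual nuq4 calibration (e.g., sensitivity weighting, per-channel handling) is not specified here.

```python
import torch

def fit_nuq4_codebook(x: torch.Tensor, iters: int = 10) -> torch.Tensor:
    # Plain 1-D k-means over a sample of the values: densely populated
    # regions of the distribution end up claiming more of the 16 levels.
    flat = x.flatten().float()
    if flat.numel() > 100_000:  # subsample for calibration speed
        flat = flat[torch.randperm(flat.numel())[:100_000]]
    centroids = torch.quantile(flat, torch.linspace(0, 1, 16))
    for _ in range(iters):
        assign = (flat[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for k in range(16):
            members = flat[assign == k]
            if members.numel() > 0:
                centroids[k] = members.mean()
    return centroids

def nuq4_quantize(x: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    # Each element is stored as a 4-bit index into the codebook.
    flat = x.flatten().float()
    return (flat[:, None] - centroids[None, :]).abs().argmin(dim=1).to(torch.uint8)

def nuq4_dequantize(idx, centroids, shape) -> torch.Tensor:
    return centroids[idx.long()].reshape(shape)

k_cache = torch.randn(256, 128)  # toy slice of a KV cache
cb = fit_nuq4_codebook(k_cache)
idx = nuq4_quantize(k_cache, cb)
print((k_cache - nuq4_dequantize(idx, cb, k_cache.shape)).abs().mean())
```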
Attention Optimization: Quest top-K
Uses query-guided sparse attention: each query attends only to the K most relevant historical positions, reducing attention complexity from O(n²) to O(n×K).
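A single-head, single-query sketch of the page-level scoring idea behind Quest: pages of the KV cache are ranked by an upper bound on q·k computed from per-page element-wise min/max key summaries, and only the top-K pages are attended to. The page_size and top_k values here are illustrative.

```python
import torch
import torch.nn.functional as F

def quest_attention(q, keys, values, page_size=16, top_k=4):
    n, d = keys.shape
    pages = n // page_size
    k_pages = keys[: pages * page_size].reshape(pages, page_size, d)
    v_pages = values[: pages * page_size].reshape(pages, page_size, d)
    # Per-page element-wise min/max summaries of the keys.
    k_min, k_max = k_pages.amin(dim=1), k_pages.amax(dim=1)
    # For each page, an upper bound on q . k over every key in the page.
    bound = torch.maximum(q * k_min, q * k_max).sum(dim=-1)
    top = bound.topk(min(top_k, pages)).indices
    k_sel = k_pages[top].reshape(-1, d)
    v_sel = v_pages[top].reshape(-1, d)
    # Dense attention over only top_k * page_size positions instead of n.
    attn = F.softmax(k_sel @ q / d ** 0.5, dim=0)
    return attn @ v_sel

q = torch.randn(64)
keys, values = torch.randn(1024, 64), torch.randn(1024, 64)
out = quest_attention(q, keys, values, page_size=16, top_k=8)
```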
Ultra-Long Document Support: EM-LLM RAG
Splits ultra-long documents into chunks and builds a hierarchical index over them. At inference time, it retrieves the most relevant chunks and resolves cross-chunk dependencies through an evidence-fusion mechanism.
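A minimal sketch of the chunk-index-retrieve-fuse pipeline, assuming some external embedding model produces chunk_vecs and query_vec; fuse_evidence is a naive neighbor-expansion stand-in for the evidence-fusion mechanism, not EM-LLM's actual algorithm, and the hierarchical index is reduced here to flat cosine search.

```python
import numpy as np

def chunk_document(text: str, chunk_tokens: int = 512, overlap: int = 64):
    # Token counts approximated by whitespace words for this sketch.
    words = text.split()
    step = chunk_tokens - overlap
    return [" ".join(words[i : i + chunk_tokens])
            for i in range(0, max(len(words) - overlap, 1), step)]

def retrieve(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5):
    # Indices of the k most similar chunks by cosine similarity.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    return np.argsort(-sims)[:k]

def fuse_evidence(chunks, hits):
    # Pull in each hit plus its neighbors so an answer that spans a
    # chunk boundary stays intact.
    keep = sorted({j for i in hits for j in (i - 1, i, i + 1)
                   if 0 <= j < len(chunks)})
    return "\n\n".join(chunks[j] for j in keep)
```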
Hot-Cold Data Exchange
Active (hot) context stays in GPU memory, while historical (cold) context is swapped out to CPU memory or disk and loaded back on demand.
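A minimal sketch of a two-tier KV page cache with LRU eviction (the disk tier is omitted, and the class and its policy are illustrative, not the system's actual implementation):

```python
import torch

class TieredKVCache:
    """Keep hot KV pages on the GPU; evict cold pages to pinned CPU
    memory and fetch them back on demand."""

    def __init__(self, max_gpu_pages: int):
        self.max_gpu_pages = max_gpu_pages
        self.gpu = {}   # page_id -> GPU tensor
        self.cpu = {}   # page_id -> pinned CPU tensor
        self.lru = []   # least-recently-used page ids first

    def _touch(self, page_id):
        if page_id in self.lru:
            self.lru.remove(page_id)
        self.lru.append(page_id)

    def put(self, page_id, kv: torch.Tensor):
        self.gpu[page_id] = kv.cuda(non_blocking=True)
        self._touch(page_id)
        self._evict_if_needed()

    def get(self, page_id) -> torch.Tensor:
        if page_id not in self.gpu:  # cold hit: bring back from CPU
            self.gpu[page_id] = self.cpu.pop(page_id).cuda(non_blocking=True)
        self._touch(page_id)
        self._evict_if_needed()
        return self.gpu[page_id]

    def _evict_if_needed(self):
        while len(self.gpu) > self.max_gpu_pages:
            victim = self.lru.pop(0)
            # Pinned so the eventual copy back to GPU can be asynchronous.
            self.cpu[victim] = self.gpu.pop(victim).cpu().pin_memory()
```

Pinning the evicted copies lets reloads overlap with compute on a separate CUDA stream; a disk tier would hang off the same eviction path.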
Custom Triton Kernels
Optimizes key operators, such as nuq4 dequantization, Quest attention, and EM-LLM retrieval, with hand-written Triton kernels, fusing memory-bound steps so the heavy matrix multiplies can run at full Tensor Core throughput.
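For flavor, a minimal Triton kernel in this spirit: it unpacks two 4-bit values per byte and applies a single per-tensor scale. A real nuq4 kernel would instead index a 16-entry codebook with per-group scales, so the names, the symmetric -8 offset, and the per-tensor scale are all illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def dequant_int4_kernel(packed_ptr, scale_ptr, out_ptr, n_packed,
                        BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_packed
    byte = tl.load(packed_ptr + offs, mask=mask, other=0).to(tl.int32)
    scale = tl.load(scale_ptr)  # one per-tensor scale, for simplicity
    lo = ((byte & 0x0F) - 8).to(tl.float32)  # low nibble  -> [-8, 7]
    hi = ((byte >> 4) - 8).to(tl.float32)    # high nibble -> [-8, 7]
    tl.store(out_ptr + 2 * offs, lo * scale, mask=mask)
    tl.store(out_ptr + 2 * offs + 1, hi * scale, mask=mask)

def dequant_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    out = torch.empty(packed.numel() * 2, device=packed.device,
                      dtype=torch.float32)
    grid = (triton.cdiv(packed.numel(), 1024),)
    dequant_int4_kernel[grid](packed, scale, out, packed.numel(), BLOCK=1024)
    return out

packed = torch.randint(0, 256, (8192,), dtype=torch.uint8, device="cuda")
scale = torch.tensor([0.05], device="cuda")
w = dequant_int4(packed, scale)
```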