Inference in large language models (LLMs) typically consists of two stages: the Prefill stage, which is compute-bound, and the Decode stage, which is bound by memory capacity and bandwidth.
As context lengths continue to grow, models must maintain an ever larger KV Cache (key-value cache), whose size scales linearly with the number of cached tokens. This poses significant challenges for memory management.
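To make the scale concrete, the following is a rough back-of-the-envelope sketch of KV Cache size for a standard Transformer decoder. The model dimensions used here (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from the Lattice project, and the function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV Cache size in bytes for a decoder-only Transformer.

    Each cached token stores one key vector and one value vector
    (hence the factor of 2) per layer and per KV head.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Assumed, illustrative model shape (not taken from the source document).
    layers, kv_heads, head_dim = 32, 32, 128
    for seq_len in (4096, 32768):
        size = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)
        print(f"{seq_len} tokens: {size / 2**30:.1f} GiB per sequence")
```

Under these assumptions each token costs 512 KiB of cache, so a single 32K-token sequence already needs about 16 GiB before any batching.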
Traditional OS memory allocation mechanisms tend to cause severe fragmentation when handling such large, dynamically changing GPU memory demands. Fragmentation not only limits effective GPU utilization but also directly degrades inference throughput and latency. Existing inference frameworks such as vLLM and SGLang have made many optimizations at the application layer, but they remain constrained by the underlying memory management mechanisms of the OS.
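One way to see why fragmentation hurts utilization is a toy comparison between reserving a contiguous maximum-length region per request and allocating in small fixed-size blocks. This is only an illustrative sketch: the request lengths, the 2048-token maximum, and the 16-token block size are assumptions, and the code does not model any particular OS allocator or the Lattice design.

```python
def reserved_contiguous(lengths, max_len):
    """Tokens reserved when every request gets a max_len-sized region."""
    return len(lengths) * max_len


def reserved_blocks(lengths, block_size):
    """Tokens reserved when each request gets just enough fixed-size blocks."""
    # -(-n // b) is ceiling division for positive integers.
    return sum(-(-n // block_size) * block_size for n in lengths)


def utilization(lengths, reserved):
    """Fraction of reserved token slots that actually hold data."""
    return sum(lengths) / reserved


if __name__ == "__main__":
    # Assumed sequence lengths (in tokens) of four concurrent requests.
    lengths = [100, 900, 2048, 37]
    contiguous = reserved_contiguous(lengths, max_len=2048)
    blocks = reserved_blocks(lengths, block_size=16)
    print(f"contiguous: {utilization(lengths, contiguous):.1%} of reserved memory used")
    print(f"block-based: {utilization(lengths, blocks):.1%} of reserved memory used")
```

With these assumed lengths, contiguous reservation leaves well over half of the reserved slots empty, while block-granular allocation wastes only the unused tail of each request's last block.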
The core idea of the Lattice project is to push these optimizations down to the OS level, addressing the performance bottlenecks of LLM inference at their root through kernel-level memory management and network optimization.