As large language models (LLMs) keep growing in parameter count, optimizing the performance of inference services has become a core challenge for AI infrastructure. Traditional single-node inference solutions often struggle with long contexts and high-concurrency workloads. DLEngine is an open-source, high-performance LLM inference engine developed by the DeepLink-org team and designed specifically for production environments. Its architecture aims to balance low latency against high throughput.
The project is not a thin wrapper around vLLM or TensorRT-LLM; it redesigns the inference pipeline from the ground up. Its two core features are a Prefill-Decode disaggregation architecture and a Wide Expert Parallelism strategy, which make it particularly well suited to MoE (Mixture of Experts) models.
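To make the idea of Prefill-Decode disaggregation concrete, here is a minimal toy sketch in Python. It is not DLEngine's API: the names `KVCache`, `PrefillWorker`, `DecodeWorker`, and `serve` are hypothetical, and the "model" is a stand-in arithmetic rule. It only illustrates the general pattern, in which one worker processes the whole prompt once and hands a key-value cache to a separate worker that generates tokens step by step.

```python
from dataclasses import dataclass, field
from typing import List

VOCAB_SIZE = 100  # toy vocabulary size (assumption, for illustration only)


@dataclass
class KVCache:
    """Stand-in for attention keys/values: one integer per processed token."""
    entries: List[int] = field(default_factory=list)


class PrefillWorker:
    """Hypothetical worker for the prefill phase: reads the full prompt once."""

    def run(self, prompt_tokens: List[int]) -> KVCache:
        # A real engine would run a forward pass over all prompt tokens here.
        return KVCache(entries=[t * 2 for t in prompt_tokens])


class DecodeWorker:
    """Hypothetical worker for the decode phase: emits one token per step."""

    def run(self, cache: KVCache, max_new_tokens: int) -> List[int]:
        output: List[int] = []
        for _ in range(max_new_tokens):
            # Toy "model": the next token depends on everything cached so far.
            next_token = sum(cache.entries) % VOCAB_SIZE
            output.append(next_token)
            # Extend the cache by one entry; the prompt is never reprocessed.
            cache.entries.append(next_token * 2)
        return output


def serve(prompt_tokens: List[int], max_new_tokens: int) -> List[int]:
    """Run prefill and decode as two separate stages joined by a cache handoff."""
    cache = PrefillWorker().run(prompt_tokens)
    # In a disaggregated deployment this handoff would cross process or
    # machine boundaries; here it is a plain in-memory object.
    return DecodeWorker().run(cache, max_new_tokens)


if __name__ == "__main__":
    print(serve([1, 2, 3], max_new_tokens=3))
```

Because the two phases only share the cache object, they can in principle be scaled and scheduled independently, which is the usual motivation for separating them.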