TorusInfer: Technical Analysis and Practice of a Modular Large Language Model Inference Engine

TorusInfer is an open-source modular LLM inference engine that supports advanced features like PagedAttention, continuous batching, prefix caching, and pipeline parallelism. It is compatible with the OpenAI API format and provides a high-performance solution for large-scale language model deployment.

Tags: LLM inference, large language model inference engine, PagedAttention, continuous batching, pipeline parallelism, OpenAI API, model deployment
Published 2026-04-08 21:15 · Recent activity 2026-04-08 21:21 · Estimated read 8 min

Section 01

[Introduction] TorusInfer: Core Analysis of a High-Performance Modular LLM Inference Engine

TorusInfer is an open-source modular LLM inference engine implemented with a C++ core. It supports advanced features such as PagedAttention, continuous batching, prefix caching, and pipeline parallelism. Compatible with the OpenAI API format, it provides a high-performance solution for large-scale language model deployment, addressing bottlenecks in inference performance and deployment efficiency.


Section 02

Project Background and Positioning

With the rapid growth of LLM applications, inference performance and deployment efficiency have become key bottlenecks for putting models into production. As an open-source modular inference engine, TorusInfer aims to provide a high-performance, scalable, and easy-to-deploy solution, supporting flexible deployment modes from single-GPU to multi-GPU. Its core value lies in the throughput and latency gains delivered by its optimizations, while keeping migration costs low.


Section 03

Core Technical Architecture and Optimization Methods

Modular Layer Design

  • Easy to extend: New model architectures can be integrated quickly
  • Fine-grained optimization: Each layer is independently tuned to adapt to hardware
  • Debug-friendly: Intuitive structure for easy problem localization

PagedAttention Memory Management

Inspired by virtual-memory paging, PagedAttention divides the KV cache into fixed-size blocks (16 tokens by default) that are allocated and released dynamically. This improves memory utilization and enables dynamic batching and longer contexts.
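As a rough illustration of the paging idea, a block allocator can be sketched as below. This is a minimal Python sketch, not TorusInfer's actual C++ implementation; the pool size and request IDs are invented, and only the 16-token block size comes from the article.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (TorusInfer's default)

class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a shared pool."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of block ids
        self.token_counts = {}   # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Reserve room for one more token; grab a block on each boundary."""
        n = self.token_counts.get(request_id, 0)
        if n % BLOCK_SIZE == 0:  # crossed a block boundary
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            self.block_tables.setdefault(request_id, []).append(
                self.free_blocks.pop())
        self.token_counts[request_id] = n + 1

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```

Because blocks are freed the moment a request finishes, memory fragments far less than with contiguous per-request KV buffers, which is what makes dynamic batching and longer contexts practical.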

Continuous Batching

New request prompts are processed in parallel during the prefill phase, and completed requests are dynamically replaced during the decode phase to keep GPU utilization high. Batching behavior is tuned via max_prefill_batch_size and max_decode_batch_size.
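The decode-phase replacement logic can be sketched as a single scheduler tick; this is an assumed simplification of what TorusInfer does internally, with invented request records:

```python
from collections import deque

def continuous_batching_step(running, waiting, max_decode_batch_size):
    """One scheduler tick: retire finished requests, backfill from the queue.

    `running` is the current decode batch; `waiting` is a FIFO of admitted
    requests. Each request is a dict with at least a `done` flag (invented
    shape for illustration).
    """
    running = [r for r in running if not r["done"]]        # retire finished
    while waiting and len(running) < max_decode_batch_size:
        running.append(waiting.popleft())                  # admit new work
    return running
```

Because slots freed by finished sequences are refilled every step rather than waiting for the whole batch to drain, the GPU stays busy even when requests have very different output lengths.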

Prefix Caching

Automatically detects KV cache shared by common prompt prefixes and reuses it, evicting entries with an LRU policy. This reduces first-token latency and is well suited to dialogue systems and RAG applications.
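An LRU prefix cache of this kind can be sketched with an ordered map from a hashed token prefix to its cached KV handle. The class, capacity, and key scheme below are assumptions for illustration, not TorusInfer's actual data structure:

```python
from collections import OrderedDict

class PrefixCache:
    """Maps a hashed token prefix to its (notional) cached KV blocks."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as LRU order

    def lookup(self, prefix_tokens):
        key = hash(tuple(prefix_tokens))
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        return None                        # cache miss: must prefill

    def insert(self, prefix_tokens, kv_handle):
        key = hash(tuple(prefix_tokens))
        self.entries[key] = kv_handle
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

A hit means the shared prefix's KV cache is reused instead of recomputed, which is why chat systems (fixed system prompts) and RAG pipelines (repeated retrieved context) see lower time-to-first-token.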

Pipeline Parallelism

Distributes model layers across multiple GPUs and supports horizontal scaling through configuration parameters such as world_size and pipeline_rank.
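One common way to derive each stage's layer slice from world_size and pipeline_rank is an even contiguous split; the helper below is an assumed sketch of that scheme, not TorusInfer's documented partitioning rule:

```python
def stage_layer_range(num_layers, world_size, pipeline_rank):
    """Contiguous [start, end) layer slice owned by one pipeline stage.

    Splits layers as evenly as possible; the first `num_layers % world_size`
    stages each take one extra layer.
    """
    per_stage, remainder = divmod(num_layers, world_size)
    start = pipeline_rank * per_stage + min(pipeline_rank, remainder)
    end = start + per_stage + (1 if pipeline_rank < remainder else 0)
    return start, end
```

For a 32-layer model on 4 GPUs this yields stages (0, 8), (8, 16), (16, 24), (24, 32), with activations flowing from each stage to the next.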


Section 04

Deployment Modes and Configuration Guide

Single Worker Mode

Suitable for scenarios with sufficient VRAM. Configuration includes parameters such as max_decode_batch_size, max_prefill_batch_size, and total_cache_size. Startup involves launching the Worker service and then the Scheduler service.
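A single-worker configuration might look like the following sketch. Only the three parameter names come from the article; the values, units, and the idea of expressing the config as a Python dict are assumptions:

```python
# Hypothetical single-worker configuration sketch (names from the article,
# values and units assumed).
single_worker_config = {
    "max_prefill_batch_size": 8,       # prompts prefilled together
    "max_decode_batch_size": 8,        # sequences decoded per step
    "total_cache_size": 4 * 1024**3,   # KV-cache budget (unit assumed: bytes)
}
# Startup order per the article: Worker service first, then Scheduler service.
```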

Multi-Worker Mode

Supports large models via pipeline parallelism, with each worker responsible for a subset of layers. Configure stage_start_layer and stage_end_layer to define each worker's layer range. Startup involves launching the workers in sequence, then the scheduler.
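A two-worker split and a sanity check on it can be sketched as follows. The parameter names stage_start_layer, stage_end_layer, world_size, and pipeline_rank come from the article; the 32-layer depth, dict layout, and checker function are assumptions:

```python
NUM_LAYERS = 32  # assumed model depth for illustration

workers = [
    {"pipeline_rank": 0, "world_size": 2,
     "stage_start_layer": 0,  "stage_end_layer": 16},
    {"pipeline_rank": 1, "world_size": 2,
     "stage_start_layer": 16, "stage_end_layer": 32},
]

def stages_cover_model(workers, num_layers):
    """True if the stages are contiguous and cover every layer exactly once."""
    ordered = sorted(workers, key=lambda w: w["pipeline_rank"])
    if ordered[0]["stage_start_layer"] != 0:
        return False
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev["stage_end_layer"] != nxt["stage_start_layer"]:
            return False
    return ordered[-1]["stage_end_layer"] == num_layers
```

Validating the layer ranges before launch catches the most common multi-worker misconfiguration: gaps or overlaps between adjacent stages.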


Section 05

Performance and Benchmark Results

Tested with the Qwen2.5-7B-Instruct model:

Impact of Batch Size

Configuration   Throughput (req/s)   Average Latency (ms)   P95 Latency (ms)
batch=1         0.05                 150269                 177685
batch=4         0.13                 60712                  78065
batch=8         0.13                 54692                  56917
batch=16        0.22                 140990                 146044

Key Metrics

  • TTFT: Time To First Token
  • TPOT: Average Time Per Output Token
  • ITL: Inter-Token Latency

Example (Sequence 1): Latency = 8819 ms, ITL = 152 ms, TPOT = 152 ms, TTFT = 975 ms
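One common way these metrics relate (an assumed definition, not necessarily the exact formula TorusInfer's benchmark uses) is that TPOT averages the per-token time after the first token:

```python
def tpot_ms(total_latency_ms, ttft_ms, num_output_tokens):
    """Average time per output token, excluding the first (TTFT) token.

    Assumes the common definition TPOT = (latency - TTFT) / (tokens - 1);
    the numbers below are illustrative, not from the benchmark.
    """
    return (total_latency_ms - ttft_ms) / (num_output_tokens - 1)
```

For example, a 1000 ms request with a 200 ms TTFT and 9 output tokens averages 100 ms per subsequent token; when token arrival is steady, ITL and TPOT coincide, as in the Sequence 1 example above.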

Section 06

OpenAI API Compatibility and Application Scenarios

API Compatibility

Implements the /v1/chat/completions endpoint. Request and response formats are fully compatible with the OpenAI API, supporting seamless migration of existing applications.
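A client call can therefore be built exactly as for the OpenAI API; only the /v1/chat/completions path is stated in the article, while the host, port, and model name below are assumptions. Using only the standard library:

```python
import json
import urllib.request

def build_chat_request(prompt,
                       base_url="http://localhost:8000",   # assumed host:port
                       model="Qwen2.5-7B-Instruct"):       # assumed model id
    """Build an OpenAI-format chat completion request for a TorusInfer server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send: urllib.request.urlopen(build_chat_request("Hello")), then read
# choices[0]["message"]["content"] from the JSON response, as with OpenAI's API.
```

Because the request and response shapes match OpenAI's, existing OpenAI SDK clients should also work by pointing their base URL at the TorusInfer server.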

Application Scenarios

  • Dialogue Systems: Enable prefix caching; batch size 4-8 balances latency and throughput
  • Bulk Text Generation: Increase batch size to maximize throughput
  • Multi-Card Deployment for Large Models: Distribute models via pipeline parallelism; note network bandwidth requirements

Section 07

Technical Challenges and Future Directions

Current Challenges:

  • Efficient management of KV cache for long contexts
  • Optimized support for heterogeneous hardware (AMD, Intel)
  • Precision-performance trade-off in quantization and compression
  • Integration of speculative decoding technology

TorusInfer's modular architecture provides a solid foundation for future features.


Section 08

Summary and Practical Recommendations

TorusInfer is a fully-featured LLM inference engine that achieves both high performance and compatibility through its core technologies, and it suits deployment scenarios from single-GPU to multi-GPU. Teams building self-hosted LLM services are encouraged to study it in depth and to use its clear architecture and documentation to migrate smoothly to production environments.