Zing Forum

Agent-Infer: A High-Performance LLM Inference Engine Written Purely in Rust, Reducing First Token Latency by 4.6x

Agent-Infer is a large language model inference engine entirely written in Rust, without the need for Python glue code. Optimized via CUDA Graph and FlashInfer, it is 4.6x faster than SGLang in first token latency and supports multi-turn agent tool calls.

Tags: Rust, LLM, inference, CUDA, GPU, Qwen, performance, agent
Published 2026-04-01 16:12 · Recent activity 2026-04-01 16:23 · Estimated read 4 min
Section 01

Introduction / Main Floor

Section 02

Performance Breakthrough: 4.6x Reduction in First Token Latency

Agent-Infer's headline result is its performance. In comparative tests against SGLang v0.5.9, Agent-Infer shows significant advantages:

| Metric | Agent-Infer | SGLang | Improvement |
|---|---|---|---|
| TTFT (C=1) | 8.6ms | 39.3ms | 4.6x faster |
| Throughput (C=1) | 119.5 tok/s | 121.0 tok/s | Same |
| Throughput (C=8) | 811 tok/s | 886 tok/s | 0.92x |

TTFT (Time To First Token) is a key user-experience metric: the wait between the moment a user sends a request and the arrival of the first token. By eliminating Python scheduling overhead with its Rust runtime and decoding via CUDA Graphs, Agent-Infer compresses this latency from nearly 40ms to under 9ms, an impressive improvement.
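To make the metric concrete, here is a minimal timing sketch of how TTFT and token count can be measured over a token stream. The `measure_stream` function and the iterator-based stream are hypothetical stand-ins for a real streaming client, not Agent-Infer's actual API.

```rust
use std::time::{Duration, Instant};

/// Measure time-to-first-token (TTFT) and total token count for a
/// token stream. The iterator stands in for a streaming inference
/// call; in a real client this would be an SSE or gRPC stream.
fn measure_stream<I: Iterator<Item = u32>>(stream: I) -> (Duration, usize) {
    let start = Instant::now();
    let mut ttft = Duration::ZERO;
    let mut count = 0;
    for _tok in stream {
        if count == 0 {
            ttft = start.elapsed(); // latency until the first token arrives
        }
        count += 1;
    }
    (ttft, count)
}
```

Throughput (tok/s) would then be the token count divided by total elapsed time, which is why TTFT and throughput can diverge the way they do in the table above.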

Section 03

Architecture Design: A Layered and Decoupled Modular System

Agent-Infer's architecture is divided into three distinct layers:

Section 04

1. Agent Layer (Rust Binary)

Responsible for agent logic, including ChatML format processing, tool-call parsing, and execution loops. This layer implements the multi-turn generate-parse-execute loop and supports shell and Python code execution.
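The generate-parse-execute loop can be sketched as follows. The `Step` enum, the `<tool_call>` tag format, and the closures are illustrative assumptions for the sketch; Agent-Infer's real ChatML parsing and tool executors are more involved.

```rust
/// One parsed model turn: either a tool invocation or a final answer.
enum Step {
    ToolCall { name: String, args: String },
    Final(String),
}

/// Parse a completion: a `<tool_call>name: args</tool_call>` span means
/// "run a tool"; anything else is treated as the final answer.
fn parse(completion: &str) -> Step {
    let (open, close) = ("<tool_call>", "</tool_call>");
    if let (Some(s), Some(e)) = (completion.find(open), completion.find(close)) {
        let body = &completion[s + open.len()..e];
        let (name, args) = body.split_once(':').unwrap_or((body, ""));
        Step::ToolCall {
            name: name.trim().to_string(),
            args: args.trim().to_string(),
        }
    } else {
        Step::Final(completion.to_string())
    }
}

/// Drive the loop: generate, parse, execute, feed the tool result back
/// into the transcript, until the model emits a final answer (or we hit
/// a turn limit).
fn agent_loop(
    mut generate: impl FnMut(&str) -> String,
    mut execute: impl FnMut(&str, &str) -> String,
) -> String {
    let mut transcript = String::new();
    for _ in 0..8 {
        match parse(&generate(&transcript)) {
            Step::ToolCall { name, args } => {
                let result = execute(&name, &args);
                transcript.push_str(&format!("<tool_result>{result}</tool_result>"));
            }
            Step::Final(answer) => return answer,
        }
    }
    transcript
}
```

The key design point this models is that tool output is appended back into the conversation, so the next generation step sees the execution result.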

Section 05

2. Infer Engine Layer (Rust Library)

Core inference engine, including:

  • HTTP Service Layer: Based on the Axum framework, providing OpenAI-compatible REST APIs
  • Scheduler: Implements continuous batching, prioritizes decode requests, and supports chunked prefill
  • Model Implementation: Currently supports Qwen3 and Qwen3.5 (including mixed attention architecture)
  • KV Cache Management: Paged block management, prefix cache (Radix Tree), CPU offloading
  • Sampler: Supports top-k, top-p, min-p, temperature adjustment, repetition penalty, etc.
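As an example of one of these components, here is a minimal sketch of a top-k, temperature-scaled sampler (top-p, min-p, and repetition penalty are omitted for brevity). The function name and signature are assumptions for illustration; passing the uniform random draw `u` in as a parameter keeps the function deterministic and testable.

```rust
/// Top-k + temperature sampling sketch. `u` is a uniform random draw
/// in [0, 1) supplied by the caller.
fn sample_top_k(logits: &[f32], temperature: f32, top_k: usize, u: f32) -> usize {
    // Rank token ids by logit, keep only the k best.
    let mut ids: Vec<usize> = (0..logits.len()).collect();
    ids.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    ids.truncate(top_k.max(1));

    // Temperature-scaled softmax over the surviving logits
    // (subtracting the max logit for numerical stability).
    let t = temperature.max(1e-6);
    let max = logits[ids[0]];
    let weights: Vec<f32> = ids.iter().map(|&i| ((logits[i] - max) / t).exp()).collect();
    let total: f32 = weights.iter().sum();

    // Invert the CDF at u to pick a token.
    let mut acc = 0.0;
    for (w, &i) in weights.iter().zip(&ids) {
        acc += w / total;
        if u < acc {
            return i;
        }
    }
    ids[ids.len() - 1] // guard against rounding when u is close to 1.0
}
```

Lower temperatures sharpen the distribution toward the top logit; `top_k = 1` degenerates to greedy decoding.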

Section 06

3. CUDA Kernel Layer

Directly calls CUDA kernels such as FlashAttention-2, RMSNorm, and GEMM/GEMV, achieving high performance through Triton AOT compilation and native CUDA C implementations.
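For reference, RMSNorm computes y_i = x_i / sqrt(mean(x²) + eps) · w_i. The scalar sketch below just pins down that math; the actual CUDA kernel fuses the reduction and the scaling into a single GPU pass.

```rust
/// Reference RMSNorm: normalize by the root-mean-square of the input,
/// then apply a learned per-channel weight. The GPU kernel computes
/// the same thing in one fused pass over the hidden dimension.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}
```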

Section 07

Core Technical Optimizations

Agent-Infer's performance advantages come from a series of carefully designed optimization strategies:

Section 08

CUDA Graph Decoding

The traditional decoding process requires multiple CPU-to-GPU launch calls per step (36 layers × ~14 kernels = 504 launches). Agent-Infer captures these operations into CUDA Graphs—one graph per batch size, captured on the first call, and directly replayed in subsequent calls, completely eliminating CPU-to-GPU scheduling overhead.
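The capture-once, replay-many pattern can be sketched in plain Rust. The `GraphCache` type and the string "kernels" below are illustrative stand-ins: a real engine records actual GPU work via CUDA stream capture (`cudaStreamBeginCapture` / `cudaGraphLaunch`) rather than building lists of names.

```rust
use std::collections::HashMap;

/// Model of the CUDA Graph caching pattern: one captured "graph" per
/// batch size, built on first use and replayed on every later call.
struct GraphCache {
    graphs: HashMap<usize, Vec<String>>, // batch size -> captured launch sequence
    captures: usize,                     // how many times we paid capture cost
}

impl GraphCache {
    fn new() -> Self {
        Self { graphs: HashMap::new(), captures: 0 }
    }

    /// First call for a batch size captures the launch sequence;
    /// every later call replays it with no per-kernel CPU work.
    fn decode_step(&mut self, batch: usize) -> usize {
        if !self.graphs.contains_key(&batch) {
            // Capture path: record every launch once for this batch size.
            // 36 layers x ~14 kernels each = ~504 launches captured.
            self.captures += 1;
            let graph = (0..36 * 14).map(|i| format!("kernel_{i}")).collect();
            self.graphs.insert(batch, graph);
        }
        // Replay path: a single graph launch covers all recorded kernels.
        self.graphs[&batch].len()
    }
}
```

Keying the cache by batch size matters because a CUDA Graph bakes in tensor shapes at capture time, so each distinct decode batch size needs its own graph.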