# Agent-Infer: A High-Performance LLM Inference Engine Written Purely in Rust, Reducing First Token Latency by 4.6x

> Agent-Infer is a large language model inference engine entirely written in Rust, without the need for Python glue code. Optimized via CUDA Graph and FlashInfer, it is 4.6x faster than SGLang in first token latency and supports multi-turn agent tool calls.

- 板块: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- 发布时间: 2026-04-01T08:12:50.000Z
- 最近活动: 2026-04-01T08:23:04.743Z
- 热度: 159.8
- 关键词: Rust, LLM, inference, CUDA, GPU, Qwen, performance, agent
- 页面链接: https://www.zingnex.cn/en/forum/thread/agent-infer-rust-llm-token-4-6
- Canonical: https://www.zingnex.cn/forum/thread/agent-infer-rust-llm-token-4-6
- Markdown 来源: floors_fallback

---

## Introduction / Main Floor: Agent-Infer: A High-Performance LLM Inference Engine Written Purely in Rust, Reducing First Token Latency by 4.6x

Agent-Infer is a large language model inference engine entirely written in Rust, without the need for Python glue code. Optimized via CUDA Graph and FlashInfer, it is 4.6x faster than SGLang in first token latency and supports multi-turn agent tool calls.

## Performance Breakthrough: 4.6x Reduction in First Token Latency

Agent-Infer's most notable achievement comes from its outstanding performance. In comparative tests with SGLang v0.5.9, Agent-Infer shows significant advantages:

| Metric | Agent-Infer | SGLang | Improvement |
|--------|-------------|--------|-------------|
| TTFT (C=1) | 8.6ms | 39.3ms | **4.6x faster** |
| Throughput (C=1) | 119.5 tok/s | 121.0 tok/s | Same |
| Throughput (C=8) | 811 tok/s | 886 tok/s | 0.92x |

TTFT (Time To First Token) is a key metric for user experience—the waiting time from when a user sends a request to receiving the first token. Agent-Infer eliminates Python scheduling overhead through the Rust runtime, combined with CUDA Graph decoding technology, compressing this latency from nearly 40ms to less than 9ms, an impressive improvement.

## Architecture Design: A Layered and Decoupled Modular System

Agent-Infer's architecture is divided into three distinct layers:

## 1. Agent Layer (Rust Binary)

Responsible for agent logic, including ChatML format processing, tool call parsing, and execution loops. This layer implements the generate-parse-execute loop in multi-turn conversations and supports shell and Python code execution.

## 2. Infer Engine Layer (Rust Library)

Core inference engine, including:

- **HTTP Service Layer**: Based on the Axum framework, providing OpenAI-compatible REST APIs
- **Scheduler**: Implements continuous batching, prioritizes decoding requests, supports chunked pre-filling
- **Model Implementation**: Currently supports Qwen3 and Qwen3.5 (including mixed attention architecture)
- **KV Cache Management**: Paged block management, prefix cache (Radix Tree), CPU offloading
- **Sampler**: Supports top-k, top-p, min-p, temperature adjustment, repetition penalty, etc.

## 3. CUDA Kernel Layer

Directly calls CUDA kernels such as FlashAttention-2, RMSNorm, GEMM/GEMV, achieving high-performance computing through Triton AOT compilation and native CUDA C implementation.

## Core Technical Optimizations

Agent-Infer's performance advantages come from a series of carefully designed optimization strategies:

## CUDA Graph Decoding

The traditional decoding process requires multiple CPU-to-GPU launch calls per step (36 layers × ~14 kernels = 504 launches). Agent-Infer captures these operations into CUDA Graphs—one graph per batch size, captured on the first call, and directly replayed in subsequent calls, completely eliminating CPU-to-GPU scheduling overhead.
