LLM-Emu: LLM Inference Native Runtime Simulation Based on Profiling Sampling

LLM-Emu is a service-native simulator for vLLM that retains production-grade HTTP, scheduling, KV cache, and output processing paths, replacing only GPU forward execution with profiling-sampled latency and synthetic output tokens. Across various GPUs, models, and workloads, time-per-output-token (TPOT) and inter-token latency (ITL) errors stay within 4.8%, end-to-end latency error stays within 5.3%, and output throughput error is only 1.9%.

Tags: LLM Serving · vLLM · System Simulation · Performance Evaluation · GPU Inference · Service Optimization · Load Testing · Open-Source Tools
Published 2026-05-01 20:35 · Recent activity 2026-05-04 10:57 · Estimated read 5 min

Section 01

[Introduction] LLM-Emu: A Service-Native LLM Inference Simulator

LLM-Emu is a service-native simulator for vLLM. Its core innovation lies in retaining production-grade HTTP service layers, schedulers, KV cache management, and output processing paths, while replacing GPU forward execution with profiling-sampled latency and synthetic output tokens. Across various GPUs, models, and workloads, TPOT and ITL errors are ≤4.8%, end-to-end latency error ≤5.3%, and output throughput error only 1.9%, providing a low-cost, high-fidelity experimental tool for LLM service system research.
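
Because the HTTP service layer is the stock vLLM one, a client should not be able to tell the simulator from a real deployment. The snippet below is a hypothetical smoke test against an OpenAI-compatible /v1/completions endpoint; the port, model name, and launch details are assumptions, not documented LLM-Emu defaults.

```python
# Hypothetical smoke test: since LLM-Emu keeps vLLM's HTTP layer, a plain
# OpenAI-compatible completion request should work unchanged. The port and
# model name below are assumptions.
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed vLLM-style server address

payload = {
    "model": "placeholder-model",  # whatever model the simulator was profiled for
    "prompt": "Explain continuous batching in one sentence.",
    "max_tokens": 64,
}

start = time.perf_counter()
resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
resp.raise_for_status()
elapsed = time.perf_counter() - start

print(f"end-to-end latency: {elapsed:.3f} s")
print("synthetic output:", resp.json()["choices"][0]["text"][:80])
```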


Section 02

Background: Cost Dilemmas in LLM Service Evaluation and Limitations of Existing Simulators

Evaluating LLM services requires accounting for complex factors such as online workloads and dynamic request arrivals, but real GPU experiments are costly, which leads to limited iteration, difficulty testing extreme scenarios, and high hardware barriers. Existing simulators typically run offline, rewrite the scheduler, or rely on precise operator-level models, all of which can drift away from actual production behavior.


Section 03

Design Philosophy and Technical Implementation: Key to Service-Native Simulation

Design Philosophy

LLM-Emu adopts a "service-native" strategy: it retains vLLM's production-grade components (HTTP layer, continuous-batching scheduler, KV cache, output processing) and replaces only GPU execution, so real system behaviors such as scheduling decisions and KV cache dynamics are captured directly.

Technical Implementation

  1. Profiling Sampling: Offline recording of latency on target GPUs → building latency models → runtime sampling to simulate GPU execution;
  2. Synthetic Output Tokens: Generating synthetic tokens consistent with real token formats and supporting streaming (a minimal sketch of steps 1 and 2 follows this list);
  3. vLLM Integration: Plug-in replacement of GPU modules for minimal intrusion and version compatibility.
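
To make steps 1 and 2 concrete, here is a rough sketch (not the actual LLM-Emu code) of a latency model built from an offline profile plus a stand-in forward step that sleeps for a sampled latency and emits synthetic token ids. The profile file format, class names, and bucketing-by-batch-size scheme are assumptions.

```python
# Rough sketch of profiling-sampled execution (assumed structure, not LLM-Emu's
# actual implementation): sample a step latency from an offline profile and
# return synthetic tokens instead of running a GPU forward pass.
import json
import random
import time


class ProfiledLatencyModel:
    """Latency model built from latencies recorded offline on the target GPU.

    Assumes a JSON profile like:
      {"prefill": {"1": [0.041, ...], "8": [...]},
       "decode":  {"1": [0.011, ...], "8": [...]}}
    keyed by phase and batch size.
    """

    def __init__(self, profile_path: str):
        with open(profile_path) as f:
            self.profile = json.load(f)

    def sample(self, phase: str, batch_size: int) -> float:
        buckets = self.profile[phase]
        # use the profiled batch size closest to the requested one
        key = min(buckets, key=lambda k: abs(int(k) - batch_size))
        return random.choice(buckets[key])


def fake_forward_step(latency_model: ProfiledLatencyModel,
                      phase: str, batch_size: int,
                      vocab_size: int = 32000) -> list[int]:
    """Stand-in for the GPU forward pass: sleep for a sampled latency, then
    return one synthetic token id per sequence in the batch."""
    time.sleep(latency_model.sample(phase, batch_size))
    return [random.randrange(vocab_size) for _ in range(batch_size)]


# usage sketch (profile path is hypothetical)
# lat = ProfiledLatencyModel("a100_llama8b_profile.json")
# next_tokens = fake_forward_step(lat, phase="decode", batch_size=8)
```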

Section 04

Experimental Validation: Accuracy of LLM-Emu

Test Matrix

Covers the following configurations:

  • GPUs: 2 architectures;
  • Models: 4 variants across 2 model families;
  • Attention backends: 2;
  • Workloads: Poisson arrivals plus ShareGPT burst loads (an arrival-trace sketch follows this list).
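
The Poisson part of the workload mix can be reproduced with exponential inter-arrival gaps; the generator below is a generic illustration, not the paper's actual load generator.

```python
# Generic Poisson arrival trace (illustrative, not the paper's load generator):
# inter-arrival gaps are exponentially distributed with mean 1/rate.
import random


def poisson_arrival_times(rate_rps: float, duration_s: float,
                          seed: int = 0) -> list[float]:
    """Return request arrival timestamps (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate_rps)  # exponential gap, mean 1/rate
        if t > duration_s:
            return times
        times.append(t)


arrivals = poisson_arrival_times(rate_rps=4.0, duration_s=60.0)
print(f"{len(arrivals)} requests over 60 s (expected about 240)")
```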

Accuracy Metrics

  • TPOT/ITL error ≤4.8%;
  • End-to-end latency error ≤5.3%;
  • Throughput error: 1.9%;
  • Maximum TTFT (time to first token) error: 10.4%, due to sensitivity to queue state (the standard metric definitions are sketched below).
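
The post does not spell out how these metrics are computed; the definitions below are the usual ones in LLM-serving benchmarking, shown together with the relative error used when comparing simulator and real-GPU runs.

```python
# Common serving-metric definitions (standard usage, not quoted from the paper)
# plus the relative error used to compare simulator vs. real-GPU runs.
def tpot(e2e_latency_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token: decode time averaged over tokens after the first."""
    return (e2e_latency_s - ttft_s) / max(output_tokens - 1, 1)


def relative_error(simulated: float, measured: float) -> float:
    return abs(simulated - measured) / measured


# e.g. a 4.2 s request with 0.35 s TTFT and 120 output tokens
print(f"TPOT ≈ {tpot(4.2, 0.35, 120) * 1000:.1f} ms/token")      # ≈ 32.4 ms
print(f"relative error ≈ {relative_error(0.0332, 0.0323):.1%}")  # ≈ 2.8%
```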

Section 05

Application Scenarios: Practical Value of LLM-Emu

  1. Scheduling Strategy Research: Rapid iteration of scheduling algorithms;
  2. Capacity Planning: Predicting the impact of hardware configurations and workloads;
  3. Extreme Scenario Testing: Security testing for DDoS/burst traffic;
  4. Multivariate Analysis: Completing parameter exploration in hours that would take weeks on real GPUs (a rate-sweep sketch follows this list).
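
A sweep like the one in item 4 could look like the sketch below: replay the same request pattern at several arrival rates against a running simulator and record mean latency. The endpoint, pacing, and result handling are assumptions, not the paper's evaluation harness.

```python
# Hypothetical request-rate sweep against a running LLM-Emu endpoint
# (assumed URL and model name; not the paper's evaluation harness).
import statistics
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed simulator address


def run_at_rate(rate_rps: float, n_requests: int = 50) -> float:
    """Send n_requests at roughly rate_rps and return mean end-to-end latency."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        resp = requests.post(
            f"{BASE_URL}/v1/completions",
            json={"model": "placeholder-model", "prompt": "hello", "max_tokens": 32},
            timeout=120,
        )
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)
        time.sleep(1.0 / rate_rps)  # crude pacing between sequential requests
    return statistics.mean(latencies)


for rate in (1.0, 2.0, 4.0, 8.0):
    print(f"{rate:>4.1f} req/s -> mean latency {run_at_rate(rate):.3f} s")
```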

Section 06

Limitations and Future Directions

Current Limitations

  • TTFT accuracy needs improvement;
  • Only adapted to vLLM;
  • New models require pre-profiling;
  • GPU-specific optimizations not fully modeled.

Future Directions

  • Adaptive latency models;
  • Multi-GPU simulation;
  • Heterogeneous hardware support;
  • Online learning to improve models.

Section 07

Insights and Conclusion: Balance Between Simulation and Reality

Insights

  • Balance fidelity and cost: LLM-Emu retains key components and replaces GPU execution;
  • Open-source value: Lowers research barriers;
  • Simulation complements reality: Accelerates iteration but requires real GPU validation.

Conclusion

By removing the need for GPUs at experiment time (beyond one-time profiling on the target hardware), LLM-Emu accelerates LLM service innovation and is a practical tool for both research and engineering.