# LLM-Emu: Service-Native Runtime Simulation of LLM Inference via Profiling Sampling

> LLM-Emu is a service-native simulator for vLLM: it retains production-grade HTTP, scheduling, KV cache, and output processing paths, replacing only GPU forward execution with profiling-sampled latencies and synthetic output tokens. Across various GPUs, models, and workloads, TPOT and ITL errors stay within 4.8%, end-to-end latency error within 5.3%, and output throughput error within 1.9%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-01T12:35:21.000Z
- Last activity: 2026-05-04T02:57:13.725Z
- Heat score: 88.6
- Keywords: LLM serving, vLLM, system simulation, performance evaluation, GPU inference, serving optimization, load testing, open-source tools
- Page link: https://www.zingnex.cn/en/forum/thread/llm-emu-llm
- Canonical: https://www.zingnex.cn/forum/thread/llm-emu-llm

---

## Introduction: LLM-Emu, a Service-Native LLM Inference Simulator

LLM-Emu is a service-native simulator for vLLM. Its core idea is to retain the production-grade HTTP service layer, scheduler, KV cache management, and output processing paths while replacing GPU forward execution with profiling-sampled latencies and synthetic output tokens. Across various GPUs, models, and workloads, TPOT and ITL errors are ≤4.8%, end-to-end latency error is ≤5.3%, and output throughput error is only 1.9%, making it a low-cost, high-fidelity experimental tool for LLM serving systems research.

## Background: Cost Dilemmas in LLM Service Evaluation and Limitations of Existing Simulators

Evaluating LLM serving systems means accounting for complex factors such as online workloads and dynamic request arrivals, but experiments on real GPUs are expensive: iteration counts stay low, extreme scenarios are hard to exercise, and the hardware barrier is high. Existing simulators fall short in other ways: they run offline, rewrite the scheduler, or rely on precise operator-level models, and so drift easily from actual production behavior.

## Design Philosophy and Technical Implementation: Key to Service-Native Simulation

### Design Philosophy
LLM-Emu adopts a "service-native" strategy: it retains vLLM's production-grade components (HTTP layer, continuous-batching scheduler, KV cache, output processing) and replaces only GPU execution, so real system behaviors such as scheduling decisions and KV cache dynamics are still captured.

### Technical Implementation
1. **Profiling Sampling**: step latencies are recorded offline on the target GPU, fitted into latency models, and sampled at runtime in place of GPU execution (see the sketch after this list);
2. **Synthetic Output Tokens**: generated tokens match real token formats and support streaming delivery;
3. **vLLM Integration**: GPU modules are swapped out through a plug-in mechanism, keeping the change minimally intrusive and version-compatible.
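
What follows is a minimal Python sketch of the sample-and-synthesize loop under simple assumptions (a bucketed empirical latency distribution keyed by batched-token count). The names `ProfiledLatencyModel` and `emulated_decode_step` are hypothetical, not LLM-Emu's actual API; in a real integration this step would sit behind vLLM's model-runner interface so the scheduler, KV cache manager, and streaming output path run unmodified.

```python
import random
import time
from collections import defaultdict


class ProfiledLatencyModel:
    """Bucket offline-recorded step latencies by batched-token count,
    then sample from the empirical distribution at runtime."""

    def __init__(self, bucket_size: int = 256):
        self.bucket_size = bucket_size
        self.samples = defaultdict(list)  # bucket index -> [latency_s, ...]

    def record(self, num_batched_tokens: int, latency_s: float) -> None:
        # Offline profiling phase: log one observed forward-step latency.
        self.samples[num_batched_tokens // self.bucket_size].append(latency_s)

    def sample(self, num_batched_tokens: int) -> float:
        # Runtime phase: draw from the matching bucket, or fall back to the
        # nearest populated bucket (assumes at least one profile is loaded).
        key = num_batched_tokens // self.bucket_size
        if key not in self.samples:
            key = min(self.samples, key=lambda k: abs(k - key))
        return random.choice(self.samples[key])


def emulated_decode_step(latency_model, sequences, eos_id=2, max_len=128):
    """Stand-in for the GPU forward pass: block for a sampled latency,
    then append one synthetic token to every running sequence."""
    time.sleep(latency_model.sample(len(sequences)))  # decode: 1 token/seq
    for seq in sequences:
        finished = len(seq) + 1 >= max_len
        # Token *content* is irrelevant to scheduler/KV-cache behavior,
        # so any well-formed token id works; force EOS at the length cap.
        seq.append(eos_id if finished else random.randint(3, 32000))


if __name__ == "__main__":
    model = ProfiledLatencyModel()
    # Load offline profiles (fabricated numbers, purely for illustration).
    for tokens, lat in [(8, 0.021), (8, 0.023), (512, 0.045)]:
        model.record(tokens, lat)
    seqs = [[1], [1]]  # two running sequences, BOS token only
    emulated_decode_step(model, seqs)
    print(seqs)
```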

## Experimental Validation: Accuracy Performance of LLM-Emu

### Test Matrix
The evaluation sweeps five dimensions:

| Dimension | Coverage |
|---|---|
| GPU architectures | 2 |
| Model families | 2 |
| Model variants | 4 |
| Attention backends | 2 |
| Workloads | Poisson arrivals; bursty ShareGPT traces |

### Accuracy Metrics
- TPOT/ITL error ≤4.8%;
- End-to-end latency error ≤5.3%;
- Throughput error: 1.9%;
- Maximum TTFT error: 10.4% (TTFT is sensitive to queue state; see the decomposition below).
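
One way to see why TTFT is the hardest metric to match (a standard decomposition, not quoted from the source):

$$
\mathrm{TTFT} = T_{\mathrm{queue}} + T_{\mathrm{prefill}}, \qquad
\mathrm{TPOT} = \frac{1}{n-1}\sum_{i=2}^{n} t_i
$$

Any mismatch in the simulated queue state lands directly in $T_{\mathrm{queue}}$, whereas TPOT averages per-step sampling noise over $n-1$ decode steps, so the steady-state metrics are naturally more forgiving.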

## Application Scenarios: Practical Value of LLM-Emu

1. **Scheduling Strategy Research**: rapid iteration on scheduling algorithms without burning GPU hours;
2. **Capacity Planning**: predicting the impact of hardware configurations and workload mixes;
3. **Extreme Scenario Testing**: safely stress-testing DDoS-like and burst traffic (see the load-generator sketch after this list);
4. **Multivariate Analysis**: completing parameter sweeps in hours that would take weeks on real GPUs.
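
Because LLM-Emu keeps vLLM's production HTTP layer, an ordinary open-loop load generator can drive it exactly as it would a real deployment. Below is a minimal Python sketch for the Poisson and burst cases; the endpoint URL, model name, and request rate are placeholders, and it assumes the simulator exposes vLLM's OpenAI-compatible `/v1/completions` route.

```python
import asyncio
import random
import time

import aiohttp

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
MODEL = "placeholder-model"                    # placeholder model name


async def fire(session: aiohttp.ClientSession, prompt: str) -> float:
    """Send one completion request; return end-to-end latency in seconds."""
    t0 = time.perf_counter()
    async with session.post(
        URL, json={"model": MODEL, "prompt": prompt, "max_tokens": 128}
    ) as resp:
        await resp.read()
    return time.perf_counter() - t0


async def poisson_load(rate_rps: float, duration_s: float) -> list[float]:
    """Open-loop Poisson arrivals: exponentially distributed gaps between
    request launches, independent of how fast responses come back."""
    tasks: list[asyncio.Task] = []
    async with aiohttp.ClientSession() as session:
        deadline = time.perf_counter() + duration_s
        while time.perf_counter() < deadline:
            tasks.append(asyncio.create_task(fire(session, "Hello, world")))
            await asyncio.sleep(random.expovariate(rate_rps))
        return list(await asyncio.gather(*tasks))


if __name__ == "__main__":
    # A burst test is the same loop with a temporarily inflated rate.
    latencies = asyncio.run(poisson_load(rate_rps=5.0, duration_s=10.0))
    latencies.sort()
    print(f"{len(latencies)} requests, p50 = {latencies[len(latencies)//2]:.3f}s")
```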

## Limitations and Future Directions

### Current Limitations
- TTFT accuracy needs improvement;
- Currently supports only vLLM;
- New models require pre-profiling;
- GPU-specific optimizations not fully modeled.

### Future Directions
- Adaptive latency models;
- Multi-GPU simulation;
- Heterogeneous hardware support;
- Online learning to improve models.

## Insights and Conclusion: Balance Between Simulation and Reality

### Insights
- Balance fidelity and cost: LLM-Emu retains key components and replaces GPU execution;
- Open-source value: Lowers research barriers;
- Simulation complements reality: Accelerates iteration but requires real GPU validation.

### Conclusion
LLM-Emu removes the GPU from the day-to-day experiment loop (a one-time profiling pass on the target hardware is still required), accelerates LLM serving innovation, and is a powerful tool for both research and engineering.
