
Dooly: A Configuration-Agnostic, Redundancy-Aware Performance Profiling System for LLM Inference Simulation

Dooly marks the source of each input dimension via taint propagation, enabling performance profiling for multiple configurations with a single inference pass. It reduces profiling GPU time across 12 models by 56.4% while keeping simulation error within 5% for TTFT and 8% for TPOT.

LLM Inference Optimization · Performance Profiling · Configuration Simulation · GPU Efficiency · Inference Latency Prediction · Auto-Tuning
Published 2026-05-09 00:44 · Recent activity 2026-05-11 11:54 · Estimated read 5 min

Section 01

[Introduction] An Overview of Dooly: A Configuration-Agnostic Performance Profiling System for LLM Inference Simulation

Dooly is a configuration-agnostic, redundancy-aware performance profiling system for LLM inference simulation. To address the high cost of full re-profiling in traditional simulators, it marks the source of each input dimension via taint propagation, enabling profiling for multiple configurations with a single inference pass. While keeping simulation error within 5% for TTFT (time to first token) and 8% for TPOT (time per output token), it reduces profiling GPU time across 12 models by 56.4%, providing an efficient path to configuration optimization in LLM deployment.


Section 02

Background: Complexity of LLM Inference Configurations and Bottlenecks of Traditional Simulators

In practical LLM deployment, the configuration options (hardware, serving engine, attention backend, model parameters, etc.) are numerous, and the optimal configuration varies with the workload (input sequence length, output length distribution, concurrent request patterns). Traditional simulators require full re-profiling for every configuration change (e.g., a new batch size or attention backend), making configuration exploration extremely expensive: profiling 12 models can take hundreds to thousands of GPU hours.
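
To get a feel for why full re-profiling is expensive, here is a back-of-the-envelope sketch in Python. The configuration axes and the per-run cost are illustrative assumptions, not figures from the paper:

```python
from itertools import product

# Hypothetical configuration axes; the actual space in the paper differs.
gpus = ["A100", "H100"]
attention_backends = ["FlashAttention", "xFormers", "PyTorch-native"]
batch_sizes = [1, 8, 32, 128]
tensor_parallel = [1, 2, 4]

configs = list(product(gpus, attention_backends, batch_sizes, tensor_parallel))
print(f"{len(configs)} configurations per model")  # 2 * 3 * 4 * 3 = 72

# With full re-profiling, total cost grows as
# (#configs x #models x per-run GPU time).
PROFILE_HOURS_PER_RUN = 0.5  # illustrative assumption
models = 12
print(f"~{len(configs) * models * PROFILE_HOURS_PER_RUN:.0f} GPU hours total")
```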


Section 03

Core Mechanisms of Dooly: Configuration-Agnostic and Redundancy-Aware Profiling

Core insight: the input dimensions of LLM operations come from only two sources, the model configuration and the request parameters, and many of these values repeat across configurations. Key mechanisms: 1. Taint propagation: marks the source of each dimension, identifying which profiling results can be reused and which parameters depend on dynamic inputs; 2. Selective profiling with a latency database: reuses existing measurements and profiles only operations not yet recorded (see the sketch below); 3. Stateful operation handling: reuses the serving engine's initialization code to keep the profiling environment consistent.
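
A minimal Python sketch of the first two mechanisms, under the assumption that an operation's latency depends only on its name and input dimensions; the names (`TaintedDim`, `op_key`, `latency_db`) are mine, not Dooly's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    MODEL_CONFIG = "model_config"  # fixed per model (e.g., hidden_size)
    REQUEST = "request"            # varies per request (e.g., seq_len)

@dataclass(frozen=True)
class TaintedDim:
    """A tensor dimension tagged with where its value came from."""
    value: int
    source: Source

def op_key(op_name, dims):
    """Canonical key for an operation: name plus (source, value) per dim.
    Two ops with identical keys have identical latency, so one
    measurement can be reused for both."""
    return (op_name, tuple((d.source, d.value) for d in dims))

latency_db = {}  # op_key -> measured latency (ms)

def profile_if_needed(op_name, dims, run_and_time):
    """Selective profiling: measure only keys not already recorded."""
    key = op_key(op_name, dims)
    if key not in latency_db:
        latency_db[key] = run_and_time()  # single real GPU measurement
    return latency_db[key]

# Example: an attention matmul whose shape mixes both sources.
hidden = TaintedDim(4096, Source.MODEL_CONFIG)
seq_len = TaintedDim(2048, Source.REQUEST)
latency = profile_if_needed("qk_matmul", (seq_len, hidden),
                            run_and_time=lambda: 0.42)  # stub timer
```

Because `hidden` is tagged `MODEL_CONFIG`, any later configuration that changes only request-side parameters can reuse every measurement keyed to the same model-config dimensions.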


Section 04

Performance Evaluation: Dual Breakthroughs in Accuracy and Efficiency

Validated on A100/H100 GPUs, three attention backends (FlashAttention, xFormers, PyTorch native), and 12 models: 1. Simulation accuracy: mean absolute percentage error (MAPE; computed as in the sketch below) ≤5% for TTFT and ≤8% for TPOT; 2. Efficiency: 56.4% less profiling GPU time, owing to operation deduplication, dimension reuse, and an incremental database; 3. Compatibility: the latency database can directly replace the backend of existing simulators, enabling plug-and-play integration.
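
For reference, the accuracy metric is straightforward to compute; the numbers below are illustrative, not from the paper's evaluation:

```python
def mape(predicted, measured):
    """Mean absolute percentage error between simulated and measured latencies."""
    assert len(predicted) == len(measured) and measured
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Illustrative TTFT values in milliseconds.
ttft_pred, ttft_meas = [102.0, 250.0, 480.0], [100.0, 245.0, 500.0]
print(f"TTFT MAPE: {mape(ttft_pred, ttft_meas):.1f}%")  # ~2.7%, within the 5% bound
```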


Section 05

Practical Application Value: Facilitating LLM Deployment Optimization

1. Configuration space exploration: explore a larger configuration space within a reasonable budget; 2. Model selection: quickly evaluate how new models perform on existing infrastructure; 3. Capacity planning: plan hardware capacity accurately; 4. Auto-tuning: integrate into MLOps pipelines for continuous optimization (see the sketch below).
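
As a sketch of the auto-tuning use case: an MLOps pipeline can rank candidate configurations by simulated latency instead of measuring each one on hardware. `simulate_tpot` below is a hypothetical stand-in for a simulator backed by Dooly's latency database, not a real API:

```python
from itertools import product

def simulate_tpot(config):
    """Toy latency model standing in for a Dooly-backed simulator."""
    backend_cost = {"FlashAttention": 1.0, "xFormers": 1.1, "PyTorch-native": 1.3}
    return backend_cost[config["backend"]] * 32 / config["batch_size"]

candidates = [{"backend": b, "batch_size": bs}
              for b, bs in product(["FlashAttention", "xFormers", "PyTorch-native"],
                                   [8, 16, 32])]
best = min(candidates, key=simulate_tpot)  # cheap: no GPU runs in the loop
print("best config:", best)
```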

Section 06

Technical Insights: Structure-Aware System Design Principles

Dooly demonstrates that when facing a complex configuration space, understanding the structure of the data can matter more than adding compute. The same idea extends to hyperparameter search in deep-learning training and to configuration tuning in distributed systems: identify the independent variables and the redundant parameters, and avoid recomputing what is already known.