
Dooly: A Configuration-Agnostic, Redundancy-Aware Performance Profiling System for LLM Inference Simulation

Dooly marks the source of each input dimension via taint propagation, enabling performance profiling for multiple configurations with a single inference pass. It reduces profiling GPU time across 12 models by 56.4% while keeping simulation error within 5% for TTFT and 8% for TPOT.

LLM Inference Optimization · Performance Profiling · Configuration Simulation · GPU Efficiency · Inference Latency Prediction · Auto-Tuning
Published 2026-05-09 00:44 · Recent activity 2026-05-11 11:54 · Estimated read 5 min

Section 01

[Introduction] An Overview of Dooly: A Configuration-Agnostic Performance Profiling System for LLM Inference Simulation

Dooly is a configuration-agnostic, redundancy-aware performance profiling system for LLM inference simulation. To address the high cost of full re-profiling in traditional simulators, it marks the source of each input dimension via taint propagation, enabling profiling for multiple configurations with a single inference pass. While keeping simulation error within 5% for TTFT (time to first token) and 8% for TPOT (time per output token), it reduces profiling GPU time across 12 models by 56.4%, providing an efficient path to configuration optimization in LLM deployment.


Section 02

Background: Complexity of LLM Inference Configurations and Bottlenecks of Traditional Simulators

In practical LLM deployment, the configuration options (hardware, serving engine, attention backend, model parameters, etc.) are numerous, and the optimal configuration varies with the workload (input sequence length, output length distribution, concurrent request patterns). Traditional simulators require full re-profiling for every configuration change (e.g., a new batch size or attention backend), making configuration exploration extremely expensive: profiling 12 models can take hundreds to thousands of GPU hours.
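
To get a feel for why full re-profiling is expensive, here is a back-of-the-envelope sketch in Python. The configuration axes and the per-run cost are illustrative assumptions, not figures from the paper:

```python
from itertools import product

# Hypothetical configuration axes; the actual space in the paper differs.
gpus = ["A100", "H100"]
attention_backends = ["FlashAttention", "xFormers", "PyTorch-native"]
batch_sizes = [1, 8, 32, 128]
tensor_parallel = [1, 2, 4]

configs = list(product(gpus, attention_backends, batch_sizes, tensor_parallel))
print(f"{len(configs)} configurations per model")  # 2 * 3 * 4 * 3 = 72

# With full re-profiling, total cost grows as
# (#configs x #models x per-run GPU time).
PROFILE_HOURS_PER_RUN = 0.5  # illustrative assumption
models = 12
print(f"~{len(configs) * models * PROFILE_HOURS_PER_RUN:.0f} GPU hours total")
```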


Section 03

Core Mechanisms of Dooly: Configuration-Agnostic and Redundancy-Aware Profiling

Core insight: the input dimensions of LLM operations come from only two sources, the model configuration and the request parameters, and many of these values repeat across configurations. Key mechanisms: 1. Taint propagation: marks the source of each dimension, identifying which profiling results can be reused and which parameters depend on dynamic inputs; 2. Selective profiling with a latency database: reuses existing measurements and profiles only operations not yet recorded (see the sketch below); 3. Stateful operation handling: reuses the serving engine's initialization code to keep the profiling environment consistent.
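
A minimal Python sketch of the first two mechanisms, under the assumption that an operation's latency depends only on its name and input dimensions; the names (`TaintedDim`, `op_key`, `latency_db`) are mine, not Dooly's actual API:

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    MODEL_CONFIG = "model_config"  # fixed per model (e.g., hidden_size)
    REQUEST = "request"            # varies per request (e.g., seq_len)

@dataclass(frozen=True)
class TaintedDim:
    """A tensor dimension tagged with where its value came from."""
    value: int
    source: Source

def op_key(op_name, dims):
    """Canonical key for an operation: name plus (source, value) per dim.
    Two ops with identical keys have identical latency, so one
    measurement can be reused for both."""
    return (op_name, tuple((d.source, d.value) for d in dims))

latency_db = {}  # op_key -> measured latency (ms)

def profile_if_needed(op_name, dims, run_and_time):
    """Selective profiling: measure only keys not already recorded."""
    key = op_key(op_name, dims)
    if key not in latency_db:
        latency_db[key] = run_and_time()  # single real GPU measurement
    return latency_db[key]

# Example: an attention matmul whose shape mixes both sources.
hidden = TaintedDim(4096, Source.MODEL_CONFIG)
seq_len = TaintedDim(2048, Source.REQUEST)
latency = profile_if_needed("qk_matmul", (seq_len, hidden),
                            run_and_time=lambda: 0.42)  # stub timer
```

Because `hidden` is tagged `MODEL_CONFIG`, any later configuration that changes only request-side parameters can reuse every measurement keyed to the same model-config dimensions.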


Section 04

Performance Evaluation: Dual Breakthroughs in Accuracy and Efficiency

Validated on A100/H100 GPUs, three attention backends (FlashAttention, xFormers, PyTorch native), and 12 models: 1. Simulation accuracy: mean absolute percentage error (MAPE; computed as in the sketch below) ≤5% for TTFT and ≤8% for TPOT; 2. Efficiency: 56.4% less profiling GPU time, owing to operation deduplication, dimension reuse, and an incremental database; 3. Compatibility: the latency database can directly replace the backend of existing simulators, enabling plug-and-play integration.
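
For reference, the accuracy metric is straightforward to compute; the numbers below are illustrative, not from the paper's evaluation:

```python
def mape(predicted, measured):
    """Mean absolute percentage error between simulated and measured latencies."""
    assert len(predicted) == len(measured) and measured
    return 100.0 * sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# Illustrative TTFT values in milliseconds.
ttft_pred, ttft_meas = [102.0, 250.0, 480.0], [100.0, 245.0, 500.0]
print(f"TTFT MAPE: {mape(ttft_pred, ttft_meas):.1f}%")  # ~2.7%, within the 5% bound
```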


Section 05

Practical Application Value: Facilitating LLM Deployment Optimization

1. Configuration space exploration: explore a larger configuration space within a reasonable budget; 2. Model selection: quickly evaluate how new models perform on existing infrastructure; 3. Capacity planning: plan hardware capacity accurately; 4. Auto-tuning: integrate into MLOps pipelines for continuous optimization (see the sketch below).
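
As a sketch of the auto-tuning use case: an MLOps pipeline can rank candidate configurations by simulated latency instead of measuring each one on hardware. `simulate_tpot` below is a hypothetical stand-in for a simulator backed by Dooly's latency database, not a real API:

```python
from itertools import product

def simulate_tpot(config):
    """Toy latency model standing in for a Dooly-backed simulator."""
    backend_cost = {"FlashAttention": 1.0, "xFormers": 1.1, "PyTorch-native": 1.3}
    return backend_cost[config["backend"]] * 32 / config["batch_size"]

candidates = [{"backend": b, "batch_size": bs}
              for b, bs in product(["FlashAttention", "xFormers", "PyTorch-native"],
                                   [8, 16, 32])]
best = min(candidates, key=simulate_tpot)  # cheap: no GPU runs in the loop
print("best config:", best)
```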

Section 06

Technical Insights: Structure-Aware System Design Principles

Dooly demonstrates that when facing a complex configuration space, understanding the structure of the data can matter more than adding compute. The same idea extends to hyperparameter search in deep-learning training and to configuration tuning in distributed systems: identify the independent variables and the redundant parameters, and avoid recomputing what is already known.