Zing Forum


disagg: A Tool for Disaggregation and Heterogeneity Exploration in Data Center LLM Inference

An open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It supports multiple disaggregation axes, including prefill/decode separation, attention/expert separation, and speculative decoding, and helps developers map the Pareto frontier across throughput, interactivity, and cost.

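Tags: LLM inference, data center, heterogeneous computing, disaggregation strategies, prefill/decode separation, MoE, speculative decoding, performance optimization, cost optimization, GPU
Published 2026-06-07 12:41 | Recent activity 2026-06-07 12:51 | Estimated read 6 min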

Section 01

[Introduction] disagg: A Tool for Disaggregation and Heterogeneity Exploration in Data Center LLM Inference

This article introduces disagg, an open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It supports multiple disaggregation axes, including prefill/decode separation, attention/expert separation, and speculative decoding, and helps developers map the Pareto frontier across throughput, interactivity, and cost. The project is maintained by epsteinj, hosted on GitHub (https://github.com/epsteinj/disagg), and was released on 2026-06-07T04:41:15Z.


Section 02

Project Background and Motivation

With LLMs now widely deployed in data centers, inference efficiency has become a key bottleneck for both cost and user experience. Traditional homogeneous deployments struggle to exploit the strengths of different hardware while balancing throughput, interaction latency, and cost per token. disagg is a fork of the transformer_math tool, substantially extended to address that tool's limitation of "not modeling heterogeneity". Its goal is to let developers explore the Pareto frontier across different chip combinations and disaggregation strategies.


Section 03

Core Features and Disaggregation Axes

disagg supports three disaggregation axes:

1. Prefill/Decode Separation: Assign prefill (compute-intensive) and decode (memory-access-intensive) work to different chip pools, and optimize the KV cache transfer between them.
2. Attention/Expert Separation: For MoE models, deploy attention layers (which need high-bandwidth memory) and expert layers (which need large-capacity memory) on different hardware.
3. Speculative Decoding: Run the draft model and the target model separately. A built-in acceptance-rate model estimates the resulting speedup (2-3x) and helps evaluate whether the hardware investment is worthwhile.
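To make the acceptance-rate idea behind the speculative decoding axis concrete, here is a minimal sketch of the standard textbook estimate, assuming each drafted token is accepted independently with probability alpha. This is not taken from disagg's source; the function names, the costRatio parameter, and the example numbers are all hypothetical.

```javascript
// Expected tokens produced per target-model pass when the draft model
// proposes `gamma` tokens and each is accepted independently with
// probability `alpha` (an assumption; real acceptance is not i.i.d.).
function expectedTokensPerStep(alpha, gamma) {
  if (alpha < 0 || alpha > 1) throw new RangeError("alpha must be in [0, 1]");
  if (alpha === 1) return gamma + 1; // every draft token accepted, plus one bonus token
  return (1 - Math.pow(alpha, gamma + 1)) / (1 - alpha);
}

// Speedup over plain autoregressive decoding.
// costRatio = (time of one draft step) / (time of one target pass); hypothetical parameter.
function speculativeSpeedup(alpha, gamma, costRatio) {
  const costPerStep = gamma * costRatio + 1; // gamma draft steps + 1 target verification pass
  return expectedTokensPerStep(alpha, gamma) / costPerStep;
}

// Illustrative values only: 80% acceptance, 4 drafted tokens, draft 20x cheaper than target.
const speedup = speculativeSpeedup(0.8, 4, 0.05);
console.log(`estimated speedup: ${speedup.toFixed(2)}x`);
```

Under these illustrative inputs the estimate comes out at roughly 2.8x, in line with the 2-3x range cited above. With a low acceptance rate the same formula drops below 1x, which is the case where dedicated draft hardware would not pay off.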


Section 04

Technical Architecture and User Interface

Technical Architecture: The core engine is derived from transformer_math and includes a chip performance catalog, model presets, FLOPs calculation, a roofline model, and a parallel strategy planner. Enhancements include a sustained-effective-compute convention (using actual MFU and bandwidth efficiency), a fix for MoE at low batch sizes (addressing over-prediction), and a two-tier memory model (supporting fast/cold memory tiering).

User Interface: A self-contained web interface supports disaggregation axis selection, heterogeneous chip pool selection, Pareto frontier visualization, and heterogeneous vs. homogeneous comparison. Launch a local preview via npm run ui.
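The roofline model and the sustained-effective-compute convention can be illustrated with a short sketch. It is a simplified illustration under stated assumptions, not disagg's engine: the chip entry, field names, and numbers are hypothetical, and the cost model counts only dense weight FLOPs and weight reads (no attention FLOPs, KV cache traffic, or interconnect).

```javascript
// Hypothetical chip entry; values are illustrative, not from disagg's catalog.
const chip = {
  peakFlops: 1e15,          // peak FLOP/s
  peakBandwidth: 3e12,      // peak memory bandwidth, bytes/s
  mfu: 0.5,                 // sustained fraction of peak compute actually achieved
  bandwidthEfficiency: 0.8, // sustained fraction of peak bandwidth actually achieved
};

// Roofline: a pass takes as long as the slower of compute and memory traffic,
// each derated by its sustained efficiency rather than taken at peak.
function rooflineTime(flops, bytes, c) {
  const computeTime = flops / (c.peakFlops * c.mfu);
  const memoryTime = bytes / (c.peakBandwidth * c.bandwidthEfficiency);
  return {
    time: Math.max(computeTime, memoryTime),
    bound: computeTime >= memoryTime ? "compute" : "memory",
  };
}

// Simplified dense-model cost: about 2 FLOPs per parameter per token,
// and the weights are read from memory once per forward pass.
function passCost(params, bytesPerParam, tokens) {
  return { flops: 2 * params * tokens, bytes: params * bytesPerParam };
}

const params = 70e9; // hypothetical 70B-parameter model with 2-byte weights
const prefillCost = passCost(params, 2, 4096); // one 4096-token prompt
const decodeCost = passCost(params, 2, 1);     // one generated token
const prefill = rooflineTime(prefillCost.flops, prefillCost.bytes, chip);
const decode = rooflineTime(decodeCost.flops, decodeCost.bytes, chip);
console.log("prefill:", prefill.bound, "decode:", decode.bound);
```

With these inputs prefill lands on the compute side of the roofline and single-token decode on the memory side, which is the asymmetry that motivates putting the two phases on different chip pools.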


Section 05

Validation and Use Cases

Validation: To ensure model correctness, the project includes test/anchors.mjs (reproduces benchmark points), audit/AUDIT.md (audit records), and npm test (directory validation, etc.).

Use Cases: Hardware selection decisions (simulate the performance of chip combinations), capacity planning (reverse-engineer the required hardware scale), architecture research (explore the benefits of emerging disaggregation strategies), and cost optimization (find the lowest cost under performance constraints, or the best performance within a budget).
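The cost optimization use case amounts to filtering candidate configurations down to a Pareto frontier and then picking from it. The sketch below shows one way this could be done, assuming higher throughput and interactivity are better and lower cost is better. The configuration names, fields, and numbers are hypothetical and do not come from disagg.

```javascript
// a dominates b if it is no worse on every objective and strictly better on at least one.
function dominates(a, b) {
  const noWorse =
    a.throughput >= b.throughput && a.interactivity >= b.interactivity && a.cost <= b.cost;
  const strictlyBetter =
    a.throughput > b.throughput || a.interactivity > b.interactivity || a.cost < b.cost;
  return noWorse && strictlyBetter;
}

// Keep only configurations that no other configuration dominates.
function paretoFrontier(configs) {
  return configs.filter((c) => !configs.some((other) => dominates(other, c)));
}

// Cheapest configuration that meets both performance floors, or null if none does.
function cheapestMeeting(configs, minThroughput, minInteractivity) {
  const feasible = configs.filter(
    (c) => c.throughput >= minThroughput && c.interactivity >= minInteractivity
  );
  if (feasible.length === 0) return null;
  return feasible.reduce((best, c) => (c.cost < best.cost ? c : best));
}

// Hypothetical candidates (throughput in tokens/s, interactivity in tokens/s/user, cost in arbitrary units).
const candidates = [
  { name: "homogeneous-A", throughput: 1000, interactivity: 40, cost: 2.0 },
  { name: "hetero-A+B", throughput: 1200, interactivity: 45, cost: 1.8 },
  { name: "hetero-B+C", throughput: 900, interactivity: 60, cost: 2.5 },
  { name: "homogeneous-C", throughput: 800, interactivity: 55, cost: 2.6 },
];

const frontier = paretoFrontier(candidates);
const pick = cheapestMeeting(frontier, 850, 50);
console.log(frontier.map((c) => c.name), pick ? pick.name : "no feasible config");
```

The pairwise check is quadratic in the number of candidates, which is fine for a few hundred configurations; a larger sweep would call for a sort-based method.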


Section 06

Summary and Future Plans

Summary: disagg provides a rigorous and practical analysis framework for LLM inference optimization, helping developers move beyond the "stack GPUs" mindset and find a Pareto-optimal balance of performance, cost, and latency.

Project Status: Completed milestones include engine forking and auditing, the sustained-effective-compute convention, MoE fixes, the two-tier memory model, three disaggregation axes, and the web UI.

Future Plans: Calibrate d-Matrix models, support embedding/encoder disaggregation axes, and add per-chip MFU calibration.

Note: The chip catalog contains vendor-proprietary data; sensitive lines need to be cleaned before public release.