# disagg: A Tool for Disaggregation and Heterogeneity Exploration in Data Center LLM Inference

> An open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It supports several disaggregation axes (prefill/decode separation, attention/expert separation, and speculative decoding) and helps developers map the Pareto frontier across throughput, interactivity, and cost.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-07T04:41:15.000Z
- Last activity: 2026-06-07T04:51:28.300Z
- Popularity: 145.8
- Keywords: LLM inference, data center, heterogeneous computing, disaggregation strategy, prefill/decode separation, MoE, speculative decoding, performance optimization, cost optimization, GPU
- Page link: https://www.zingnex.cn/en/forum/thread/disagg
- Canonical: https://www.zingnex.cn/forum/thread/disagg
- Markdown source: floors_fallback

---

## Introduction

This article introduces disagg, an open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It models several disaggregation axes, including prefill/decode separation, attention/expert separation, and speculative decoding, so that developers can map the Pareto frontier across throughput, interactivity, and cost. The project is maintained by epsteinj, hosted on GitHub (https://github.com/epsteinj/disagg), and released on 2026-06-07T04:41:15Z.

## Project Background and Motivation

As LLMs are deployed widely in data centers, inference efficiency has become a key bottleneck for both cost and user experience. Traditional homogeneous deployments cannot fully exploit the strengths of different hardware, and they struggle to balance throughput, interaction latency, and cost per token. disagg is a fork of the transformer_math tool, substantially extended to address that tool's limitation of "not modeling heterogeneity", so that developers can explore the Pareto frontier across different chip combinations and disaggregation strategies.

## Core Features and Disaggregation Axes

disagg supports three disaggregation axes:

1. **Prefill/Decode Separation**: Assign prefill (compute-bound) and decode (memory-bandwidth-bound) work to different chip pools, and optimize the KV cache transfer between them.
2. **Attention/Expert Separation**: For MoE models, place attention layers (which need high-bandwidth memory) and expert layers (which need large-capacity memory) on different hardware.
3. **Speculative Decoding**: Separate the draft model from the target model. A built-in acceptance-rate model estimates the resulting speedup (the project cites 2-3x), which helps evaluate whether the extra hardware investment is worthwhile.
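To make the speculative decoding estimate concrete, here is a minimal sketch of the standard acceptance-rate analysis from the speculative decoding literature. It is not disagg's built-in model. The function names, the assumption that each draft token is accepted independently with probability `alpha`, and the `draft_cost_ratio` parameter are all illustrative assumptions.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target-model verification pass.

    alpha: per-token acceptance probability (assumed i.i.d., 0 <= alpha <= 1).
    gamma: number of draft tokens proposed per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target pass.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Speedup over plain autoregressive decoding.

    draft_cost_ratio: time of one draft-model step divided by the time of one
    target-model step (hypothetical input; it depends on the chips chosen).
    """
    # One step costs gamma draft passes plus one target verification pass.
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    print(speculative_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05))
```

With an acceptance rate of 0.8, four draft tokens per step, and a draft model that costs 5% of a target step, this sketch gives a speedup of about 2.8x, which falls inside the 2-3x range mentioned above. Placing the draft model on cheaper or faster hardware changes `draft_cost_ratio`, which is how the hardware question enters the estimate.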

## Technical Architecture and User Interface

**Technical Architecture**: The core engine is derived from transformer_math and includes a chip performance catalog, model presets, FLOPs calculation, a roofline model, and a parallelism strategy planner. The enhancements are a sustained-effective-compute convention (using achieved MFU and bandwidth efficiency rather than peak figures), a fix for MoE models at low batch sizes (which corrects over-prediction), and a two-tier memory model (supporting fast and cold memory tiers).

**User Interface**: A self-contained web interface supports disaggregation axis selection, heterogeneous chip pool selection, Pareto frontier visualization, and heterogeneous vs. homogeneous comparison. Run `npm run ui` to launch a local preview.
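The roofline model with a sustained-effective convention can be sketched as follows. This is an illustrative sketch, not disagg's engine: the `Chip` fields, the chip figures, and the simplified dense-model cost formulas (2 FLOPs per parameter per token, weights read once per pass, KV cache and attention FLOPs ignored) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    peak_flops: float  # peak compute, FLOP/s
    peak_bw: float     # peak memory bandwidth, bytes/s
    mfu: float         # sustained fraction of peak compute actually achieved
    bw_eff: float      # sustained fraction of peak bandwidth actually achieved


def phase_times(chip: Chip, flops: float, bytes_moved: float) -> tuple[float, float]:
    """Return (compute_seconds, memory_seconds) using sustained rates, not peak."""
    compute_s = flops / (chip.peak_flops * chip.mfu)
    memory_s = bytes_moved / (chip.peak_bw * chip.bw_eff)
    return compute_s, memory_s


def roofline_time(chip: Chip, flops: float, bytes_moved: float) -> float:
    """Roofline estimate: the slower of the compute and memory phases."""
    return max(phase_times(chip, flops, bytes_moved))


def bound(chip: Chip, flops: float, bytes_moved: float) -> str:
    """Report which resource limits the pass."""
    compute_s, memory_s = phase_times(chip, flops, bytes_moved)
    return "compute" if compute_s >= memory_s else "memory"


# Hypothetical chip and a 7B-parameter dense model with 2-byte weights.
chip = Chip("example-accelerator", peak_flops=1e15, peak_bw=3e12, mfu=0.5, bw_eff=0.8)
params = 7e9
weight_bytes = 2 * params

prefill_flops = 2 * params * 2048  # one prompt of 2048 tokens
decode_flops = 2 * params * 1      # one token at batch size 1

if __name__ == "__main__":
    print("prefill:", bound(chip, prefill_flops, weight_bytes))
    print("decode: ", bound(chip, decode_flops, weight_bytes))
```

Under these assumptions prefill is limited by compute and single-token decode by memory bandwidth. That asymmetry is the reason prefill/decode separation can benefit from pairing each phase with a chip whose strengths match it.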

## Validation and Use Cases

**Validation**: To check model correctness, the project includes test/anchors.mjs (which reproduces benchmark points), audit/AUDIT.md (audit records), and `npm test` (catalog validation and related checks).

**Use Cases**: Hardware selection (simulating the performance of chip combinations), capacity planning (working backward from a target load to the hardware scale required), architecture research (exploring the benefits of emerging disaggregation strategies), and cost optimization (finding the lowest cost under performance constraints, or the best performance within a budget).
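The cost-optimization use case amounts to filtering candidate configurations down to a Pareto frontier and then picking the cheapest point that meets the constraints. The sketch below shows that logic in isolation. The `Config` fields, the function names, and all numbers are made up for illustration and do not come from disagg or any benchmark.

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str
    throughput: float     # system tokens/s, higher is better
    interactivity: float  # tokens/s per user, higher is better
    cost: float           # cost per million tokens, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.throughput >= b.throughput
                and a.interactivity >= b.interactivity
                and a.cost <= b.cost)
    strictly_better = (a.throughput > b.throughput
                       or a.interactivity > b.interactivity
                       or a.cost < b.cost)
    return no_worse and strictly_better


def pareto_frontier(configs: list[Config]) -> list[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def cheapest_meeting(configs: list[Config], min_throughput: float,
                     min_interactivity: float) -> Optional[Config]:
    """Lowest-cost configuration that satisfies both performance floors."""
    feasible = [c for c in configs
                if c.throughput >= min_throughput
                and c.interactivity >= min_interactivity]
    return min(feasible, key=lambda c: c.cost) if feasible else None


# Made-up candidate deployments, for illustration only.
candidates = [
    Config("homogeneous", 1000.0, 30.0, 2.0),
    Config("hetero-prefill-decode", 1400.0, 35.0, 1.6),
    Config("hetero-low-latency", 900.0, 60.0, 2.5),
    Config("hetero-dominated", 800.0, 50.0, 2.6),
]

if __name__ == "__main__":
    print([c.name for c in pareto_frontier(candidates)])
    print(cheapest_meeting(candidates, min_throughput=850.0, min_interactivity=40.0))
```

The pairwise dominance check is quadratic in the number of candidates, which is fine for the tens or hundreds of chip combinations a planning session would typically enumerate.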

## Summary and Future Plans

**Summary**: disagg provides a rigorous and practical analysis framework for LLM inference optimization. It helps developers move beyond the "stack GPUs" mindset and find a Pareto-optimal balance of performance, cost, and latency.

**Project Status**: Completed milestones include forking and auditing the engine, the sustained-effective-compute convention, the MoE fixes, the two-tier memory model, the three disaggregation axes, and the web UI.

**Future Plans**: Calibrate the d-Matrix models, add embedding/encoder disaggregation axes, and calibrate MFU per chip.

**Note**: The chip catalog contains vendor-proprietary data; sensitive entries must be scrubbed before public release.
