disagg: A Tool for Disaggregation and Heterogeneity Exploration in Data Center LLM Inference

An open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It supports several disaggregation axes (prefill/decode separation, attention/expert separation, and speculative decoding) and helps developers map the Pareto frontier across throughput, interactivity, and cost.

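Tags: LLM inference, data center, heterogeneous computing, disaggregation strategy, prefill/decode separation, MoE, speculative decoding, performance optimization, cost optimization, GPU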

Published 2026-06-07 12:41 | Recent activity 2026-06-07 12:51 | Estimated read 6 min

Section 01

Introduction

This article introduces disagg, an open-source tool for exploring disaggregation strategies and heterogeneous chip configurations in data center LLM inference. It covers prefill/decode separation, attention/expert separation, and speculative decoding, and helps developers map the Pareto frontier across throughput, interactivity, and cost. The project is maintained by epsteinj, hosted on GitHub (https://github.com/epsteinj/disagg), and was released on 2026-06-07T04:41:15Z.

Section 02

Project Background and Motivation

With LLMs widely deployed in data centers, inference efficiency has become a key bottleneck for both cost and user experience. Homogeneous deployments struggle to exploit the differing strengths of different hardware and to balance throughput, interaction latency, and cost per token. disagg is a fork of the transformer_math tool, extended to address that tool's limitation of "not modeling heterogeneity", so that developers can explore the Pareto frontier across chip combinations and disaggregation strategies.

Section 03

Core Features and Disaggregation Axes

disagg supports three disaggregation axes:

1. Prefill/Decode Separation: Assign prefill (compute-intensive) and decode (memory-access-intensive) work to different chip pools, and optimize the KV cache transfer between them.
2. Attention/Expert Separation: For MoE models, deploy attention layers (which need high-bandwidth memory) and expert layers (which need large-capacity memory) on different hardware.
3. Speculative Decoding: Separate the draft model from the target model. A built-in acceptance-rate model estimates the resulting speedup (cited as 2-3x) so the value of the extra hardware can be evaluated.
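To make the speculative-decoding axis concrete, here is a minimal sketch of a commonly used acceptance-rate estimate. It is not disagg's built-in model, whose details the article does not give. It assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per verification step, and that one draft forward pass costs `draft_cost_ratio` times one target forward pass; all three names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes each of the `gamma` draft tokens is accepted independently
    with probability `alpha` (a simplification, not disagg's model).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, gamma: int, draft_cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One step costs `gamma` draft passes plus one target pass, measured
    in units of a single target forward pass.
    """
    step_cost = gamma * draft_cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    print(round(speculative_speedup(alpha=0.8, gamma=4, draft_cost_ratio=0.05), 2))
```

Under these assumptions, an acceptance rate of 0.8 with four drafted tokens and a draft model costing 5% of the target gives roughly a 2.8x speedup, which falls inside the 2-3x range quoted above. Lower acceptance rates or a more expensive draft model shrink the gain quickly.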

Section 04

Technical Architecture and User Interface

Technical Architecture: The core engine is derived from transformer_math and includes a chip performance catalog, model presets, FLOPs calculation, a roofline model, and a parallel strategy planner. The enhancements are a sustained-effective-compute convention (rating chips by their actual MFU and bandwidth efficiency), an MoE low-batch fix (which resolves over-prediction at small batch sizes), and a two-tier memory model (which supports fast/cold memory tiering).

User Interface: A self-contained web interface supports disaggregation axis selection, heterogeneous chip pool selection, Pareto frontier visualization, and heterogeneous vs. homogeneous comparison. Run npm run ui to launch a local preview.
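The roofline model and the sustained-efficiency convention can be illustrated with a short sketch. This is not disagg's engine: the `Chip` fields, the chip figures, and the cost formulas (about 2 FLOPs per parameter per token, weights read once per pass, KV cache and attention FLOPs ignored) are simplifying assumptions chosen only to show why prefill tends to be compute-bound and decode memory-bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    peak_flops: float      # peak compute, FLOP/s
    peak_bandwidth: float  # peak memory bandwidth, bytes/s
    mfu: float             # sustained fraction of peak compute
    bw_efficiency: float   # sustained fraction of peak bandwidth


def compute_time(chip: Chip, flops: float) -> float:
    return flops / (chip.peak_flops * chip.mfu)


def memory_time(chip: Chip, bytes_moved: float) -> float:
    return bytes_moved / (chip.peak_bandwidth * chip.bw_efficiency)


def roofline_time(chip: Chip, flops: float, bytes_moved: float) -> float:
    """Step time is set by whichever resource is the bottleneck."""
    return max(compute_time(chip, flops), memory_time(chip, bytes_moved))


def bound(chip: Chip, flops: float, bytes_moved: float) -> str:
    if compute_time(chip, flops) >= memory_time(chip, bytes_moved):
        return "compute"
    return "memory"


# Hypothetical chip and a dense 7B-parameter model with 2-byte weights.
chip = Chip("example-chip", peak_flops=1e15, peak_bandwidth=2e12,
            mfu=0.5, bw_efficiency=0.8)
params = 7e9
weight_bytes = params * 2

# Prefill processes 2048 prompt tokens in one pass; decode emits one token.
prefill_flops = 2 * params * 2048
decode_flops = 2 * params * 1

if __name__ == "__main__":
    print("prefill:", bound(chip, prefill_flops, weight_bytes))
    print("decode: ", bound(chip, decode_flops, weight_bytes))
```

With these illustrative numbers, prefill lands on the compute roof and single-token decode lands on the bandwidth roof, which is the asymmetry that motivates placing the two phases on different chip pools.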

Section 05

Validation and Use Cases

Validation: To check model correctness, the project includes test/anchors.mjs (which reproduces benchmark points), audit/AUDIT.md (audit records), and npm test (directory validation, etc.).

Use Cases: Hardware selection decisions (simulate the performance of chip combinations), capacity planning (work backward to the required hardware scale), architecture research (explore the benefits of emerging disaggregation strategies), and cost optimization (find the lowest cost under performance constraints, or the best performance within a budget).
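The cost-optimization use case reduces to two small operations over a set of candidate deployments: keeping the non-dominated ones (the Pareto frontier) and picking the cheapest one that meets given constraints. The sketch below shows both. The `Config` fields and the sample numbers are hypothetical and are not outputs of disagg.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    label: str
    throughput: float     # tokens/s for the whole deployment, higher is better
    interactivity: float  # tokens/s per user, higher is better
    cost: float           # cost per million tokens, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    no_worse = (a.throughput >= b.throughput
                and a.interactivity >= b.interactivity
                and a.cost <= b.cost)
    strictly_better = (a.throughput > b.throughput
                       or a.interactivity > b.interactivity
                       or a.cost < b.cost)
    return no_worse and strictly_better


def pareto_frontier(configs: Sequence[Config]) -> list[Config]:
    """Keep every configuration that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


def cheapest_meeting(configs: Sequence[Config], min_throughput: float,
                     min_interactivity: float) -> Optional[Config]:
    """Lowest-cost configuration that satisfies both performance floors."""
    feasible = [c for c in configs
                if c.throughput >= min_throughput
                and c.interactivity >= min_interactivity]
    return min(feasible, key=lambda c: c.cost) if feasible else None


# Made-up candidates, for illustration only.
candidates = [
    Config("homogeneous", 1000, 40, 2.0),
    Config("prefill/decode split", 1400, 45, 1.6),
    Config("speculative decoding", 900, 90, 2.4),
    Config("undersized pool", 800, 30, 2.5),
]

if __name__ == "__main__":
    for c in pareto_frontier(candidates):
        print(c.label)
```

The pairwise check is quadratic in the number of candidates, which is adequate for the few hundred configurations a sweep of this kind typically produces.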

Section 06

Summary and Future Plans

Summary: disagg provides a rigorous and practical analysis framework for LLM inference optimization, helping developers move beyond the "stack GPUs" mindset and find a Pareto-optimal balance of performance, cost, and latency.

Project Status: Completed milestones include forking and auditing the engine, the sustained-effective-compute convention, the MoE fix, the two-tier memory model, the three disaggregation axes, and the web UI.

Future Plans: Calibrate the d-Matrix models, add embedding/encoder disaggregation axes, and calibrate MFU per chip.

Note: The chip catalog contains vendor-proprietary data; sensitive lines need to be cleaned before public release.
