# Decoupling Limits in MoE Large Model Inference: A Design Space Exploration of Attention-FFN Disaggregation

> Through systematic design space exploration, this study analyzes the benefit boundaries of various decoupling strategies (from chunked-prefill to prefill-decode and then to Attention-FFN Disaggregation) in MoE model serving, providing practical guidance for the design of large-scale inference infrastructure.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-27T10:55:57.000Z
- Last activity: 2026-05-28T01:51:36.517Z
- Popularity: 136.1
- Keywords: large model inference, MoE, decoupled architecture, Attention-FFN, TTFT, TPOT, DeepSeek, system optimization
- Page link: https://www.zingnex.cn/en/forum/thread/moe-attention-ffn
- Canonical: https://www.zingnex.cn/forum/thread/moe-attention-ffn

---

## [Introduction] Exploring Decoupling Limits in MoE Large Model Inference: A Study on the Design Space of Attention-FFN Disaggregation

The paper "How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving" was posted to arXiv on May 27, 2026 (link: http://arxiv.org/abs/2605.28302v1). It explores the design space of decoupling strategies for MoE model serving, from Chunked-Prefill through Prefill-Decode to Attention-FFN Disaggregation (AFD), and identifies where the benefit of each one ends, with the aim of guiding the design of large-scale inference infrastructure. Key findings: AFD has a significant advantage under strict latency constraints, and its benefit depends on how well workload characteristics, resource allocation, and interconnect topology are matched.

## Background: Decoupling Evolution in Large Model Inference and Special Challenges of MoE

Large model inference systems are moving from aggregated (co-located) deployments to decoupled ones, in three stages:
1. Chunked-Prefill aggregation: The prefill of a long sequence is split into chunks that are processed one after another on the same GPU, balancing efficiency and memory use;
2. Prefill-Decode (P/D) decoupling: Compute-intensive prefill and memory-intensive decode run on separate GPU groups;
3. Operator-level AFD: Attention layers and FFN layers are further separated onto different GPU groups.

MoE models pose particular challenges because three kinds of cost are coupled: Attention layers are memory-intensive (limited by KV cache bandwidth), FFN layers are compute-intensive (matrix multiplications for the activated experts), and expert routing adds communication overhead (dispatch/combine). Tuning any one of the three affects the other two, which makes optimization harder.
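
The memory-bound versus compute-bound contrast above can be illustrated with a rough roofline-style estimate. The sketch below is a simplification and not taken from the paper: it assumes plain multi-head attention during decode (no MLA or GQA compression), counts only KV cache reads for attention and only weight reads for the expert FFN, and uses hypothetical hardware numbers in the usage lines.

```python
def attention_decode_intensity(bytes_per_elem: float = 2.0) -> float:
    """FLOPs per byte for one decode step of plain multi-head attention.

    For every cached token and hidden unit, the step performs one
    multiply-add for q.k and one for p.v (4 FLOPs) while reading one K
    element and one V element from the cache. Every request has its own
    KV cache, so batching does not raise this ratio.
    """
    flops = 4.0
    bytes_read = 2.0 * bytes_per_elem
    return flops / bytes_read


def expert_ffn_intensity(tokens_per_expert: int, bytes_per_elem: float = 2.0) -> float:
    """FLOPs per byte for an expert FFN when weight reads dominate.

    Each weight element is read once and reused for one multiply-add
    (2 FLOPs) per token routed to the expert.
    """
    return 2.0 * tokens_per_expert / bytes_per_elem


def is_compute_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """Compare arithmetic intensity with the machine balance point."""
    return intensity >= peak_flops / mem_bandwidth


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 byte/s memory bandwidth.
PEAK, BW = 1e15, 3e12
print(is_compute_bound(attention_decode_intensity(), PEAK, BW))
print(is_compute_bound(expert_ffn_intensity(512), PEAK, BW))
```

Under these assumptions, decode attention stays at about 1 FLOP per byte in 16-bit precision regardless of batch size, while the expert FFN only becomes compute-bound once enough tokens are routed to each expert. That asymmetry is one way to see why the two layer types might favor different hardware allocations.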

## Methodology: Attention-FFN Disaggregation (AFD) Architecture and Evaluation Framework

**AFD Architecture**: Attention computation and FFN computation run on separate GPU groups:
- Attention GPU group: optimized for KV cache management and memory access efficiency;
- FFN GPU group: sized to keep compute resources fully utilized.

The two groups exchange activations over a high-speed interconnect (e.g., NVLink), so each component can run on hardware suited to it.

**Evaluation Framework**: Combines on-device kernel measurements with high-fidelity network simulation, and sweeps input/output sequence length combinations, Prefix-KV reuse rates, and per-user latency constraints (TTFT/TPOT SLOs).
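
To get a feel for the activation traffic between the Attention and FFN groups, a back-of-the-envelope estimate may help. The sketch below is an illustration under stated assumptions, not the paper's network simulation: it assumes one hidden vector per token travels in each direction per layer, ignores the extra dispatch volume that top-k expert routing can add, and treats `per_message_latency_s` as a hypothetical fixed cost per transfer.

```python
def activation_bytes_per_layer(tokens: int, hidden_size: int,
                               bytes_per_elem: float = 2.0) -> float:
    """Bytes moved per layer: attention -> FFN plus FFN -> attention."""
    one_way = tokens * hidden_size * bytes_per_elem
    return 2 * one_way


def transfer_time_per_step(tokens: int, hidden_size: int, num_layers: int,
                           bandwidth_bytes_per_s: float,
                           bytes_per_elem: float = 2.0,
                           per_message_latency_s: float = 0.0) -> float:
    """Total time spent moving activations for one forward step.

    Assumes transfers are not overlapped with compute, so this is an
    upper bound on the communication cost exposed to the step.
    """
    payload = activation_bytes_per_layer(tokens, hidden_size, bytes_per_elem)
    per_layer = payload / bandwidth_bytes_per_s + 2 * per_message_latency_s
    return num_layers * per_layer


# Hypothetical example: 128 tokens in flight, hidden size 4096, 32 layers,
# 400 GB/s of usable bandwidth between the two GPU groups.
print(transfer_time_per_step(128, 4096, 32, 400e9))
```

Because this cost is paid on every layer of every decode step, it counts directly against the TPOT budget unless it is hidden behind compute, which is one reason the interconnect matters so much for AFD.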

## Evidence: Benefit Boundaries of Decoupling Strategies and Key Findings

Key Findings:
1. **Quantifiable Benefit Boundaries**: Each decoupling level has a clear, measurable limit on its benefit, and that limit shifts with the workload;
2. **AFD's Significant Advantages Under Strict SLO**: On DeepSeek-V3.2, AFD achieves a system throughput of approximately 4000 tokens/s, while non-AFD deployment is infeasible in this scenario;
3. **Workload Determines Optimal Configuration**: Scenarios such as chat, code generation, and agentic programming differ widely in how sensitive they are to the choice of decoupling strategy.
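
Throughput under an SLO, as in finding 2, only counts work that meets the latency targets. The sketch below shows one common way to compute it from per-request token timestamps. It is an assumption for illustration: the source does not give the paper's exact metric or SLO values, and `RequestTrace` is a hypothetical record type.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    arrival: float            # time the request arrived, in seconds
    token_times: List[float]  # emission time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def meets_slo(trace: RequestTrace, ttft_slo: float, tpot_slo: float) -> bool:
    return ttft(trace) <= ttft_slo and tpot(trace) <= tpot_slo


def slo_throughput(traces: List[RequestTrace], ttft_slo: float,
                   tpot_slo: float, duration: float) -> float:
    """Output tokens per second, counting only SLO-compliant requests."""
    good_tokens = sum(len(t.token_times) for t in traces
                      if meets_slo(t, ttft_slo, tpot_slo))
    return good_tokens / duration


traces = [
    RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.25]),
    RequestTrace(arrival=0.0, token_times=[2.0, 2.5]),
]
print(slo_throughput(traces, ttft_slo=1.0, tpot_slo=0.3, duration=2.0))
```

With a metric of this kind, a deployment is "infeasible" when no configuration keeps requests within both limits, so its SLO-compliant throughput drops to zero even if raw token throughput is high.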

## Design Principles: Applicable Scenarios for Hierarchical Decoupling and Resource Allocation Strategies

Design Principles:
1. **Applicable Scenarios for Hierarchical Decoupling**:
   - Chunked-Prefill: Basic optimization for long input sequence scenarios;
   - P/D decoupling: Significant benefits when there is a large difference in latency requirements between prefill and decode;
   - AFD: Necessary when strict latency SLO and high throughput requirements coexist;
2. **GPU Partitioning Strategy**: Adjust the Attention-to-FFN GPU ratio dynamically: raise the share of Attention GPUs for long-context or high KV reuse workloads, and raise the share of FFN GPUs for large-batch or expert-activation-heavy workloads;
3. **Importance of Interconnection Topology**: Decoupled architectures are sensitive to inter-GPU bandwidth; rack-level NVLink and cluster-level RDMA topologies directly affect benefits.
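
The partitioning principle can be sketched as a small search that balances the two stages. This is a toy model, not the paper's method: it assumes each stage speeds up linearly with its GPU count, that the slower stage sets the pace, and that communication is free. `attn_work` and `ffn_work` are hypothetical single-GPU times for the same batch.

```python
from typing import Tuple


def best_split(total_gpus: int, attn_work: float,
               ffn_work: float) -> Tuple[int, int, float]:
    """Pick the Attention/FFN GPU split that minimizes the slower stage.

    Returns (attention_gpus, ffn_gpus, stage_time). Both groups must have
    at least one GPU, so total_gpus must be 2 or more.
    """
    if total_gpus < 2:
        raise ValueError("need at least one GPU per group")
    best = None
    for n_attn in range(1, total_gpus):
        n_ffn = total_gpus - n_attn
        # The pipeline advances at the pace of its slower stage.
        stage_time = max(attn_work / n_attn, ffn_work / n_ffn)
        if best is None or stage_time < best[2]:
            best = (n_attn, n_ffn, stage_time)
    return best


# Attention-heavy workload (e.g. long context): more Attention GPUs.
print(best_split(8, attn_work=3.0, ffn_work=1.0))
# FFN-heavy workload (e.g. large batch): more FFN GPUs.
print(best_split(8, attn_work=1.0, ffn_work=3.0))
```

A real planner would also need to account for the interconnect cost between the groups, which this sketch deliberately leaves out.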

## Implications: Guidance for Large Model Inference Infrastructure Construction

Implications for Infrastructure:
- **Current Deployment**: The study offers specific configuration recommendations for rack-level and cluster-level deployments, helping teams avoid decoupling blindly;
- **Future Architecture**: It points to directions for next-generation decoupled AI infrastructure: finer-grained resource pooling, flexible scheduling, and high-speed interconnects;
- **Cost-Effectiveness**: It helps decision-makers weigh hardware investment against performance requirements.

## Conclusion: Value of Decoupling and Practical Key Points

This study answers the key question of "how far can decoupling go": AFD can bring significant benefits in MoE model serving (especially under strict latency constraints), but the benefits depend on the careful matching of workload characteristics, resource allocation strategies, and interconnection topology. For teams building or optimizing large model inference infrastructure, these findings provide valuable practical guidance.
