Zing Forum


RecomputeOrMigrate: A Network-Aware KV Cache Recovery Scheduler for Disaggregated LLM Inference

RecomputeOrMigrate (RoM/KVRS) is a lightweight scheduler for disaggregated large language model (LLM) inference systems. It dynamically decides whether to migrate KV cache or recompute after a decoding GPU failure, optimizing the recovery strategy based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%.

disaggregated inference · KV cache · failure recovery · network-aware scheduling · LLM serving · DistServe · high availability · distributed systems
Published 2026-05-04 13:08 · Recent activity 2026-05-04 13:24 · Estimated read 6 min

Section 01

Introduction: RecomputeOrMigrate, a Network-Aware KV Cache Recovery Scheduler for Disaggregated LLM Inference

RecomputeOrMigrate (abbreviated as RoM/KVRS) is a lightweight scheduler for disaggregated LLM inference systems, addressing the KV cache recovery decision problem after a decoding GPU failure. It dynamically chooses between migrating KV cache or recomputing based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%, providing new insights for the reliability design of disaggregated architectures.


Section 02

Background: The Rise of Disaggregated Architectures and Limitations of Static Strategies

Modern LLM serving systems adopt a prefill-decoding disaggregated architecture: the prefill phase processes prompts in batches to generate KV cache, while the decoding phase is optimized for low-latency token generation. Systems like DistServe transfer KV cache efficiently over intra-node NVLink, but when a decoding GPU fails, its requests must be reassigned. Existing static strategies (e.g., always migrate) ignore network dynamics: cross-node Ethernet bandwidth is only 10-100 Gbps (6-60 times lower than NVLink), so migration is slow under network congestion; meanwhile, the cost of recomputation scales with prompt length, and a static policy cannot adapt to this shifting trade-off.
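To make the trade-off concrete, here is a back-of-the-envelope sketch (not from the paper) of how long a KV cache migration takes. The per-token KV size follows from OPT-13B's 40 layers and 5120 hidden dimension in fp16; the link speeds are the Ethernet range quoted above.

```python
# Back-of-the-envelope KV cache migration time (illustrative numbers).
LAYERS, HIDDEN, BYTES_FP16 = 40, 5120, 2   # OPT-13B dimensions, fp16 weights
# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES_FP16   # 819,200 B ≈ 0.8 MB

def migration_seconds(prompt_tokens: int, link_gbps: float) -> float:
    """Time to move a prompt's KV cache over a link of the given speed."""
    cache_bytes = prompt_tokens * kv_bytes_per_token
    return cache_bytes / (link_gbps * 1e9 / 8)   # Gbps -> bytes/s

for gbps in (10, 100):   # cross-node Ethernet range cited in the text
    t = migration_seconds(2048, gbps)
    print(f"2048-token prompt over {gbps} Gbps: {t:.2f} s")
```

For a 2048-token prompt this gives roughly 1.3 s at 10 Gbps and 0.13 s at 100 Gbps; depending on congestion, either of those can be above or below the prefill time, which is exactly why a fixed policy cannot win in both regimes.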


Section 03

Core Design of RoM: Network-Aware Adaptive Decision Mechanism

The core of RoM is a lightweight scheduler that dynamically evaluates recovery costs at the request level:

  • Migration cost: C_mig = KV cache size / measured bandwidth
  • Recomputation cost: C_recomp = prefill time (obtained via a lookup table)

Decision logic: migrate if C_mig ≤ C_recomp, otherwise recompute.

System components include a bandwidth monitor (EWMA-smoothed estimation), node information exchange (a gossip protocol for state collection), slot reservation (limiting concurrent migrations), and the recovery scheduler (which combines these signals into a decision). All complexity is isolated in the failure path, with zero overhead on the normal path.
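The decision mechanism above can be sketched as follows; the class and parameter names (`BandwidthMonitor`, `prefill_lut`, the smoothing factor) are assumptions for illustration, not the paper's actual implementation.

```python
class BandwidthMonitor:
    """EWMA-smoothed bandwidth estimate in Gbps (hypothetical interface)."""
    def __init__(self, alpha: float = 0.3):
        self.alpha, self.estimate = alpha, None

    def update(self, sample_gbps: float) -> None:
        if self.estimate is None:
            self.estimate = sample_gbps
        else:
            self.estimate = self.alpha * sample_gbps + (1 - self.alpha) * self.estimate

def choose_recovery(kv_cache_bytes: int, prompt_tokens: int,
                    monitor: BandwidthMonitor, prefill_lut: dict[int, float]) -> str:
    """Migrate iff C_mig <= C_recomp, per the decision rule above."""
    c_mig = kv_cache_bytes / (monitor.estimate * 1e9 / 8)   # seconds
    # C_recomp: prefill time for the nearest profiled prompt length.
    nearest = min(prefill_lut, key=lambda n: abs(n - prompt_tokens))
    c_recomp = prefill_lut[nearest]
    return "migrate" if c_mig <= c_recomp else "recompute"

# Example: a congested link makes recomputation cheaper for a short prompt.
mon = BandwidthMonitor()
for sample in (25.0, 12.0, 8.0):   # bandwidth samples as congestion builds
    mon.update(sample)
lut = {512: 0.08, 2048: 0.30, 8192: 1.10}   # assumed profiled prefill times (s)
print(choose_recovery(512 * 819_200, 512, mon, lut))   # -> recompute
```

With a healthy 100 Gbps estimate the same call returns "migrate", which is the adaptivity a static policy lacks.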

Section 04

Experimental Validation: Performance Improvement and Model Accuracy

Tested on a two-node A100 cluster with OPT-13B:

  1. Crossover-point validation: the error between the cost model's predictions and measured recovery times is <2%, confirming the model's reliability;
  2. Throughput improvement: under a 90% load-skew scenario, RoM improves effective throughput by 8.6% over static strategies, while healthy-path performance matches native DistServe;
  3. Overhead analysis: normal operation adds only lightweight probing, failure-time decisions complete in O(1), and slot reservation keeps the system stable.
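Slot reservation can be sketched as a counting semaphore that caps concurrent migrations, so a burst of failures degrades to recomputation instead of saturating the cross-node link; the cap value and interface here are assumptions, not RoM's actual code.

```python
import threading

class MigrationSlots:
    """Cap concurrent KV cache migrations (illustrative sketch)."""
    def __init__(self, max_concurrent: int = 4):
        self._sem = threading.Semaphore(max_concurrent)

    def try_reserve(self) -> bool:
        # Non-blocking: if no slot is free, the scheduler falls back to recompute.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()

slots = MigrationSlots(max_concurrent=2)
decisions = ["migrate" if slots.try_reserve() else "recompute" for _ in range(3)]
print(decisions)   # third request falls back once both slots are taken
```

The non-blocking reservation is what keeps the failure-path decision O(1): a full slot pool is just another input to the migrate-or-recompute choice, never a queue to wait on.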

Section 05

Technical Significance: Insights from Network Awareness and Failure Path Optimization

RoM demonstrates that network dynamics matter in distributed AI systems and that the network should be treated as a first-class design concern. Its design philosophy of isolating complexity to the failure path improves system reliability without affecting healthy requests, and its adaptive decision framework extends to other scenarios that require dynamically selecting among execution paths.


Section 06

Limitations and Future Directions

RoM is currently built on the DistServe architecture, so porting it to other disaggregated systems requires adaptation; experiments used only OPT-13B, so the cost model needs recalibration for larger models (e.g., 70B) or multimodal models. Future directions include integrating request priorities, ML-driven cost prediction, and extending the approach to prefill-decoding load balancing.


Section 07

Summary: Value of RoM and Engineering Practice Recommendations

RoM provides an elegant solution for failure recovery in disaggregated LLM inference, improving system efficiency with zero overhead on healthy paths through network-aware adaptive decision-making. Insights for engineers: Do not assume the network is statically sufficient; instead, make intelligent decisions based on real-time conditions—this philosophy applies to a wide range of distributed system optimizations.