Zing Forum


RecomputeOrMigrate: A Network-Aware KV Cache Recovery Scheduler for Disaggregated LLM Inference

RecomputeOrMigrate (RoM/KVRS) is a lightweight scheduler for disaggregated large language model (LLM) inference systems. It dynamically decides whether to migrate KV cache or recompute after a decoding GPU failure, optimizing the recovery strategy based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%.

disaggregated inference · KV cache · failure recovery · network-aware scheduling · LLM serving · DistServe · high availability · distributed systems
Published 2026-05-04 13:08 · Recent activity 2026-05-04 13:24 · Estimated read 6 min

Section 01

Introduction: RecomputeOrMigrate, a Network-Aware KV Cache Recovery Scheduler for Disaggregated LLM Inference

RecomputeOrMigrate (abbreviated as RoM/KVRS) is a lightweight scheduler for disaggregated LLM inference systems, addressing the KV cache recovery decision problem after a decoding GPU failure. It dynamically chooses between migrating KV cache or recomputing based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%, providing new insights for the reliability design of disaggregated architectures.


Section 02

Background: The Rise of Disaggregated Architectures and Limitations of Static Strategies

Modern LLM serving systems adopt a prefill-decoding disaggregated architecture: the prefill phase processes prompts in batches to generate KV cache, while the decoding phase is optimized for low-latency token generation. Systems like DistServe transfer KV cache efficiently over intra-node NVLink, but when a decoding GPU fails, its requests must be reassigned. Existing static strategies (e.g., always migrate) ignore network dynamics: cross-node Ethernet bandwidth is only 10-100 Gbps (6-60 times lower than NVLink), so migration is slow under network congestion; meanwhile, the cost of recomputation scales with prompt length, and a static policy cannot adapt to this shifting trade-off.
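To make the trade-off concrete, here is a back-of-the-envelope sketch (not from the paper) of how long a KV cache migration takes. The per-token KV size follows from OPT-13B's 40 layers and 5120 hidden dimension in fp16; the link speeds are the Ethernet range quoted above.

```python
# Back-of-the-envelope KV cache migration time (illustrative numbers).
LAYERS, HIDDEN, BYTES_FP16 = 40, 5120, 2   # OPT-13B dimensions, fp16 weights
# Each token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * LAYERS * HIDDEN * BYTES_FP16   # 819,200 B ≈ 0.8 MB

def migration_seconds(prompt_tokens: int, link_gbps: float) -> float:
    """Time to move a prompt's KV cache over a link of the given speed."""
    cache_bytes = prompt_tokens * kv_bytes_per_token
    return cache_bytes / (link_gbps * 1e9 / 8)   # Gbps -> bytes/s

for gbps in (10, 100):   # cross-node Ethernet range cited in the text
    t = migration_seconds(2048, gbps)
    print(f"2048-token prompt over {gbps} Gbps: {t:.2f} s")
```

For a 2048-token prompt this gives roughly 1.3 s at 10 Gbps and 0.13 s at 100 Gbps; depending on congestion, either of those can be above or below the prefill time, which is exactly why a fixed policy cannot win in both regimes.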


Section 03

Core Design of RoM: Network-Aware Adaptive Decision Mechanism

The core of RoM is a lightweight scheduler that dynamically evaluates recovery costs at the request level:

  • Migration cost: C_mig = KV cache size / measured bandwidth
  • Recomputation cost: C_recomp = prefill time (obtained via a lookup table)

Decision logic: migrate if C_mig ≤ C_recomp, otherwise recompute.

System components include a bandwidth monitor (EWMA-smoothed estimation), node information exchange (a gossip protocol for state collection), slot reservation (limiting concurrent migrations), and the recovery scheduler (which combines these signals into a decision). All complexity is isolated in the failure path, with zero overhead on the normal path.
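The decision mechanism above can be sketched as follows; the class and parameter names (`BandwidthMonitor`, `prefill_lut`, the smoothing factor) are assumptions for illustration, not the paper's actual implementation.

```python
class BandwidthMonitor:
    """EWMA-smoothed bandwidth estimate in Gbps (hypothetical interface)."""
    def __init__(self, alpha: float = 0.3):
        self.alpha, self.estimate = alpha, None

    def update(self, sample_gbps: float) -> None:
        if self.estimate is None:
            self.estimate = sample_gbps
        else:
            self.estimate = self.alpha * sample_gbps + (1 - self.alpha) * self.estimate

def choose_recovery(kv_cache_bytes: int, prompt_tokens: int,
                    monitor: BandwidthMonitor, prefill_lut: dict[int, float]) -> str:
    """Migrate iff C_mig <= C_recomp, per the decision rule above."""
    c_mig = kv_cache_bytes / (monitor.estimate * 1e9 / 8)   # seconds
    # C_recomp: prefill time for the nearest profiled prompt length.
    nearest = min(prefill_lut, key=lambda n: abs(n - prompt_tokens))
    c_recomp = prefill_lut[nearest]
    return "migrate" if c_mig <= c_recomp else "recompute"

# Example: a congested link makes recomputation cheaper for a short prompt.
mon = BandwidthMonitor()
for sample in (25.0, 12.0, 8.0):   # bandwidth samples as congestion builds
    mon.update(sample)
lut = {512: 0.08, 2048: 0.30, 8192: 1.10}   # assumed profiled prefill times (s)
print(choose_recovery(512 * 819_200, 512, mon, lut))   # -> recompute
```

With a healthy 100 Gbps estimate the same call returns "migrate", which is the adaptivity a static policy lacks.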

Section 04

Experimental Validation: Performance Improvement and Model Accuracy

Tested on a two-node A100 cluster with OPT-13B:

  1. Crossover-point validation: the error between the cost model's predictions and measured recovery times is <2%, confirming the model's reliability;
  2. Throughput improvement: under a 90% load-skew scenario, RoM improves effective throughput by 8.6% over static strategies, while healthy-path performance matches native DistServe;
  3. Overhead analysis: normal operation adds only lightweight probing, failure-time decisions complete in O(1), and slot reservation keeps the system stable.
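Slot reservation can be sketched as a counting semaphore that caps concurrent migrations, so a burst of failures degrades to recomputation instead of saturating the cross-node link; the cap value and interface here are assumptions, not RoM's actual code.

```python
import threading

class MigrationSlots:
    """Cap concurrent KV cache migrations (illustrative sketch)."""
    def __init__(self, max_concurrent: int = 4):
        self._sem = threading.Semaphore(max_concurrent)

    def try_reserve(self) -> bool:
        # Non-blocking: if no slot is free, the scheduler falls back to recompute.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()

slots = MigrationSlots(max_concurrent=2)
decisions = ["migrate" if slots.try_reserve() else "recompute" for _ in range(3)]
print(decisions)   # third request falls back once both slots are taken
```

The non-blocking reservation is what keeps the failure-path decision O(1): a full slot pool is just another input to the migrate-or-recompute choice, never a queue to wait on.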

Section 05

Technical Significance: Insights from Network Awareness and Failure Path Optimization

RoM demonstrates that network dynamics matter in distributed AI systems and that the network should be treated as a first-class design concern. Its design philosophy of isolating complexity to the failure path improves system reliability without affecting healthy requests, and its adaptive decision framework extends to other scenarios that require dynamically selecting among execution paths.


Section 06

Limitations and Future Directions

RoM is currently built on the DistServe architecture, so porting it to other disaggregated systems requires adaptation; experiments used only OPT-13B, so the cost model needs recalibration for larger models (e.g., 70B) or multimodal models. Future directions include integrating request priorities, ML-driven cost prediction, and extending the approach to prefill-decoding load balancing.


Section 07

Summary: Value of RoM and Engineering Practice Recommendations

RoM provides an elegant solution for failure recovery in disaggregated LLM inference, improving system efficiency with zero overhead on healthy paths through network-aware adaptive decision-making. Insights for engineers: Do not assume the network is statically sufficient; instead, make intelligent decisions based on real-time conditions—this philosophy applies to a wide range of distributed system optimizations.