# RecomputeOrMigrate: A Network-Aware KV Cache Recovery Scheduler for Disaggregated LLM Inference

> RecomputeOrMigrate (RoM/KVRS) is a lightweight scheduler for disaggregated large language model (LLM) inference systems. It dynamically decides whether to migrate KV cache or recompute after a decoding GPU failure, optimizing the recovery strategy based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-04T05:08:43.000Z
- Last activity: 2026-05-04T05:24:36.109Z
- Popularity: 150.7
- Keywords: disaggregated inference, KV cache, failure recovery, network-aware scheduling, LLM serving, DistServe, high availability, distributed systems
- Page link: https://www.zingnex.cn/en/forum/thread/recomputeormigrate-llmkv
- Canonical: https://www.zingnex.cn/forum/thread/recomputeormigrate-llmkv
- Markdown source: floors_fallback

---

## Introduction: A Network-Aware KV Cache Recovery Scheduler

RecomputeOrMigrate (abbreviated RoM/KVRS) is a lightweight scheduler for disaggregated LLM inference systems that addresses the recovery decision after a decoding-GPU failure: should the affected requests' KV cache be migrated to a healthy GPU, or regenerated by recomputing the prefill? RoM chooses per request, based on real-time network bandwidth and prompt length. Experiments show it can improve effective throughput by 8.6%, offering new insights for the reliability design of disaggregated architectures.

## Background: The Rise of Disaggregated Architectures and Limitations of Static Strategies

Modern LLM serving systems increasingly adopt a prefill-decoding disaggregated architecture: the prefill phase processes prompts in batches to generate the KV cache, while the decoding phase is optimized for low-latency token generation. Systems like DistServe transfer KV cache efficiently over intra-node NVLink, but when a decoding GPU fails, its requests must be reassigned. Existing static strategies (always migrate) ignore network dynamics: cross-node Ethernet bandwidth is only 10-100 Gbps (6-60 times lower than NVLink), so migration can dominate recovery time under congestion, while the cost of recomputation grows with prompt length. A static policy cannot adapt to this trade-off.
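To make the bandwidth gap concrete, here is a back-of-envelope calculation. The model dimensions (40 layers, hidden size 5120, fp16) correspond to OPT-13B; the 25 Gbps Ethernet figure and the NVLink-class aggregate of 4800 Gbps (600 GB/s) are illustrative assumptions, not measurements from the post:

```python
# Back-of-envelope KV-cache migration time; all figures are illustrative.
layers, hidden, dtype_bytes = 40, 5120, 2            # OPT-13B, fp16
prompt_len = 2048
# K and V tensors per layer, one hidden-size vector per token.
kv_bytes = 2 * layers * prompt_len * hidden * dtype_bytes   # ~1.68 GB

def transfer_seconds(bandwidth_gbps):
    """Time to move the KV cache at a given link speed (Gbps -> bytes/s)."""
    return kv_bytes / (bandwidth_gbps * 1e9 / 8)

print(round(transfer_seconds(25), 3))     # ~0.537 s over 25 Gbps Ethernet
print(round(transfer_seconds(4800), 4))   # ~0.0028 s over NVLink-class links
```

At this scale, a congested Ethernet link turns migration from milliseconds into a substantial fraction of a second, which is exactly the regime where recomputing a short prompt can win.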

## Core Design of RoM: Network-Aware Adaptive Decision Mechanism

The core of RoM is a lightweight scheduler that evaluates recovery cost per request:

- **Migration cost**: C_mig = KV cache size / measured network bandwidth
- **Recomputation cost**: C_recomp = estimated prefill time (looked up from a table keyed by prompt length)

Decision rule: migrate if C_mig ≤ C_recomp, otherwise recompute. The system comprises four components: a bandwidth monitor (EWMA-smoothed estimates), node state exchange (gossip protocol for collecting cluster state), slot reservation (bounding concurrent migrations), and a recovery scheduler that combines these signals into the final decision. All complexity is isolated in the failure path; the normal path incurs zero overhead.

## Experimental Validation: Performance Improvement and Model Accuracy

Tested on a two-node A100 cluster with OPT-13B:
1. **Crossover validation**: the cost model's predicted migrate/recompute crossover point agrees with measurement within 2% error, supporting the model's reliability;
2. **Throughput improvement**: under a 90% load-skew scenario, RoM improves effective throughput by 8.6% over the static always-migrate strategy, while healthy-path performance matches native DistServe;
3. **Overhead analysis**: normal operation adds only lightweight bandwidth probing, failure-time decisions complete in O(1), and slot reservation keeps the system stable.

## Technical Significance: Insights from Network Awareness and Failure Path Optimization

RoM demonstrates that network dynamics matter in distributed AI systems: the network should be treated as a first-class citizen. Its design philosophy of isolating complexity to the failure path improves reliability without touching healthy requests, and its adaptive decision framework extends to other scenarios that require dynamically selecting among execution paths.

## Limitations and Future Directions

RoM currently targets the DistServe architecture, so porting it to other disaggregated systems requires adaptation. Experiments used only OPT-13B, so the cost model needs recalibration for larger models (e.g., 70B) or multimodal models. Future directions include integrating request priority, ML-driven cost prediction, and extending the mechanism to prefill-decoding load balancing.

## Summary: Value of RoM and Engineering Practice Recommendations

RoM provides an elegant solution for failure recovery in disaggregated LLM inference, improving system efficiency with zero overhead on healthy paths through network-aware adaptive decision-making. Insights for engineers: Do not assume the network is statically sufficient; instead, make intelligent decisions based on real-time conditions—this philosophy applies to a wide range of distributed system optimizations.
