Zing Forum


Fault Tolerance and Recovery for Distributed LLM Inference Systems: Token-Commit-Resume Strategy Cuts Recovery Time by 57%

This article analyzes in depth the CrashSafe project, a research-level fault tolerance system for distributed LLM inference. By comparing three strategies—Fail-Stop, Retry-from-Scratch, and Token-Commit-Resume—the project achieves a 57% reduction in recovery time and 65% savings in computing resources under process failure scenarios.

Distributed Systems · LLM Inference · Fault Tolerance · Failure Recovery · Token-Commit-Resume · FastAPI · Checkpointing · Compute Optimization
Published 2026-04-13 07:14 · Recent activity 2026-04-13 07:18 · Estimated read: 5 min

Section 01

[Introduction] Core Achievements of CrashSafe: A Distributed LLM Inference Fault Tolerance System

CrashSafe is a research-level fault tolerance system for distributed LLM inference. It implements and compares three strategies—Fail-Stop, Retry-from-Scratch, and Token-Commit-Resume—and achieves a 57% reduction in recovery time and 65% savings in computing resources under process failure scenarios, offering an effective approach to reliability optimization in distributed LLM inference.


Section 02

Background and Challenges: Reliability Pain Points in Distributed LLM Inference

As LLM model sizes grow, distributed inference has become standard, but a node crash fails the request and wastes the tokens already generated. Traditional strategies each have flaws: Fail-Stop sacrifices availability, while Retry-from-Scratch wastes massive amounts of computation—problems that are especially acute in compute-intensive LLM inference.


Section 03

CrashSafe Project and the Core Token-Commit-Resume Strategy

CrashSafe is a research system built on a FastAPI router–worker architecture that implements all three fault tolerance strategies. Its core innovation is the Token-Commit-Resume strategy: during generation, tokens are periodically persisted to a JSONL storage layer; on failure, recovery starts from the last checkpoint—completed tokens are loaded, and a new node continues generation from the breakpoint—balancing reliability and efficiency. The system architecture comprises four layers:

  1. Router layer: task distribution and failure detection.
  2. Worker node layer: inference execution and failure injection.
  3. Token storage layer: checkpoint persistence.
  4. Backend abstraction layer: supports Mock/Transformers/vLLM backends.
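To make the commit/resume cycle concrete, here is a minimal sketch of a JSONL-backed token checkpoint store. The class and method names (`TokenCheckpointStore`, `commit`, `resume`) are hypothetical illustrations, not CrashSafe's actual API; the sketch only assumes the append-to-JSONL and replay-on-recovery behavior described above.

```python
import json
import os
import tempfile

class TokenCheckpointStore:
    """Hypothetical JSONL-backed token store: append-only commits, replay on recovery."""

    def __init__(self, path):
        self.path = path

    def commit(self, request_id, tokens):
        # Append one checkpoint record per commit; one JSON object per line (JSONL).
        with open(self.path, "a") as f:
            f.write(json.dumps({"request_id": request_id, "tokens": tokens}) + "\n")

    def resume(self, request_id):
        # Replay the log and collect every token committed for this request,
        # so a replacement worker can continue generation from the breakpoint.
        committed = []
        if not os.path.exists(self.path):
            return committed
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                if record["request_id"] == request_id:
                    committed.extend(record["tokens"])
        return committed
```

Append-only JSONL keeps each commit a single sequential write, which is why the checkpoint overhead during normal operation can stay low.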


Section 04

Comparative Analysis of Three Fault Tolerance Strategies

  1. Fail-Stop: stops immediately upon failure and returns an error; minimal overhead, but the worst availability.
  2. Retry-from-Scratch: retries the full request; good reliability, but extremely high computing waste.
  3. Token-Commit-Resume: recovers incrementally from checkpoints; reduces redundant computation and balances availability with efficiency.
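The trade-off between the three strategies can be sketched as a simple cost model: how many tokens are lost (and must be regenerated, where retry happens) after a mid-request crash. The function below is an illustration, not project code; `checkpoint_interval` is an assumed commit-period knob.

```python
def wasted_tokens(strategy, tokens_done, checkpoint_interval):
    """Tokens lost after a crash that occurred `tokens_done` tokens into generation.

    `checkpoint_interval` is a hypothetical Token-Commit-Resume commit period.
    """
    if strategy == "fail_stop":
        # Request is abandoned: everything generated so far is lost outright.
        return tokens_done
    if strategy == "retry_from_scratch":
        # Full retry: everything generated so far is thrown away and redone.
        return tokens_done
    if strategy == "token_commit_resume":
        # Only tokens generated since the last committed checkpoint are lost.
        return tokens_done % checkpoint_interval
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, with a crash 100 tokens in and a commit every 8 tokens, full retry redoes all 100 tokens while commit-resume loses only the 4 uncommitted ones—the intuition behind the reported compute savings.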

Section 05

Experimental Results: Performance Advantages of the Token-Commit-Resume Strategy

In process-kill failure scenarios, the Token-Commit-Resume strategy delivers significant gains: a 57% reduction in recovery time (versus full retry), a 65% reduction in computing waste, and acceptable checkpoint overhead during normal operation. Experiments cover different concurrency levels, failure types, and prompt lengths, verifying the strategy's effectiveness.


Section 06

Implementation Details and Practical Application Value

Implementation highlights: configuration-driven design (system behavior adjusted via YAML), a modular architecture with clear separation of responsibilities, and a comprehensive test suite (smoke/unit/end-to-end tests). Application value: lower cloud computing costs, better user experience in real-time applications, and a reference design for layered architectures.
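A configuration-driven design typically maps a YAML file onto a typed config object. The sketch below shows one way this could look; the field names (`strategy`, `checkpoint_interval`, `backend`) are assumptions for illustration, not CrashSafe's actual schema, and the dict stands in for the output of a YAML loader such as `yaml.safe_load`.

```python
from dataclasses import dataclass

@dataclass
class FaultToleranceConfig:
    """Hypothetical schema for a YAML-driven fault tolerance config."""
    strategy: str = "token_commit_resume"   # fail_stop | retry_from_scratch | token_commit_resume
    checkpoint_interval: int = 16           # tokens between commits (assumed knob)
    backend: str = "mock"                   # mock | transformers | vllm

    @classmethod
    def from_dict(cls, raw):
        # `raw` is the dict a YAML loader would produce from the config file.
        return cls(**raw)
```

Centralizing these knobs in one object lets experiments swap strategies and backends without touching router or worker code.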


Section 07

Limitations and Future Optimization Directions

Current limitations: the JSONL storage layer may become a bottleneck under high concurrency. Future directions: replace JSONL with Redis for storage, adjust checkpoint intervals dynamically, and replicate checkpoints across multiple nodes to eliminate the storage layer as a single point of failure.