Lodestar: An Online Learning-Based LLM Inference Request Routing System

This article introduces Lodestar, an LLM inference scheduling system that continuously optimizes its request routing strategy through online learning. In experiments on public cloud GPU clusters, it improves average Time To First Token (TTFT) by a factor of 1.41 over state-of-the-art (SOTA) heuristic methods and learns an efficient routing strategy in approximately 5 minutes.

LLM inference serving, request routing, online learning, Lodestar, load balancing, GPU cluster scheduling
Published 2026-05-31 09:31 | Recent activity 2026-06-02 10:54 | Estimated read 10 min

Section 01

Lodestar: Guide to the Online Learning-Based LLM Inference Request Routing System

This article introduces the intelligent routing system proposed in the arXiv paper Lodestar: An Online-Learning LLM Inference Router, which aims to solve the request allocation problem in LLM inference services. Key highlights:

  • Problem Identification: Traditional load balancing methods cannot handle the complex characteristics of LLM inference, such as input dependency, batch processing/KV cache coupling, and non-linear latency.
  • Solution: Continuously optimize routing strategies through online learning to adapt to dynamic workloads and infrastructure changes.
  • Key Results: In experiments on public cloud GPU clusters, it improves average TTFT by a factor of 1.41 over SOTA heuristic methods and learns an efficient strategy in about 5 minutes.
  • Source Information: Paper link http://arxiv.org/abs/2606.00946v1, published on May 31, 2026.

Section 02

Core Challenges of LLM Inference Request Routing and Limitations of Traditional Methods

Unique Complexity of LLM Inference Routing

LLM inference request routing faces three major challenges:

  1. Input-Dependent Execution Characteristics: Latency differs widely between short prompts and long-context requests, so predictions based on historical averages are unreliable.
  2. Batch Processing and KV Cache Coupling: Continuous batching and prefix caching couple requests to one another, so a good allocation has to account for the batch state and cache reuse of each existing instance.
  3. Non-Linear Latency Response: Factors such as context length (attention cost grows quadratically with it), model configuration, and hardware heterogeneity make latency change non-linearly.

Shortcomings of Traditional Methods

  • Traditional Load Balancing Algorithms: Round-robin, least connections, etc., ignore request characteristics and instance state heterogeneity, leading to poor performance.
  • LLM-Specific Heuristics: Prefix-cache-aware, load-aware, and similar rules are static (unable to adapt to dynamic changes), only locally optimal, and hard to combine and tune.
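
As a rough illustration of the first point, the sketch below routes an alternating stream of long and short prompts with round-robin and compares the queued prompt tokens per instance against a simple token-aware rule. The function names and prompt lengths are made up for this example and do not come from the paper; the token-aware rule is itself only a static heuristic.

```python
from itertools import cycle

def round_robin(prompt_lengths, num_instances):
    """Assign requests in arrival order, ignoring their size."""
    load = [0] * num_instances
    targets = cycle(range(num_instances))
    for tokens in prompt_lengths:
        load[next(targets)] += tokens
    return load

def least_tokens(prompt_lengths, num_instances):
    """Assign each request to the instance with the fewest queued tokens."""
    load = [0] * num_instances
    for tokens in prompt_lengths:
        target = min(range(num_instances), key=lambda i: load[i])
        load[target] += tokens
    return load

# Alternating long and short prompts: a pattern round-robin handles badly.
prompts = [8000, 100, 8000, 100, 8000, 100]
rr_load = round_robin(prompts, 2)
aware_load = least_tokens(prompts, 2)
print("round-robin:", rr_load)
print("token-aware:", aware_load)
```

With this input, round-robin sends every long prompt to the same instance, while the token-aware rule spreads them out. Neither rule sees cache state or hardware differences, which is the gap the article attributes to heuristics in general.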

Section 03

Lodestar System Architecture and Core Components

Lodestar's Perceive-Learn-Decide Closed-Loop Architecture

Lodestar adopts a perceive-learn-decide closed-loop architecture with core components including:

  1. Real-Time State Collector: Continuously collects instance-level (load, KV cache, queue length), request-level (input/output length, prefix matching), and performance observation (TTFT, TPOT) data.
  2. Online Reward Predictor: A core innovation that uses an online learning model to estimate the reward (e.g., TTFT reduction) of routing a request to a certain instance, supporting multi-objective optimization.
  3. Routing Decider: Selects the instance with the highest predicted reward and forwards the request to it.
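
The decide step can be sketched as an argmax over predicted rewards. This is a minimal sketch under stated assumptions: `InstanceState`, `predict_reward`, and the coefficients are hypothetical placeholders, and the hand-written formula stands in for the model that Lodestar learns online.

```python
from dataclasses import dataclass

@dataclass
class InstanceState:
    name: str
    queue_length: int       # requests waiting on the instance
    kv_cache_usage: float   # fraction of KV cache in use, 0.0 to 1.0
    prefix_match: float     # fraction of the prompt already cached, 0.0 to 1.0

def predict_reward(prompt_tokens: int, state: InstanceState) -> float:
    """Placeholder reward: negative estimated TTFT in ms (higher is better).
    The coefficients are invented; a real system would learn this mapping."""
    uncached_tokens = prompt_tokens * (1.0 - state.prefix_match)
    estimated_ttft_ms = (0.05 * uncached_tokens
                         + 20.0 * state.queue_length
                         + 100.0 * state.kv_cache_usage)
    return -estimated_ttft_ms

def route(prompt_tokens: int, instances: list[InstanceState]) -> InstanceState:
    """Decide step: pick the instance with the highest predicted reward."""
    return max(instances, key=lambda s: predict_reward(prompt_tokens, s))

instances = [
    InstanceState("gpu-a", queue_length=1, kv_cache_usage=0.3, prefix_match=0.0),
    InstanceState("gpu-b", queue_length=3, kv_cache_usage=0.6, prefix_match=0.9),
]
chosen_long = route(4000, instances)   # long prompt with a mostly cached prefix on gpu-b
chosen_short = route(200, instances)   # short prompt, where queue length dominates
print(chosen_long.name, chosen_short.name)
```

In this toy setup the long prompt goes to the busier instance because of its cache hit, while the short prompt goes to the idle one, which is the kind of request-dependent decision a fixed rule struggles to express.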

Cloud-Native Design

  • Deployed as a sidecar, so inference engines such as vLLM need no code changes.
  • Standard HTTP/gRPC interfaces, supporting horizontal scaling.

Section 04

Experimental Results: Significant Performance Improvements

Comparison with SOTA Heuristics

Experimental results in public cloud GPU clusters:

Cluster Type     Average TTFT Improvement    P99 TTFT Improvement
Homogeneous      2.15x                       1.86x
Heterogeneous    4.38x                       4.42x
Average          1.41x                       1.47x

Fast Learning Feature

Lodestar can learn an efficient strategy in about 5 minutes, with low startup cost and quick adaptation to changes.

Advantages in Heterogeneous Clusters

The improvement is more significant in heterogeneous clusters (different generations of GPUs), as online learning can automatically match hardware characteristics with request features.

Section 05

Key Mechanisms for the Effectiveness of Online Learning

  1. Capturing Non-Linear Interactions: Neural network models can capture complex non-linear relationships between request features and instance states (e.g., a long-context request can see lower latency on an instance where its prefix is already cached).
  2. Adapting to Workload Drift: Continuous learning tracks temporal pattern changes (day/night, weekdays/weekends, burst traffic).
  3. Balancing Exploration and Exploitation: ε-greedy policies, uncertainty estimation, and progressive updates balance exploiting the best known strategy against exploring new ones.
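
Of the techniques named in the last point, ε-greedy is the simplest to show. The following is a generic sketch of that policy, not Lodestar's implementation; the reward values and the ε of 0.1 are arbitrary choices for illustration.

```python
import random

def epsilon_greedy(predicted_rewards, epsilon, rng):
    """Return an instance index: random with probability epsilon, else the best."""
    if rng.random() < epsilon:
        return rng.randrange(len(predicted_rewards))  # explore
    # exploit: index of the highest predicted reward
    return max(range(len(predicted_rewards)), key=lambda i: predicted_rewards[i])

rng = random.Random(0)  # seeded so the run is repeatable
rewards = [-250.0, -140.0, -300.0]  # hypothetical predicted rewards per instance
picks = [epsilon_greedy(rewards, epsilon=0.1, rng=rng) for _ in range(1000)]
share_best = picks.count(1) / len(picks)
print(f"share routed to the best-looking instance: {share_best:.2f}")
```

Most requests go to the instance that currently looks best, while the occasional random pick keeps producing observations for the others, so the predictor can notice when conditions change.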

Section 06

Production Deployment Considerations and Best Practices

Data Collection Overhead

  • Asynchronous sampling to avoid blocking the request path.
  • Reasonable sampling rate to balance data quality and overhead.
  • Use eBPF to reduce kernel-mode data collection costs.
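
The first two points can be sketched with a bounded queue and a background thread. `AsyncSampler` and its parameters are hypothetical names for this example; it covers only user-space sampling and does not touch eBPF.

```python
import queue
import random
import threading

class AsyncSampler:
    """Collects a sampled fraction of observations off the request path."""

    def __init__(self, sample_rate, max_pending=1024, seed=None):
        self.sample_rate = sample_rate
        self.pending = queue.Queue(maxsize=max_pending)
        self.collected = []
        self.dropped = 0
        self._rng = random.Random(seed)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, observation):
        """Called on the request path; never blocks."""
        if self._rng.random() >= self.sample_rate:
            return  # not sampled
        try:
            self.pending.put_nowait(observation)
        except queue.Full:
            self.dropped += 1  # shed the sample rather than delay the request

    def _drain(self):
        """Background thread: moves observations into the training buffer."""
        while True:
            item = self.pending.get()
            if item is None:  # sentinel: stop draining
                break
            self.collected.append(item)

    def close(self):
        """Flush remaining observations and stop the worker."""
        self.pending.put(None)
        self._worker.join()

sampler = AsyncSampler(sample_rate=1.0)
for ttft_ms in (120.0, 95.0, 210.0):
    sampler.record({"ttft_ms": ttft_ms})
sampler.close()
print(len(sampler.collected), "observations collected,", sampler.dropped, "dropped")
```

Dropping samples when the queue is full trades some data quality for a guarantee that collection never adds latency to a request, which is the balance the second point describes.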

Model Training Resources

  • Use lightweight models (e.g., small MLPs).
  • Incremental updates instead of full retraining.
  • Run learning components in independent processes without affecting inference services.
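
Incremental updating can be illustrated with one gradient step per observation. The sketch below uses a linear model rather than a small MLP to stay short, and the synthetic TTFT formula, feature choice, and learning rate are assumptions made for the example.

```python
class OnlineLinearPredictor:
    """Linear TTFT predictor updated one observation at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.01):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, observed):
        """One gradient step on squared error; returns the pre-update error."""
        error = self.predict(features) - observed
        self.bias -= self.learning_rate * error
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        return error

# Synthetic stream: TTFT (s) = 0.2 + 0.5 * prompt_kilotokens + 0.1 * queue_length
model = OnlineLinearPredictor(num_features=2, learning_rate=0.05)
stream = [((k, q), 0.2 + 0.5 * k + 0.1 * q)
          for k in (0.5, 1.0, 2.0) for q in (0, 1, 2)]
first_error = abs(model.update(*stream[0]))
for _ in range(500):
    for features, observed in stream:
        model.update(features, observed)
final_error = abs(model.predict((1.5, 1)) - (0.2 + 0.5 * 1.5 + 0.1 * 1))
print(f"first error {first_error:.3f}, final error {final_error:.6f}")
```

Each update touches only the current observation, so the cost per request is constant and the model never needs a full retraining pass.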

Cold Start and Fallback

  • Fall back to heuristic strategies when data is insufficient.
  • Monitor prediction confidence and increase exploration when confidence is low.
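
One way to express both points is a gate in front of the learned policy. This is a sketch with invented thresholds and a least-queue heuristic as the fallback; the source does not specify which heuristic or which confidence measure is used.

```python
def choose_instance(queue_lengths, predicted_ttft, num_observations,
                    confidence, min_observations=500, min_confidence=0.7):
    """Return (instance index, policy name).

    Falls back to a least-queue heuristic until the learned predictor has
    seen enough data and reports enough confidence. Thresholds are illustrative.
    """
    if num_observations < min_observations or confidence < min_confidence:
        idx = min(range(len(queue_lengths)), key=lambda i: queue_lengths[i])
        return idx, "heuristic"
    idx = min(range(len(predicted_ttft)), key=lambda i: predicted_ttft[i])
    return idx, "learned"

def exploration_rate(confidence, base=0.05, max_rate=0.5):
    """Explore more when confidence (0.0 to 1.0) is low."""
    return base + (max_rate - base) * (1.0 - confidence)

queues = [4, 1, 3]
predictions = [80.0, 150.0, 60.0]  # hypothetical predicted TTFT in ms per instance
cold = choose_instance(queues, predictions, num_observations=20, confidence=0.9)
warm = choose_instance(queues, predictions, num_observations=5000, confidence=0.9)
print(cold, warm, exploration_rate(0.2))
```

During cold start the shortest queue wins; once enough observations have accumulated, the predictor's lowest estimated TTFT takes over.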

Multi-Objective Optimization

  • Train dedicated predictors for different objectives (average latency, tail latency, throughput).
  • Weight parameters to balance objectives, supporting runtime switching.
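
A weighted sum is one simple way to combine dedicated predictors, with the weights acting as the runtime switch. The objective names, normalized scores, and weight profiles below are hypothetical; the source does not state how Lodestar combines objectives.

```python
def combined_score(predictions, weights):
    """Weighted sum of per-objective predictions; lower is better here."""
    return sum(weights[name] * predictions[name] for name in weights)

def pick(instances, weights):
    """Return the instance name with the lowest combined score."""
    return min(instances, key=lambda name: combined_score(instances[name], weights))

# Hypothetical outputs of three dedicated predictors per instance,
# each normalized to the range 0..1 (lower is better).
instances = {
    "gpu-a": {"avg_latency": 0.2, "tail_latency": 0.9, "throughput_cost": 0.5},
    "gpu-b": {"avg_latency": 0.5, "tail_latency": 0.3, "throughput_cost": 0.4},
}

# Two weight profiles that could be swapped at runtime.
latency_first = {"avg_latency": 0.8, "tail_latency": 0.1, "throughput_cost": 0.1}
tail_first = {"avg_latency": 0.1, "tail_latency": 0.8, "throughput_cost": 0.1}

print(pick(instances, latency_first), pick(instances, tail_first))
```

Switching the weight profile changes the routing decision without retraining any predictor, since each predictor only estimates its own objective.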

Section 07

Implications for LLM Service Architecture and Future Directions

Architectural Implications

  1. Paradigm Shift: From manual heuristics to data-driven online learning.
  2. Value of Online Adaptation: Static strategies struggle in dynamic environments; online learning adapts to them as they change.
  3. System-Level Optimization Space: Beyond model optimization, the scheduling layer offers substantial room for improvement.

Limitations and Future Directions

  • Limitations: Single-objective optimization, lack of global request sequence planning, limited generalization ability for new requests.
  • Future Directions: Multi-objective reinforcement learning, global scheduling algorithms, cross-cluster routing optimization, integration of model prediction and system feedback.