Zing Forum

GoodServe: A High-Goodput Serving System for Agentic LLM Inference on Heterogeneous GPUs

This article introduces GoodServe, a system for high-goodput serving of agentic LLM inference on heterogeneous GPU clusters. It combines a prediction-correction routing strategy, accurate output length estimation, and runtime request migration, and improves goodput by an average of 27.4% over existing methods.

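Tags: LLM Inference Serving · Heterogeneous GPUs · Agentic Applications · Goodput Optimization · Request Routing · Dynamic Migration · SLO Attainment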
Published 2026-05-16 16:01 · Recent activity 2026-05-19 10:21 · Estimated read 8 min

Section 01

Introduction: GoodServe, a High-Goodput Serving System for Agentic LLM Inference on Heterogeneous GPUs

GoodServe targets the scheduling problem of serving agentic LLM inference on heterogeneous GPU clusters. It rests on three core techniques: a prediction-correction routing strategy, accurate output length estimation, and runtime request migration. Together they raise goodput, the proportion of requests that meet their SLO, by an average of 27.4% compared to existing methods.


Section 02

New Challenges of Agentic LLM Inference and Background of Heterogeneous GPUs

As LLMs spread into agentic applications, the demands on inference serving have changed: an agentic application runs a multi-step workflow (planning, tool calling, and so on), so user experience depends on end-to-end latency rather than on the latency of any single step. At the same time, inference infrastructure is becoming heterogeneous. Resource pools mix GPUs of different generations (A100, H100, H200, etc.) that differ significantly in compute, memory capacity, and memory bandwidth. Scheduling requests efficiently across such a pool has become a key problem.


Section 03

Core Metric: Definition and Significance of Goodput

Goodput differs from traditional throughput (requests processed per unit time): it measures the proportion of requests that meet the Service Level Objective (SLO). For agentic applications, the SLO is usually an upper bound on end-to-end latency (e.g., a customer service agent requires 90% of requests to complete within 2 seconds). GoodServe aims to maximize this proportion rather than simply pursuing high concurrency.
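As a minimal sketch of the definition above, goodput can be computed as the fraction of completed requests whose end-to-end latency falls within the SLO. The names `CompletedRequest`, `goodput`, and `meets_target` are hypothetical and chosen for illustration; they are not part of GoodServe.

```python
from dataclasses import dataclass


@dataclass
class CompletedRequest:
    request_id: str
    e2e_latency_s: float  # end-to-end latency across all steps of the workflow


def goodput(requests: list[CompletedRequest], slo_s: float) -> float:
    """Fraction of requests whose end-to-end latency is within the SLO."""
    if not requests:
        return 0.0
    met = sum(1 for r in requests if r.e2e_latency_s <= slo_s)
    return met / len(requests)


def meets_target(requests: list[CompletedRequest], slo_s: float,
                 target_fraction: float) -> bool:
    """True if at least target_fraction of the requests met the SLO."""
    return goodput(requests, slo_s) >= target_fraction


# Illustrative latencies only, not measured data.
sample = [
    CompletedRequest("a", 0.8),
    CompletedRequest("b", 1.9),
    CompletedRequest("c", 2.4),
    CompletedRequest("d", 1.2),
]
print(goodput(sample, slo_s=2.0))
print(meets_target(sample, slo_s=2.0, target_fraction=0.9))
```

With the customer service example from the text, `slo_s` would be 2.0 and `target_fraction` would be 0.9. A system with high raw throughput can still fail this check if too many requests finish late.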


Section 04

GoodServe System Architecture: Prediction-Correction Routing Paradigm

GoodServe adopts a prediction-correction routing strategy, which includes three parts:

Prediction Module

  • Output Length Prediction: A lightweight predictor estimates the number of output tokens for requests, providing input for scheduling;
  • GPU State Estimation: Real-time tracking of queue length, memory usage, utilization, KV cache pressure, etc.
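The article does not say how the output length predictor works internally. One simple possibility, shown here purely as an assumption, is a history-based heuristic: predict a high quantile of the output lengths previously observed for the same kind of workflow step. The class name, the `step_type` key, and the default values are all hypothetical.

```python
import math
from collections import defaultdict


class HistoryLengthPredictor:
    """Predicts output tokens from past observations (illustrative heuristic)."""

    def __init__(self, quantile: float = 0.9, default_tokens: int = 256):
        self.quantile = quantile
        self.default_tokens = default_tokens  # used when no history exists yet
        self.history: dict[str, list[int]] = defaultdict(list)

    def observe(self, step_type: str, output_tokens: int) -> None:
        """Record the actual output length of a finished request."""
        self.history[step_type].append(output_tokens)

    def predict(self, step_type: str) -> int:
        """Return the nearest-rank quantile of observed lengths for this step type."""
        seen = sorted(self.history.get(step_type, []))
        if not seen:
            return self.default_tokens
        rank = max(1, math.ceil(self.quantile * len(seen)))
        return seen[rank - 1]


predictor = HistoryLengthPredictor(quantile=0.5)
for tokens in (120, 80, 200, 150):
    predictor.observe("tool_call", tokens)
print(predictor.predict("tool_call"))
print(predictor.predict("planning"))  # no history yet, falls back to the default
```

Choosing a high quantile biases the estimate upward, which trades some resource efficiency for a lower chance of underestimating a request and missing its SLO.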

Routing Decision

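The router follows a "just enough" strategy: it assigns each request to a GPU that is sufficient to meet its SLO, without over-allocating high-end GPUs or under-allocating resources, and it keeps load balanced across instances. The aim is to balance SLO satisfaction against resource efficiency.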
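The "just enough" idea can be sketched as follows: among the instances predicted to finish a request within its SLO, pick the least capable one, and fall back to the fastest instance when none qualifies. This is a simplified illustration, not GoodServe's actual routing algorithm. The `Instance` fields, the `tier` ranking, and the latency model (queueing delay plus decode time, ignoring prefill) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    tier: int                   # hypothetical ranking: higher = more capable GPU
    decode_tokens_per_s: float  # assumed steady decode rate for one request
    queue_delay_s: float        # estimated wait before the request starts


def predicted_latency_s(inst: Instance, predicted_output_tokens: int) -> float:
    """Queueing delay plus decode time; prefill is ignored in this sketch."""
    return inst.queue_delay_s + predicted_output_tokens / inst.decode_tokens_per_s


def route_just_enough(instances: list[Instance], predicted_output_tokens: int,
                      slo_s: float) -> Instance:
    feasible = [
        i for i in instances
        if predicted_latency_s(i, predicted_output_tokens) <= slo_s
    ]
    if feasible:
        # Lowest tier that still meets the SLO; break ties by predicted latency.
        return min(
            feasible,
            key=lambda i: (i.tier, predicted_latency_s(i, predicted_output_tokens)),
        )
    # No instance can meet the SLO: minimize the predicted overshoot instead.
    return min(instances, key=lambda i: predicted_latency_s(i, predicted_output_tokens))


# Illustrative numbers only.
pool = [
    Instance("a100-0", tier=1, decode_tokens_per_s=50.0, queue_delay_s=0.5),
    Instance("h100-0", tier=2, decode_tokens_per_s=120.0, queue_delay_s=0.2),
    Instance("h200-0", tier=3, decode_tokens_per_s=130.0, queue_delay_s=0.1),
]
print(route_just_enough(pool, predicted_output_tokens=200, slo_s=5.0).name)
print(route_just_enough(pool, predicted_output_tokens=400, slo_s=5.0).name)
```

Because the queueing delay enters the latency estimate, a heavily loaded instance drops out of the feasible set on its own, which gives a basic form of load balancing.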

Dynamic Migration

  • SLO Risk Monitoring: Periodically assess the risk of request timeout;
  • Migration Mechanism: Migrate high-risk requests to a more suitable instance, taking into account the request's KV cache, the target's capacity, the migration overhead, and the remaining workload.
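The migration check described above can be sketched as a comparison between a request's remaining latency budget and its predicted finishing time, both on the current instance and on each candidate target. The cost model below (KV cache transfer time plus remaining decode time) and all field names are assumptions made for illustration; the article does not give GoodServe's actual formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunningRequest:
    elapsed_s: float       # time already spent on this request
    slo_s: float           # end-to-end latency limit
    remaining_tokens: int  # predicted output tokens not yet generated
    kv_cache_gb: float     # size of the KV cache that would have to move


@dataclass
class Target:
    name: str
    decode_tokens_per_s: float
    free_memory_gb: float
    link_gb_per_s: float   # assumed bandwidth available for the KV cache transfer


def pick_migration_target(req: RunningRequest, current_tokens_per_s: float,
                          targets: list[Target]) -> Optional[Target]:
    """Return a target to migrate to, or None to leave the request in place."""
    budget_s = req.slo_s - req.elapsed_s
    if req.remaining_tokens / current_tokens_per_s <= budget_s:
        return None  # on track: no SLO risk, so no migration
    best, best_finish_s = None, None
    for t in targets:
        if t.free_memory_gb < req.kv_cache_gb:
            continue  # target cannot hold the KV cache
        transfer_s = req.kv_cache_gb / t.link_gb_per_s
        finish_s = transfer_s + req.remaining_tokens / t.decode_tokens_per_s
        if finish_s <= budget_s and (best_finish_s is None or finish_s < best_finish_s):
            best, best_finish_s = t, finish_s
    return best  # None if no target can rescue the request


# Illustrative numbers only.
req = RunningRequest(elapsed_s=2.0, slo_s=6.0, remaining_tokens=300, kv_cache_gb=4.0)
candidates = [
    Target("h100-0", decode_tokens_per_s=120.0, free_memory_gb=10.0, link_gb_per_s=20.0),
    Target("h200-0", decode_tokens_per_s=130.0, free_memory_gb=2.0, link_gb_per_s=20.0),
]
chosen = pick_migration_target(req, current_tokens_per_s=50.0, targets=candidates)
print(chosen.name if chosen else "stay")
```

Including the transfer time in the estimate means a migration is only chosen when it still pays off after its own overhead.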

Section 05

Heterogeneous Resource Modeling and Phase-Aware Scheduling

Device Capability Profiling

Performance characteristics of different GPU types:

GPU Type | Compute (relative) | Memory Capacity | Typical Use
A100     | Baseline           | 40/80 GB        | General inference
H100     | 2-3x A100          | 80 GB           | Large models, high concurrency
H200     | Similar to H100    | 141 GB          | Long context, large KV cache

Phase-Aware Scheduling

LLM inference consists of two phases: Prefill, which processes the whole prompt in parallel and is compute-bound, and Decode, which generates tokens one at a time autoregressively and is bound by memory bandwidth and capacity. GoodServe routes each phase to the GPU instances best suited to it.
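A rough sketch of phase-aware selection: score each GPU by the capability that matters for the phase (compute for Prefill, memory bandwidth for Decode), discounted by its current load. The scoring rule, the field names, and the relative capability values are illustrative assumptions, not measured specifications or GoodServe's actual policy.

```python
from dataclasses import dataclass


@dataclass
class GpuProfile:
    name: str
    relative_compute: float        # illustrative, A100 = 1.0
    relative_mem_bandwidth: float  # illustrative, A100 = 1.0
    active_requests: int           # current load on the instance


def phase_score(p: GpuProfile, phase: str) -> float:
    """Phase-relevant capability divided by (1 + load)."""
    capability = p.relative_compute if phase == "prefill" else p.relative_mem_bandwidth
    return capability / (1 + p.active_requests)


def pick_for_phase(profiles: list[GpuProfile], phase: str) -> GpuProfile:
    if phase not in ("prefill", "decode"):
        raise ValueError(f"unknown phase: {phase}")
    return max(profiles, key=lambda p: phase_score(p, phase))


# Made-up relative values chosen only to show the mechanism.
profiles = [
    GpuProfile("A100", relative_compute=1.0, relative_mem_bandwidth=1.0, active_requests=0),
    GpuProfile("H100", relative_compute=2.5, relative_mem_bandwidth=1.6, active_requests=1),
    GpuProfile("H200", relative_compute=2.4, relative_mem_bandwidth=2.4, active_requests=1),
]
print(pick_for_phase(profiles, "prefill").name)
print(pick_for_phase(profiles, "decode").name)
```

A real scheduler would also have to account for the cost of handing the KV cache from the Prefill instance to the Decode instance, which this sketch leaves out.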


Section 06

Experimental Evaluation: Goodput Improvement and Key Insights

Evaluation results on a heterogeneous A100/H100/H200 cluster:

  • Average goodput improvement of 27.4%;
  • To reach 95% SLO attainment, the required SLO scale is reduced by 20.1%;
  • To reach 99% SLO attainment, the required SLO scale is reduced by 33.0%;
  • In the best case, the improvement reaches 45.0% (95% SLO) and 80.5% (99% SLO).

Key Insights:

  1. Prediction accuracy directly affects routing quality;
  2. Dynamic migration has overhead, but it significantly improves the SLO satisfaction rate;
  3. Heterogeneity-aware strategies outperform approaches that treat all GPUs uniformly.

Section 07

Practical Deployment Value of GoodServe

Cost Optimization

  • Serve more users with the same hardware;
  • Meet the same service level with fewer GPUs, reducing procurement;
  • Make full use of heterogeneous devices.

User Experience Improvement

  • More stable response time;
  • Fewer timeouts and retries;
  • Smooth Agentic interaction.

Progressive Deployment

  • Modular design, allowing gradual introduction of features;
  • Compatible with existing frameworks (vLLM, TensorRT-LLM);
  • No need to modify models or training processes.

Section 08

Limitations and Future Directions

GoodServe still has room for improvement:

  • Prediction Model: The current predictor is heuristic; learning-based predictors are a direction for future work;
  • Global Optimization: The greedy routing strategy is not globally optimal; the global scheduling problem is NP-hard and needs further study;
  • Multi-Tenant Scenario: The experiments are single-tenant; multi-tenant deployment must also address isolation and fairness;
  • Model Heterogeneity: Future work could extend the system to models of different sizes serving the same application.