# GoodServe: A High-Throughput Service System for Agentic LLM Inference on Heterogeneous GPUs

> This article introduces GoodServe, a serving system for Agentic LLM inference on heterogeneous GPU clusters. It combines a prediction-correction routing strategy, accurate output length estimation, and runtime request migration, improving goodput by an average of 27.4% compared to existing methods.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-16T08:01:12.000Z
- Last activity: 2026-05-19T02:21:39.901Z
- Popularity: 91.7
- Keywords: LLM inference serving, heterogeneous GPUs, Agentic applications, goodput optimization, request routing, dynamic migration, SLO satisfaction rate
- Page link: https://www.zingnex.cn/en/forum/thread/goodserve-gpuagentic-llm
- Canonical: https://www.zingnex.cn/forum/thread/goodserve-gpuagentic-llm

---

## Introduction: GoodServe, a High-Goodput Service System for Agentic LLM Inference on Heterogeneous GPUs

GoodServe addresses the scheduling problem of Agentic LLM inference services on heterogeneous GPU clusters. It relies on three core techniques: a prediction-correction routing strategy, accurate output length estimation, and runtime request migration. Together they raise goodput, the proportion of requests that meet their SLO, by an average of 27.4% compared to existing methods.

## New Challenges of Agentic LLM Inference and Background of Heterogeneous GPUs

As LLMs are widely adopted in Agentic applications, the demands on inference services have changed. Agentic applications involve multi-step workflows (planning, tool calling, etc.), so user experience depends on end-to-end latency rather than on the latency of any single step. Meanwhile, inference infrastructure is becoming heterogeneous: resource pools mix GPUs of different generations (A100, H100, H200, etc.) that differ significantly in computing power, memory capacity, and bandwidth. Scheduling requests efficiently across such a pool has become a key problem.

## Core Metric: Definition and Significance of Goodput

Goodput differs from traditional throughput (the number of requests processed per unit time): it measures the **proportion of requests that meet the Service Level Objective (SLO)**. For Agentic applications, the SLO is usually an upper bound on end-to-end latency (e.g., a customer service Agent requires 90% of requests to complete within 2 seconds). GoodServe aims to maximize this proportion rather than simply pursuing high concurrency.
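
As a minimal illustration of the definition above, goodput can be computed from a list of measured end-to-end latencies and a latency SLO. The function name `goodput` and the sample latencies are hypothetical and are not taken from GoodServe.

```python
from typing import Sequence


def goodput(latencies_s: Sequence[float], slo_s: float) -> float:
    """Fraction of requests whose end-to-end latency is within the SLO."""
    if not latencies_s:
        return 0.0
    met = sum(1 for t in latencies_s if t <= slo_s)
    return met / len(latencies_s)


# Hypothetical end-to-end latencies (seconds) for ten agent requests.
latencies = [0.8, 1.2, 1.9, 2.4, 1.1, 0.9, 3.0, 1.7, 1.5, 1.0]

# Eight of the ten requests finish within the 2-second SLO.
print(goodput(latencies, slo_s=2.0))  # 0.8
```

With these sample numbers the system would miss a "90% within 2 seconds" target, even though all ten requests were processed and count toward throughput.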

## GoodServe System Architecture: Prediction-Correction Routing Paradigm

GoodServe adopts a prediction-correction routing strategy, which includes three parts:

### Prediction Module
- **Output Length Prediction**: A lightweight predictor estimates the number of output tokens for requests, providing input for scheduling;
- **GPU State Estimation**: Real-time tracking of queue length, memory usage, utilization, KV cache pressure, etc.
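
One simple way to build a lightweight output length predictor is an exponential moving average per workflow step type. The sketch below is an assumption made for illustration, not GoodServe's actual predictor; the class name `EmaLengthPredictor`, the `step_type` key, and the default of 256 tokens are all hypothetical.

```python
from typing import Dict


class EmaLengthPredictor:
    """Heuristic predictor: per-step-type moving average of output tokens."""

    def __init__(self, alpha: float = 0.2, default_tokens: float = 256.0):
        self.alpha = alpha                # weight given to the newest observation
        self.default_tokens = default_tokens  # used before any observation exists
        self.estimates: Dict[str, float] = {}

    def predict(self, step_type: str) -> float:
        """Return the current estimate of output tokens for this step type."""
        return self.estimates.get(step_type, self.default_tokens)

    def observe(self, step_type: str, actual_tokens: int) -> None:
        """Update the estimate once a request of this type has finished."""
        prev = self.estimates.get(step_type)
        if prev is None:
            self.estimates[step_type] = float(actual_tokens)
        else:
            self.estimates[step_type] = (
                (1 - self.alpha) * prev + self.alpha * actual_tokens
            )


predictor = EmaLengthPredictor(alpha=0.5)
predictor.observe("tool_call", 100)
predictor.observe("tool_call", 200)
print(predictor.predict("tool_call"))  # 150.0
print(predictor.predict("planning"))   # 256.0 (no observations yet)
```

The estimate would then be fed to the router together with the tracked GPU state.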

### Routing Decision
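Routing follows a "just enough" strategy: each request is sent to an instance that is sufficient to meet its SLO, without over-allocating high-spec GPUs or under-allocating resources. Load is balanced across instances so that SLO satisfaction and resource efficiency are traded off against each other.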
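
A "just enough" policy can be sketched as: among the instances predicted to meet the SLO, choose the least capable one, and fall back to the fastest instance if none qualifies. This is an illustrative sketch under simplifying assumptions, not GoodServe's routing algorithm. The `Instance` fields, the linear latency model, the `safety` margin, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    tier: int                   # lower = less capable GPU class (assumed ordering)
    decode_tokens_per_s: float  # assumed steady decode rate
    queue_delay_s: float        # estimated wait before the request starts


def estimate_latency_s(inst: Instance, predicted_tokens: float) -> float:
    """Crude latency model: queueing delay plus decode time."""
    return inst.queue_delay_s + predicted_tokens / inst.decode_tokens_per_s


def route_just_enough(
    instances: List[Instance],
    predicted_tokens: float,
    slo_s: float,
    safety: float = 0.9,
) -> Instance:
    """Pick the lowest-tier instance expected to finish within the SLO."""
    budget = safety * slo_s  # leave headroom for prediction error
    feasible = [
        i for i in instances if estimate_latency_s(i, predicted_tokens) <= budget
    ]
    if feasible:
        # Prefer the least capable tier; break ties by lower estimated latency.
        return min(
            feasible, key=lambda i: (i.tier, estimate_latency_s(i, predicted_tokens))
        )
    # No instance is expected to meet the SLO: minimize the violation.
    return min(instances, key=lambda i: estimate_latency_s(i, predicted_tokens))


pool = [
    Instance("a100-0", tier=0, decode_tokens_per_s=40.0, queue_delay_s=0.5),
    Instance("h100-0", tier=1, decode_tokens_per_s=90.0, queue_delay_s=0.2),
    Instance("h200-0", tier=2, decode_tokens_per_s=100.0, queue_delay_s=0.1),
]
for tokens in (100, 300, 1000):
    print(tokens, route_just_enough(pool, tokens, slo_s=4.0).name)
# 100 a100-0
# 300 h100-0
# 1000 h200-0
```

Short requests stay on the A100, which keeps the faster GPUs free for requests that actually need them. Because queueing delay is part of the estimate, a heavily loaded instance drops out of the feasible set on its own.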

### Dynamic Migration
- **SLO Risk Monitoring**: Periodically assess the risk of request timeout;
- **Migration Mechanism**: Migrate high-risk requests to appropriate instances, considering KV cache, target capacity, migration overhead, and remaining workload.
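
The migration factors listed above (KV cache, target capacity, migration overhead, remaining workload) can be combined into a simple decision rule: migrate only when the request is projected to miss its deadline where it is, the target can hold its KV cache, and the projected finish time after the transfer still meets the deadline. The sketch below is hypothetical; the field names, the linear cost model, and the numbers are assumptions and do not describe GoodServe's implementation.

```python
from dataclasses import dataclass


@dataclass
class RunningRequest:
    elapsed_s: float       # time since the request arrived
    deadline_s: float      # end-to-end SLO for this request
    remaining_tokens: int  # predicted output length minus tokens generated
    kv_cache_gb: float     # size of the KV cache that must be moved


def should_migrate(
    req: RunningRequest,
    src_tokens_per_s: float,
    dst_tokens_per_s: float,
    dst_free_gb: float,
    link_gb_per_s: float,
) -> bool:
    """Return True if moving the request is expected to rescue its SLO."""
    stay_finish = req.elapsed_s + req.remaining_tokens / src_tokens_per_s
    if stay_finish <= req.deadline_s:
        return False  # not at risk, so avoid the migration overhead
    if req.kv_cache_gb > dst_free_gb:
        return False  # target cannot hold the KV cache
    transfer_s = req.kv_cache_gb / link_gb_per_s  # migration overhead
    move_finish = (
        req.elapsed_s + transfer_s + req.remaining_tokens / dst_tokens_per_s
    )
    return move_finish <= req.deadline_s


req = RunningRequest(elapsed_s=1.0, deadline_s=5.0, remaining_tokens=200, kv_cache_gb=2.0)

# Staying would finish at 1.0 + 200/40 = 6.0 s, past the 5 s deadline.
# Moving would finish at 1.0 + 2/20 + 200/100 = 3.1 s.
print(should_migrate(req, 40.0, 100.0, dst_free_gb=10.0, link_gb_per_s=20.0))  # True

# Same request, but the target has only 1 GB free for KV cache.
print(should_migrate(req, 40.0, 100.0, dst_free_gb=1.0, link_gb_per_s=20.0))   # False
```

A periodic monitor would run a check like this over in-flight requests at each assessment interval.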

## Heterogeneous Resource Modeling and Phase-Aware Scheduling

### Device Capability Profiling
Performance characteristics of different GPU types:

| GPU Type | Computing Power | Memory Capacity | Application Scenario |
|----------|-----------------|-----------------|----------------------|
| A100 | Baseline | 40/80 GB | General inference |
| H100 | 2-3x A100 | 80 GB | Large models / high concurrency |
| H200 | Similar to H100 | 141 GB | Long context / large KV cache |

### Phase-Aware Scheduling
LLM inference is divided into two phases: Prefill (compute-intensive, highly parallel) and Decode (memory-intensive, autoregressive). GoodServe routes each phase to the GPU instances best suited to it.
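
To see why the two phases favor different GPUs, a rough roofline-style estimate helps: prefill time scales with compute throughput, while per-token decode time scales with memory bandwidth. The sketch below is a back-of-the-envelope model, not GoodServe's cost model. The peak figures are approximate datasheet values used as illustrative assumptions, and the 8B-parameter model, 2 bytes per parameter, and batch size of 1 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuProfile:
    name: str
    peak_tflops: float  # approx. dense 16-bit peak compute (assumed)
    mem_bw_tbps: float  # approx. HBM bandwidth in TB/s (assumed)


PROFILES = [
    GpuProfile("A100", peak_tflops=312.0, mem_bw_tbps=2.0),
    GpuProfile("H100", peak_tflops=989.0, mem_bw_tbps=3.35),
    GpuProfile("H200", peak_tflops=989.0, mem_bw_tbps=4.8),
]


def prefill_time_s(gpu: GpuProfile, params_b: float, prompt_tokens: int) -> float:
    """Compute-bound estimate: about 2 FLOPs per parameter per prompt token."""
    flops = 2.0 * params_b * 1e9 * prompt_tokens
    return flops / (gpu.peak_tflops * 1e12)


def decode_time_per_token_s(
    gpu: GpuProfile, params_b: float, bytes_per_param: int = 2
) -> float:
    """Bandwidth-bound estimate: each step streams all weights once
    (batch size 1, KV cache reads ignored)."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    return weight_bytes / (gpu.mem_bw_tbps * 1e12)


# Hypothetical workload: 8B-parameter model, 4096-token prompt.
best_prefill = min(PROFILES, key=lambda g: prefill_time_s(g, 8, 4096))
best_decode = min(PROFILES, key=lambda g: decode_time_per_token_s(g, 8))
print(best_prefill.name, best_decode.name)  # H100 H200
```

Under these assumptions H100 and H200 tie on prefill (the first one listed wins the tie), while H200 is fastest for decode because of its higher memory bandwidth. Real systems run well below peak, so a production scheduler would rely on measured profiles rather than datasheet numbers.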

## Experimental Evaluation: Goodput Improvement and Key Insights

Evaluation results on a heterogeneous A100/H100/H200 cluster:
- Goodput improves by 27.4% on average;
- Under a 95% SLO requirement, the required SLO scale is reduced by 20.1%;
- Under a 99% SLO requirement, the required SLO scale is reduced by 33.0%;
- In the best case, the improvement reaches 45.0% (95% SLO) and 80.5% (99% SLO).

Key Insights:
1. Prediction accuracy directly affects routing quality;
2. Dynamic migration adds overhead but significantly improves the SLO satisfaction rate;
3. Heterogeneity-aware strategies outperform methods that treat all GPUs uniformly.

## Practical Deployment Value of GoodServe

### Cost Optimization
- Serve more users with the same hardware;
- Purchase fewer GPUs to meet the same service level;
- Make full use of heterogeneous devices.

### User Experience Improvement
- More stable response time;
- Fewer timeouts and retries;
- Smooth Agentic interaction.

### Progressive Deployment
- Modular design, allowing gradual introduction of features;
- Compatible with existing frameworks (vLLM, TensorRT-LLM);
- No need to modify models or training processes.

## Limitations and Future Directions

GoodServe still has room for improvement:
- **Prediction Model**: The current predictor is heuristic; learning-based predictors are a direction for future work;
- **Global Optimization**: The greedy strategy is not globally optimal; the NP-hard global scheduling problem needs further study;
- **Multi-Tenant Scenario**: The experiments are single-tenant; isolation and fairness still need to be considered;
- **Model Heterogeneity**: Future work could extend to models of different sizes serving the same application.
