GoodServe: A High-Throughput Service System for Agentic LLM Inference on Heterogeneous GPUs

This article introduces GoodServe, a system for serving Agentic LLM inference on heterogeneous GPU clusters. It combines a prediction-correction routing strategy, accurate output length estimation, and runtime request migration, and improves goodput by an average of 27.4% compared with existing methods.

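Tags: LLM inference serving, Heterogeneous GPUs, Agentic applications, Goodput optimization, Request routing, Dynamic migration, SLO attainment rate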

Published 2026-05-16 16:01 | Recent activity 2026-05-19 10:21 | Estimated read 8 min

Section 01

Introduction: GoodServe—A High-Goodput Service System for Agentic LLM Inference on Heterogeneous GPUs

GoodServe targets the scheduling problem of serving Agentic LLM inference on heterogeneous GPU clusters. It combines three core techniques: prediction-correction routing, accurate output length estimation, and runtime request migration. Together they raise goodput, the proportion of requests that meet their SLO, by an average of 27.4% compared with existing methods.

Section 02

New Challenges of Agentic LLM Inference and Background of Heterogeneous GPUs

As LLMs spread into Agentic applications, the demands on inference serving have changed. Agentic applications run multi-step workflows (planning, tool calling, and so on), so user experience depends on end-to-end latency rather than on the latency of any single step. At the same time, inference infrastructure is becoming heterogeneous: resource pools mix GPUs of different generations (A100, H100, H200, etc.) that differ significantly in computing power, memory capacity, and bandwidth. Scheduling requests efficiently across such a pool has become the key problem.

Section 03

Core Metric: Definition and Significance of Goodput

Goodput is different from traditional Throughput (number of requests processed); it measures the proportion of requests that meet the Service Level Objective (SLO). For Agentic applications, SLO is usually an end-to-end latency upper limit (e.g., a customer service Agent requires 90% of requests to be completed within 2 seconds). The goal of GoodServe is to maximize this proportion, rather than simply pursuing high concurrency.
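To make the metric concrete, here is a minimal Python sketch of goodput as defined above: the fraction of requests whose end-to-end latency is within the SLO. The function name `goodput` and the sample latencies are illustrative assumptions, not part of GoodServe.

```python
from typing import Sequence


def goodput(latencies_s: Sequence[float], slo_s: float) -> float:
    """Return the fraction of requests whose end-to-end latency meets the SLO."""
    if not latencies_s:
        return 0.0
    met = sum(1 for latency in latencies_s if latency <= slo_s)
    return met / len(latencies_s)


# Hypothetical end-to-end latencies (seconds) for ten agent requests.
sample_latencies = [0.8, 1.2, 1.9, 2.4, 0.5, 1.1, 3.0, 1.7, 1.4, 0.9]
sample_goodput = goodput(sample_latencies, slo_s=2.0)
```

With these sample values, 8 of the 10 requests finish within 2 seconds, so the goodput is 0.8. All ten requests were processed, yet the service still falls short of a 90% target, which is why throughput alone is a misleading measure here.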

Section 04

GoodServe System Architecture: Prediction-Correction Routing Paradigm

GoodServe adopts a prediction-correction routing strategy, which includes three parts:

Prediction Module

Output Length Prediction: A lightweight predictor estimates the number of output tokens for requests, providing input for scheduling;
GPU State Estimation: Real-time tracking of queue length, memory usage, utilization, KV cache pressure, etc.
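The two prediction inputs could be represented roughly as follows. This is only a sketch under stated assumptions: the article does not describe GoodServe's predictor beyond calling it lightweight, so the running-mean heuristic, the `step_type` key, the default of 256 tokens, and all class and field names here are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class GpuState:
    """Snapshot of the per-instance signals listed above (field names assumed)."""
    queue_length: int          # requests waiting on this instance
    memory_used_frac: float    # fraction of GPU memory in use, 0.0 to 1.0
    utilization: float         # compute utilization, 0.0 to 1.0
    kv_cache_used_frac: float  # KV cache pressure, 0.0 to 1.0


class RunningMeanLengthPredictor:
    """Hypothetical heuristic: predict output tokens as the running mean of
    output lengths already observed for the same workflow step type."""

    def __init__(self, default_tokens: int = 256) -> None:
        self.default_tokens = default_tokens
        self._total = defaultdict(int)
        self._count = defaultdict(int)

    def observe(self, step_type: str, output_tokens: int) -> None:
        """Record the actual output length of a finished request."""
        self._total[step_type] += output_tokens
        self._count[step_type] += 1

    def predict(self, step_type: str) -> int:
        """Return the estimated output length, or the default if unseen."""
        if self._count[step_type] == 0:
            return self.default_tokens
        return round(self._total[step_type] / self._count[step_type])


predictor = RunningMeanLengthPredictor()
predictor.observe("tool_call", 100)
predictor.observe("tool_call", 200)
```

After the two observations, `predictor.predict("tool_call")` returns 150, while an unseen step type such as `"planning"` falls back to the default.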

Routing Decision

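Routing follows a "just enough" strategy: a request is neither over-allocated a high-spec GPU it does not need nor under-allocated resources that would make it miss its SLO. The router also balances load across instances, trading off SLO attainment against resource efficiency.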
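One way to read the "just enough" idea is: among the instances predicted to meet the SLO, pick the least capable one, and fall back to the fastest option only when none qualifies. The sketch below illustrates that reading. It is not GoodServe's routing algorithm; the latency model (queue delay plus tokens divided by decode rate), the instance names, and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Instance:
    name: str
    decode_tokens_per_s: float  # hypothetical decode rate, not a measured spec
    queue_delay_s: float        # estimated wait before the request starts


def estimate_latency_s(inst: Instance, predicted_tokens: int) -> float:
    """Very rough end-to-end estimate: queueing delay plus decode time."""
    return inst.queue_delay_s + predicted_tokens / inst.decode_tokens_per_s


def route_just_enough(
    instances: Sequence[Instance], predicted_tokens: int, slo_s: float
) -> Optional[Instance]:
    """Pick the least capable instance that is still predicted to meet the SLO."""
    feasible = [
        inst for inst in instances
        if estimate_latency_s(inst, predicted_tokens) <= slo_s
    ]
    if feasible:
        # Prefer the slowest feasible instance; break ties by shorter queue.
        return min(feasible, key=lambda i: (i.decode_tokens_per_s, i.queue_delay_s))
    # No instance is predicted to meet the SLO: minimize the estimated latency.
    return min(
        instances,
        key=lambda i: estimate_latency_s(i, predicted_tokens),
        default=None,
    )


pool = [
    Instance("gpu-small", decode_tokens_per_s=50.0, queue_delay_s=0.5),
    Instance("gpu-mid", decode_tokens_per_s=100.0, queue_delay_s=0.2),
    Instance("gpu-large", decode_tokens_per_s=110.0, queue_delay_s=0.1),
]
```

With a 2-second SLO, a 60-token request goes to `gpu-small`, a 150-token request goes to `gpu-mid`, and a 500-token request, which no instance can finish in time, falls back to `gpu-large`. Keeping short requests off the fastest instance leaves it free for requests that need it.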

Dynamic Migration

SLO Risk Monitoring: Periodically assess the risk of request timeout;
Migration Mechanism: Migrate high-risk requests to appropriate instances, considering KV cache, target capacity, migration overhead, and remaining workload.
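The migration decision can be sketched as a simple cost-benefit check over the factors listed above. This is an assumption-laden illustration, not the mechanism GoodServe implements: the risk test, the overhead model (KV cache size divided by transfer bandwidth), and every name and parameter below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningRequest:
    elapsed_s: float       # time already spent on this request
    remaining_tokens: int  # predicted output tokens still to generate
    kv_cache_gb: float     # size of the KV cache that would have to move
    slo_s: float           # end-to-end latency limit


def is_at_risk(req: RunningRequest, current_tokens_per_s: float) -> bool:
    """A request is at risk if its projected finish time exceeds the SLO."""
    projected = req.elapsed_s + req.remaining_tokens / current_tokens_per_s
    return projected > req.slo_s


def should_migrate(
    req: RunningRequest,
    current_tokens_per_s: float,
    target_tokens_per_s: float,
    target_free_kv_gb: float,
    transfer_gb_per_s: float,
) -> bool:
    """Migrate only if the request is at risk, the target has room for its
    KV cache, and it would still meet the SLO after paying the transfer cost."""
    if not is_at_risk(req, current_tokens_per_s):
        return False
    if req.kv_cache_gb > target_free_kv_gb:
        return False
    overhead_s = req.kv_cache_gb / transfer_gb_per_s
    finish_s = req.elapsed_s + overhead_s + req.remaining_tokens / target_tokens_per_s
    return finish_s <= req.slo_s


req = RunningRequest(elapsed_s=1.0, remaining_tokens=100, kv_cache_gb=2.0, slo_s=3.0)
decision = should_migrate(
    req,
    current_tokens_per_s=40.0,
    target_tokens_per_s=100.0,
    target_free_kv_gb=10.0,
    transfer_gb_per_s=10.0,
)
```

In this example the request would finish at 3.5 s if left in place, which misses the 3-second SLO. Moving it costs 0.2 s of transfer and it then finishes at about 2.2 s, so `decision` is `True`. If the target had only 1 GB of free KV cache, the check would reject the migration.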

Section 05

Heterogeneous Resource Modeling and Phase-Aware Scheduling

Device Capability Profiling

Performance characteristics of different GPU types:

GPU Type	Computing Power	Memory Capacity	Application Scenario
A100	Baseline	40/80GB	General Inference
H100	2-3x A100	80GB	Large Models/High Concurrency
H200	Similar to H100	141GB	Long Context/Large KV Cache

Phase-Aware Scheduling

LLM inference has two phases: Prefill, which is compute-intensive and highly parallel, and Decode, which is autoregressive and memory-bound (limited mainly by memory bandwidth). GoodServe routes each phase to the GPU instances best suited to it.
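A minimal sketch of phase-aware placement, assuming each instance has a known prefill rate and decode rate: the prefill phase goes to the instance with the lowest estimated prefill time, and the decode phase to the one with the lowest estimated decode time. The profile names and rates are invented for illustration, and the sketch ignores the cost of moving the KV cache between the two instances, which a real system would have to account for.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class InstanceProfile:
    name: str
    prefill_tokens_per_s: float  # hypothetical compute-bound prefill rate
    decode_tokens_per_s: float   # hypothetical bandwidth-bound decode rate


def place_phases(
    profiles: Sequence[InstanceProfile], prompt_tokens: int, output_tokens: int
) -> Tuple[str, str]:
    """Return (prefill instance, decode instance) with the lowest estimated
    time for each phase. Assumes a non-empty list of profiles."""
    prefill = min(profiles, key=lambda p: prompt_tokens / p.prefill_tokens_per_s)
    decode = min(profiles, key=lambda p: output_tokens / p.decode_tokens_per_s)
    return prefill.name, decode.name


profiles = [
    InstanceProfile("compute-rich", prefill_tokens_per_s=8000.0, decode_tokens_per_s=60.0),
    InstanceProfile("bandwidth-rich", prefill_tokens_per_s=6000.0, decode_tokens_per_s=90.0),
]
placement = place_phases(profiles, prompt_tokens=2000, output_tokens=300)
```

Here `placement` is `("compute-rich", "bandwidth-rich")`: the prompt is processed where compute is strongest, and token generation runs where memory bandwidth is strongest.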

Section 06

Experimental Evaluation: Goodput Improvement and Key Insights

Evaluation results on a heterogeneous A100/H100/H200 cluster:

Average goodput improvement of 27.4%;
With a 95% SLO attainment requirement, the required SLO scale is reduced by 20.1%;
With a 99% SLO attainment requirement, the required SLO scale is reduced by 33.0%;
In the best case, the improvement reaches 45.0% (95% SLO) and 80.5% (99% SLO).

Key Insights:

Prediction accuracy directly affects routing quality;
Dynamic migration, although with overhead, significantly improves SLO satisfaction rate;
Heterogeneity-aware strategies are better than uniform treatment methods.

Section 07

Practical Deployment Value of GoodServe

Cost Optimization

Serve more users with the same hardware;
Meet the same service level with fewer GPUs, reducing procurement;
Fully utilize heterogeneous devices.

User Experience Improvement

More stable response time;
Fewer timeouts and retries;
Smooth Agentic interaction.

Progressive Deployment

Modular design, allowing gradual introduction of features;
Compatible with existing frameworks (vLLM, TensorRT-LLM);
No need to modify models or training processes.

Section 08

Limitations and Future Directions

GoodServe still has room for improvement:

Prediction Model: The current predictor is heuristic; learning-based predictors are a direction for future work;
Global Optimization: The greedy routing strategy is not globally optimal, and finding a global optimum involves NP-hard problems that need further study;
Multi-Tenant Scenario: The experiments are single-tenant; multi-tenant deployments need to consider isolation and fairness;
Model Heterogeneity: Future work could extend the system to models of different sizes serving the same application.
