SAGA: A Workflow-Level Scheduling Revolution for AI Agent GPU Clusters

This article explains the SAGA scheduling system, the first GPU cluster scheduling framework that treats AI Agent workflows as atomic scheduling units. Through KV cache reuse and task completion time fairness optimization, it achieves a 1.64x reduction in end-to-end latency.

Tags: AI Agent, GPU scheduling, LLM inference, KV cache, distributed systems, vLLM, compound AI

Published 2026-05-01 17:05 | Recent activity 2026-05-04 11:21 | Estimated read 5 min

Section 01

Introduction: SAGA—A Revolutionary Framework for AI Agent GPU Cluster Scheduling

SAGA is presented as the first GPU cluster scheduling framework that treats an AI Agent workflow as the atomic scheduling unit. Existing scheduling paradigms treat each LLM call as an independent request. SAGA addresses this flaw with three core mechanisms: KV cache reuse prediction, session-affinity batching with work stealing, and Agent fair share optimization. Together they yield a 1.64x reduction in end-to-end latency, which makes SAGA a key building block for large-scale deployment of AI Agents.

Section 02

Background: Fundamental Flaws of Existing GPU Scheduling Paradigms

AI Agents complete complex tasks (e.g., code generation, web browsing) through chains of LLM calls that form tightly coupled workflows. Existing GPU schedulers (such as the one in vLLM), however, use the single request as the scheduling unit. Intermediate KV caches are discarded between calls, so the shared context is recomputed at every step, which amplifies end-to-end latency by 3-8x. This "request-level abstraction" is fundamentally mismatched with the "program-level abstraction" (the workflow as the unit) that AI Agents require, and the mismatch restricts large-scale deployment.
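To see why discarding the KV cache hurts chained calls, the following minimal sketch counts prefill tokens for a workflow whose context grows at every step. The function name, the fixed tokens-per-step value, and the assumption that each step re-sends the full accumulated context are illustrative simplifications, not details from SAGA. It counts tokens only and is not a model of measured latency.

```python
def prefill_tokens(steps, tokens_per_step, reuse_cache):
    """Count tokens that must be prefilled across a chain of LLM calls.

    Assumption: every step appends `tokens_per_step` new tokens to a shared
    context, and each call needs the whole context in the KV cache.
    """
    total = 0
    context = 0  # tokens already produced by earlier steps
    for _ in range(steps):
        if reuse_cache:
            # Earlier context is still cached: only new tokens are processed.
            total += tokens_per_step
        else:
            # Cache was discarded: the entire context is recomputed.
            total += context + tokens_per_step
        context += tokens_per_step
    return total


without_reuse = prefill_tokens(5, 1000, reuse_cache=False)
with_reuse = prefill_tokens(5, 1000, reuse_cache=True)
print(without_reuse, with_reuse, without_reuse / with_reuse)
```

In this toy model a 5-step workflow with 1,000 new tokens per step prefills 15,000 tokens without reuse and 5,000 with reuse. The gap widens as workflows get longer, because the recomputed work grows quadratically with the number of steps.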

Section 03

Three Core Innovative Mechanisms of SAGA

1. Agent Execution Graph and KV Cache Reuse Prediction

SAGA introduces the Agent Execution Graph: Agents explicitly declare their workflow structure, which lets the scheduler predict cross-step KV cache reuse opportunities. The resulting cache management comes close to Belady's optimal offline policy (within 1.31x).
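Belady's policy, the reference point mentioned above, evicts the cache entry whose next use lies furthest in the future. The sketch below simulates that policy for per-session KV caches. It assumes the order of future steps is known in advance, for example because it was derived from declared execution graphs. The `schedule` list and both function names are hypothetical, and this is the textbook offline policy, not SAGA's own predictor.

```python
import math


def next_use(schedule, start, session):
    """Index of the first step at or after `start` that belongs to `session`."""
    for t in range(start, len(schedule)):
        if schedule[t] == session:
            return t
    return math.inf  # the session never runs again


def simulate_belady(schedule, capacity):
    """Count KV cache hits when `capacity` session caches fit in GPU memory."""
    cached = set()
    hits = 0
    for t, session in enumerate(schedule):
        if session in cached:
            hits += 1
            continue
        if len(cached) >= capacity:
            # Evict the session whose next step is furthest away.
            victim = max(cached, key=lambda s: next_use(schedule, t + 1, s))
            cached.remove(victim)
        cached.add(session)
    return hits


# Three agent sessions interleaved, with room for two KV caches.
print(simulate_belady(["a", "b", "c", "a", "b", "c"], capacity=2))
```

On this access pattern a least-recently-used policy would miss every time, while the look-ahead policy keeps the caches that are about to be reused. A declared execution graph gives an online scheduler the kind of future knowledge that makes approximating this behavior possible.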

2. Session-Affinity Batching and Work Stealing

Requests from the same Agent workflow are preferentially scheduled onto the same GPU so that their KV cache can be reused. At the same time, a work stealing mechanism balances the load, which prevents affinity from overloading individual GPUs.
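The interplay of affinity and work stealing can be sketched with one queue per GPU. Everything here is an illustrative assumption: the class name, the `steal_threshold` parameter, and the choice to steal from the tail of the longest queue are not taken from SAGA.

```python
from collections import deque


class AffinityScheduler:
    """Toy scheduler: sessions stick to one GPU, idle GPUs steal work."""

    def __init__(self, num_gpus, steal_threshold=2):
        self.queues = [deque() for _ in range(num_gpus)]
        self.home = {}  # session id -> index of the GPU holding its KV cache
        self.steal_threshold = steal_threshold

    def submit(self, session, request):
        # A new session goes to the shortest queue; later requests follow it.
        if session not in self.home:
            self.home[session] = min(
                range(len(self.queues)), key=lambda g: len(self.queues[g])
            )
        self.queues[self.home[session]].append((session, request))

    def next_for(self, gpu):
        """Return the next (session, request) for `gpu`, or None if idle."""
        if self.queues[gpu]:
            return self.queues[gpu].popleft()
        # Idle GPU: steal from the longest queue if it is backed up enough.
        victim = max(range(len(self.queues)), key=lambda g: len(self.queues[g]))
        if len(self.queues[victim]) >= self.steal_threshold:
            # The stolen request loses cache affinity and pays a recompute cost.
            return self.queues[victim].pop()
        return None


sched = AffinityScheduler(num_gpus=2)
for session, request in [("A", "a1"), ("B", "b1"), ("A", "a2"), ("A", "a3")]:
    sched.submit(session, request)
print(sched.next_for(1))  # GPU 1 serves its own session first
print(sched.next_for(1))  # then steals from the backed-up GPU 0
```

The threshold captures the trade-off: stealing a request gives up its cached context, so it is only worthwhile when the owning GPU is backed up enough that waiting would cost more than recomputing.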

3. Agent Fair Share and Bounded Deviation Guarantee

Fairness is measured by task completion time. Each Agent receives a proportional share of resources, with a provable bound on how far it can deviate from that share, which prevents complex tasks from monopolizing the cluster.
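The article does not give SAGA's fairness formula, so the following is a generic weighted fair sharing sketch rather than SAGA's algorithm. It always runs the Agent with the least weight-normalized service, which keeps the gap between any two Agents' normalized service bounded by the largest step cost divided by the smallest weight. The names `weights` and `step_costs` and the numbers are hypothetical.

```python
def fair_share_schedule(weights, step_costs, rounds):
    """Run `rounds` steps, each time picking the least-served agent.

    weights:    agent -> share weight (larger means a bigger share)
    step_costs: agent -> GPU time consumed by one step of that agent
    """
    service = {agent: 0.0 for agent in weights}
    order = []
    for _ in range(rounds):
        # Normalized service decides who runs; the name breaks ties.
        agent = min(service, key=lambda a: (service[a] / weights[a], a))
        service[agent] += step_costs[agent]
        order.append(agent)
    return order, service


# Agent "y" has steps three times as expensive as agent "x".
order, service = fair_share_schedule(
    weights={"x": 1.0, "y": 1.0},
    step_costs={"x": 1.0, "y": 3.0},
    rounds=8,
)
print(order)
print(service)
```

With equal weights, the Agent with expensive steps runs less often, so both end up with the same total GPU time instead of the heavier Agent crowding out the lighter one.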

Section 04

Experimental Evidence: Performance of SAGA

SAGA was tested on a 64-GPU cluster with SWE-bench and WebArena benchmark workloads:

Task completion time was reduced by 1.64x in geometric mean (p<0.001);
GPU memory utilization increased by 1.22x;
SLO attainment reached 99.2% in multi-tenant scenarios;
Peak throughput was 30% lower than with pure batch scheduling, because SAGA prioritizes latency (a reasonable trade-off, since Agent workloads are latency-sensitive).
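For readers unfamiliar with the metric, a geometric mean speedup averages per-task ratios multiplicatively, so a single very long task cannot dominate the result. The sketch below shows the computation. The timing values are made up purely for illustration and are not SAGA measurements.

```python
import math


def geometric_mean_speedup(baseline_times, new_times):
    """Geometric mean of per-task ratios baseline / new."""
    ratios = [b / n for b, n in zip(baseline_times, new_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical completion times in seconds for two tasks.
baseline = [40.0, 10.0]
improved = [10.0, 10.0]
print(geometric_mean_speedup(baseline, improved))
```

In this example one task runs 4x faster and the other is unchanged, giving a geometric mean speedup of about 2x, whereas the arithmetic mean of the two ratios would be 2.5x.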

Section 05

Conclusion: The Significance of SAGA for AI Infrastructure

SAGA identifies the essential difference between AI Agent workloads and traditional LLM inference, and it removes the resulting performance bottlenecks by raising scheduling to the workflow level. This makes it an important advance in AI infrastructure. Beyond improving the execution efficiency of Agent tasks, it points to a fundamental principle of AI system design: the abstraction level of the workload must match the application paradigm.

Section 06

Future Directions and Technical Insights

Research insights from SAGA:

Programming model: Agent frameworks need to declare execution graphs explicitly;
Hardware design: GPUs need native support for cross-step state retention;
Cloud-native orchestration: Kubernetes and similar orchestrators need to introduce workflow-level scheduling primitives;
Billing model: billing per workflow is more reasonable than billing per token.
