Zing Forum


HexAGenT: A Heterogeneity-Aware Scheduling System for Agent Workflows

HexAGenT is a workflow-aware scheduler for agent LLM applications. Using online DAG modeling and risk-aware prioritization, it reduces workflow-level latency and improves SLO compliance on heterogeneous GPU clusters, narrowing SLO gaps by an average of 20.1% at a 95% compliance rate and 33% at a 99% compliance rate in the reported experiments.

Tags: Agent LLM, workflow scheduling, heterogeneous GPU, prefill-decode disaggregation, SLO optimization, online DAG, LLM serving
Published 2026-05-16 05:09 | Recent activity 2026-05-19 10:54 | Estimated read 6 min

Section 01

HexAGenT: Introduction to the Heterogeneity-Aware Scheduling System for Agent Workflows

HexAGenT is a workflow-aware scheduler for agent LLM applications, designed to optimize end-to-end workflow latency and SLO compliance on heterogeneous GPU clusters. Its core techniques are online DAG modeling, risk-aware prioritization, and joint resource selection; together they narrow SLO gaps and make better use of heterogeneous resources.


Section 02

Workflow Scheduling Challenges for Agent LLM Applications

Agent LLM applications turn a user request into a multi-step workflow, so what matters is end-to-end latency rather than the performance of any single call. Scheduling these workflows faces three major challenges:
1. Workflow dependencies are revealed gradually at runtime, so decisions must be made online.
2. In heterogeneous GPU clusters (e.g., mixed A100/H100/H200), the prefill and decode phases have very different resource requirements.
3. Meeting workflow-level SLO goals requires a global view rather than per-call decisions.


Section 03

HexAGenT Architecture: Workflow-Aware Intelligent Scheduling Design

HexAGenT's core design has four parts:
1. Online DAG modeling: dynamically tracks the workflow structure and identifies ready calls and bottlenecks.
2. Completion time estimation: provides a baseline for decisions, based on the time consumed by completed calls, predictions for pending calls, and system load.
3. Risk-aware priority: prioritizes calls that strongly affect workflow completion time and carry high risk.
4. Joint resource selection: chooses the prefill GPU, the decode GPU, and the local queue priority together, taking KV cache and transfer latency into account.
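The first three parts can be illustrated with a minimal Python sketch. It is not HexAGenT's implementation: the `OnlineDag` class, the per-call `est_seconds` predictions, and the priority formula (projected finish time divided by the SLO) are all assumptions chosen for illustration. It shows a DAG that grows as calls are revealed, a ready-call query, a remaining-time estimate along the longest known path, and a risk score used for ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One LLM call; est_seconds is a predicted run time (assumed to be given)."""
    name: str
    est_seconds: float
    parents: set = field(default_factory=set)
    done: bool = False


class OnlineDag:
    """Hypothetical workflow DAG that grows as the agent reveals new calls."""

    def __init__(self, slo_seconds):
        self.slo_seconds = slo_seconds
        self.calls = {}

    def add_call(self, name, est_seconds, parents=()):
        # Dependencies are only known once the agent issues the call.
        self.calls[name] = Call(name, est_seconds, set(parents))

    def complete(self, name):
        self.calls[name].done = True

    def ready(self):
        """Unfinished calls whose parents have all finished."""
        return [c.name for c in self.calls.values()
                if not c.done and all(self.calls[p].done for p in c.parents)]

    def remaining_path(self, name):
        """Longest chain of estimated time from this call to the end of the known DAG."""
        children = [c.name for c in self.calls.values() if name in c.parents]
        tail = max((self.remaining_path(ch) for ch in children), default=0.0)
        return self.calls[name].est_seconds + tail

    def priority(self, name, elapsed_seconds):
        """Assumed risk score: projected workflow finish time relative to the SLO."""
        return (elapsed_seconds + self.remaining_path(name)) / self.slo_seconds


# Usage with made-up numbers: "plan" has finished and revealed three more calls.
dag = OnlineDag(slo_seconds=10.0)
dag.add_call("plan", 1.0)
dag.complete("plan")
dag.add_call("search", 2.0, parents=["plan"])
dag.add_call("summarize", 6.0, parents=["plan"])
dag.add_call("answer", 1.5, parents=["search", "summarize"])

ranked = sorted(dag.ready(),
                key=lambda n: dag.priority(n, elapsed_seconds=1.0),
                reverse=True)
print(ranked)  # ['summarize', 'search']
```

Here `summarize` is ranked first because the path through it (6.0 s plus 1.5 s for `answer`) leaves the least slack before the 10-second SLO. A real scheduler would also fold in system load and prediction uncertainty, which this sketch omits.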


Section 04

Experimental Results: SLO Improvement and Heterogeneous Resource Utilization

Experiments on heterogeneous A100/H100/H200 clusters show two results:
1. Smaller SLO gaps: an average reduction of 20.1% (up to 45%) at a 95% compliance rate, and an average reduction of 33% (up to 80.5%) at a 99% compliance rate.
2. Better use of heterogeneous resources: according to the results, A100 is suitable for compute-intensive prefill and H100/H200 are suitable for long-sequence decoding, and the scheduler dynamically matches tasks to GPU types.


Section 05

Technical Insights: Key Factors for HexAGenT's Effectiveness

HexAGenT's success stems from three key insights:
1. Workflow-level optimization: prioritizes critical-path calls to minimize end-to-end latency.
2. Heterogeneity-aware matching: selects the best GPU based on task characteristics (prefill/decode requirements, sequence length, etc.).
3. Online adaptability: adjusts its strategy dynamically as the dependency structure is revealed at runtime.
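Heterogeneity-aware matching can be pictured as a small cost model over (prefill GPU, decode GPU) pairs. The sketch below is an assumption-laden illustration, not the system's actual algorithm: the `Gpu` profile, the throughput figures, the fixed `kv_transfer_s` cost, and the exhaustive search are all hypothetical, and the GPU names are placeholders rather than real hardware measurements.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Gpu:
    """Hypothetical GPU profile; all numbers are illustrative, not measured."""
    name: str
    prefill_tok_per_s: float  # prompt tokens processed per second
    decode_tok_per_s: float   # output tokens generated per second
    queue_wait_s: float       # current local queueing delay


def estimate_latency(prefill, decode, prompt_tokens, output_tokens, kv_transfer_s):
    """Estimated latency of one call for a given (prefill GPU, decode GPU) pair."""
    latency = prefill.queue_wait_s + prompt_tokens / prefill.prefill_tok_per_s
    if prefill.name != decode.name:
        # Splitting the phases means moving the KV cache and waiting in a second queue.
        latency += kv_transfer_s + decode.queue_wait_s
    latency += output_tokens / decode.decode_tok_per_s
    return latency


def choose_placement(gpus, prompt_tokens, output_tokens, kv_transfer_s=0.05):
    """Pick the pair with the lowest estimated latency by exhaustive search."""
    return min(
        product(gpus, repeat=2),
        key=lambda pair: estimate_latency(
            pair[0], pair[1], prompt_tokens, output_tokens, kv_transfer_s),
    )


# Usage with made-up profiles: one GPU is better at prefill, the other at decode.
gpus = [
    Gpu("fast-prefill", prefill_tok_per_s=8000, decode_tok_per_s=40, queue_wait_s=0.0),
    Gpu("fast-decode", prefill_tok_per_s=4000, decode_tok_per_s=80, queue_wait_s=0.0),
]

long_call = choose_placement(gpus, prompt_tokens=4000, output_tokens=400)
short_call = choose_placement(gpus, prompt_tokens=100, output_tokens=2)
print([g.name for g in long_call])   # ['fast-prefill', 'fast-decode']
print([g.name for g in short_call])  # ['fast-decode', 'fast-decode']
```

With these assumed numbers, the long call is split across the two GPUs because the decode savings outweigh the KV transfer cost, while the short call stays on a single GPU because the transfer would dominate. This is the kind of trade-off that task characteristics such as sequence length feed into.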


Section 06

Practical Deployment Value of HexAGenT

For production environments, HexAGenT offers three benefits:
1. Cost-effectiveness: the same hardware can serve more users or more complex applications, lowering hardware costs.
2. SLO guarantees: the improvement at the 99% compliance rate means better tail latency and a more consistent user experience.
3. Mixed-cluster utilization: makes efficient use of mixed GPU configurations in data centers instead of leaving each GPU type as an isolated resource pool.


Section 07

Limitations and Future Exploration Directions

HexAGenT still has room for improvement in three areas:
1. Adapting to more complex agent patterns (e.g., loops, parallel tool calls).
2. Supporting multi-tenant scenarios, balancing fairness against global efficiency.
3. Working together with model-level optimizations (speculative decoding, quantization, etc.) to maximize end-to-end efficiency.


Section 08

Conclusion: A New Direction for Agent LLM Service Optimization

HexAGenT marks a shift in agent LLM serving optimization from the single-call level to the workflow level. Through techniques such as online DAG modeling and risk-aware prioritization, it achieves significant SLO improvements on heterogeneous clusters. As agent applications become mainstream, workflow-aware scheduling of this kind will become increasingly important.