Zing Forum

Reading

Chimera: A Latency and Performance-Aware Multi-Agent Service System for Heterogeneous LLM Clusters

Chimera is a predictive scheduling system that optimizes end-to-end latency and task performance of multi-agent workflows on heterogeneous large language model (LLM) clusters through semantic routing, output length prediction, and load balancing.

Tags: LLM serving · heterogeneous clusters · multi-agent · predictive scheduling · load balancing · latency optimization
Published 2026-03-24 01:01 · Recent activity 2026-03-27 12:50 · Estimated read 3 min

Section 01

Introduction



Section 02

Problem Background

Multi-agent applications typically execute complex tasks as multi-stage workflows, where each stage is an LLM call whose output serves as context for the subsequent stages.
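The staged structure above can be sketched as follows. This is a minimal illustration, not the paper's implementation; `call_llm` is a hypothetical stand-in for a real model endpoint.

```python
# Minimal sketch of a multi-stage agent workflow: each stage is an LLM
# call, and its output becomes the context for the next stage.

def call_llm(model: str, prompt: str) -> str:
    # Placeholder: a real implementation would query an LLM serving endpoint.
    return f"[{model} output for: {prompt[:30]}...]"

def run_workflow(stages: list[tuple[str, str]], task: str) -> str:
    """Run (stage_name, model) pairs in order, threading context through."""
    context = task
    for stage_name, model in stages:
        prompt = f"Stage '{stage_name}':\n{context}"
        context = call_llm(model, prompt)  # output feeds the next stage
    return context

result = run_workflow(
    [("plan", "small-model"), ("solve", "large-model"), ("verify", "small-model")],
    "Write a function that reverses a string.",
)
```

Because every stage's output is folded into the next prompt, the latency and quality of each LLM call compound across the workflow, which is what makes per-stage model selection matter.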

Most existing LLM serving systems assume a homogeneous cluster of identical model replicas. This overlooks the potential of heterogeneous deployment: combining models of different scales and capabilities enables a finer-grained trade-off between latency and task performance.


Section 03

Chimera System

The research team proposes Chimera, a predictive scheduling system for multi-agent workflows on heterogeneous LLM clusters:


Section 04

Core Technologies

  1. Semantic Routing: estimate a confidence score for each model on each request and select the most suitable model.

  2. Output Length Prediction: predict the total remaining output length of the workflow to inform scheduling decisions.

  3. Load Balancing: use the number of in-flight predicted tokens to estimate each model's congestion level.
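A rough sketch of how these three pieces could fit together: route each request to the replica with the best confidence score, penalized by its congestion as measured in in-flight predicted tokens. The scoring function, the `alpha` weight, and the data layout here are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of predictive scheduling on a heterogeneous cluster:
# combine a per-request semantic-routing confidence with a congestion
# penalty derived from in-flight predicted output tokens.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    inflight_tokens: int = 0  # sum of predicted remaining output tokens queued

def schedule(request_conf: dict[str, float], predicted_len: int,
             replicas: list[Replica], alpha: float = 1e-3) -> Replica:
    """Pick the replica maximizing confidence minus weighted congestion."""
    best = max(
        replicas,
        key=lambda r: request_conf[r.name] - alpha * r.inflight_tokens,
    )
    # Charge the chosen replica with this request's predicted output length,
    # so subsequent scheduling decisions see the updated congestion.
    best.inflight_tokens += predicted_len
    return best

replicas = [Replica("7b"), Replica("70b", inflight_tokens=4000)]
conf = {"7b": 0.62, "70b": 0.91}  # hypothetical semantic-routing scores
chosen = schedule(conf, predicted_len=512, replicas=replicas)
```

In this toy run the congested large model loses to the idle small one despite its higher confidence, which is the kind of latency/performance trade-off the heterogeneous setup is meant to exploit.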


Section 05

Experimental Results

Evaluated on representative agent workflows for code generation and mathematical reasoning, Chimera:

  • Reduces end-to-end latency by 1.2-2.4x
  • Improves task performance by 8.0-9.5 percentage points
  • Tracks the optimal latency-performance frontier compared to competitive baselines like vLLM

Section 06

Technical Significance

Chimera demonstrates the potential of heterogeneous LLM clusters for multi-agent serving and points toward new directions for future LLM serving architectures.