# Chimera: A Latency and Performance-Aware Multi-Agent Service System for Heterogeneous LLM Clusters

> Chimera is a predictive scheduling system that optimizes end-to-end latency and task performance of multi-agent workflows on heterogeneous large language model (LLM) clusters through semantic routing, output length prediction, and load balancing.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-23T17:01:42.000Z
- Last activity: 2026-03-27T04:50:18.822Z
- Popularity: 75.0
- Keywords: LLM serving, heterogeneous clusters, multi-agent, predictive scheduling, load balancing, latency optimization
- Page link: https://www.zingnex.cn/en/forum/thread/chimera-llm
- Canonical: https://www.zingnex.cn/forum/thread/chimera-llm
- Markdown source: floors_fallback

---


## Problem Background

Multi-agent applications usually execute complex tasks as multi-stage workflows, where each stage is an LLM call, and its output serves as the context for subsequent steps.

Most existing LLM serving systems assume that the cluster is **homogeneous** (identical model replicas). This overlooks the potential of **heterogeneous deployment**: combining models of different sizes and capabilities allows a finer-grained trade-off between latency and task performance.

## Chimera System

The research team proposes **Chimera**, a predictive scheduling system for multi-agent workflows on heterogeneous LLM clusters. It combines the three techniques listed under Core Technologies.

## Core Technologies

1. **Semantic Routing**
   Estimates a confidence score for each model on each request and routes the request to the most suitable model.

2. **Output Length Prediction**
   Predicts the total remaining output length of the workflow, which informs scheduling decisions.

3. **Load Balancing**
   Uses the number of predicted in-flight tokens to estimate how congested each model is.
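
To make the interplay of the three techniques more concrete, here is a minimal Python sketch. It is not Chimera's implementation: the names `ModelState`, `pick_model`, `next_workflow`, the `min_confidence` threshold, the throughput figures, and the shortest-remaining-first ordering are all assumptions made for illustration. The post only states that Chimera uses per-model confidence scores, predicted remaining output length, and in-flight predicted tokens.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical per-model bookkeeping (field names are illustrative)."""
    name: str
    tokens_per_second: float            # assumed decode throughput
    inflight_predicted_tokens: int = 0  # predicted tokens not yet generated


def estimated_wait_seconds(model):
    """Congestion proxy: in-flight predicted tokens divided by throughput."""
    return model.inflight_predicted_tokens / model.tokens_per_second


def pick_model(models, confidence, predicted_tokens, min_confidence=0.7):
    """Route one request.

    confidence maps model name -> score in [0, 1], standing in for the
    semantic router's output. Among models that meet the (assumed)
    threshold, choose the one with the lowest estimated completion time.
    If no model meets it, fall back to the most confident model.
    """
    eligible = [m for m in models if confidence[m.name] >= min_confidence]
    if not eligible:
        eligible = [max(models, key=lambda m: confidence[m.name])]

    def completion_time(m):
        return estimated_wait_seconds(m) + predicted_tokens / m.tokens_per_second

    chosen = min(eligible, key=completion_time)
    # Account for the new request so later decisions see the added load.
    chosen.inflight_predicted_tokens += predicted_tokens
    return chosen


def next_workflow(remaining_tokens):
    """Assumed policy: serve the workflow with the least predicted work left.

    remaining_tokens maps workflow id -> predicted total remaining output
    tokens across all of its remaining stages.
    """
    return min(remaining_tokens, key=remaining_tokens.get)
```

In this sketch, a request that both a small and a large model can handle goes to whichever is expected to finish sooner, while a request the small model is not confident about goes to the large model regardless of load. A real system would also need to decrement `inflight_predicted_tokens` as tokens are generated and correct for prediction errors.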

## Experimental Results

Evaluated on representative agent workflows for code generation and mathematical reasoning, Chimera:
- Reduces end-to-end latency by **1.2-2.4x**
- Improves task performance by **8.0-9.5 percentage points**
- Traces the best latency-performance frontier among the compared systems, which include competitive baselines such as vLLM

## Technical Significance

Chimera's results suggest that heterogeneous LLM clusters can improve both latency and task performance in multi-agent serving, and they point to heterogeneity-aware scheduling as a direction for future LLM serving architectures.

## Resource Links

- Paper: http://arxiv.org/abs/2603.22206v1
