Zing Forum

Scepsy: An Aggregated LLM Service System for Multi-Agent Workflows

Scepsy optimizes GPU resource allocation by building an aggregated LLM pipeline and exploiting the stability of each model's share of execution time, achieving a 2.4x throughput increase and a 27x latency reduction in real-world multi-agent workflows.

Agent Workflows · LLM Serving Systems · GPU Scheduling · Resource Optimization · Aggregated Pipeline
Published 2026-04-17 00:15 · Recent activity 2026-04-17 10:17 · Estimated read 6 min

Section 01

[Main Post/Introduction] Scepsy: Core Highlights of the Aggregated LLM Service System for Multi-Agent Workflows

Scepsy is an aggregated LLM serving system for multi-agent workflows. At its core, it builds an aggregated LLM pipeline and optimizes GPU resource allocation by exploiting the stability of each model's share of execution time, achieving a 2.4x throughput increase and a 27x latency reduction in real-world multi-agent workflows.


Section 02

Background and Challenges: Three Core Difficulties in Deploying Agent Workflows

With the evolution of LLM capabilities, agent workflows have become the mainstream paradigm for handling complex tasks, but deployment faces three major challenges:

  1. Highly uncertain execution paths make end-to-end latency difficult to predict;
  2. Multiple LLM calls lead to over-subscription of GPU resources;
  3. Large semantic differences between agent frameworks (e.g., LangChain, AutoGPT) make it hard to design general scheduling strategies.

Existing systems mostly focus on single-model optimization or rely on manual configuration, and cannot cope with this dynamism and complexity.
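The first two challenges can be illustrated with a minimal sketch (all names and the workflow shape are made up for illustration, not taken from the paper): because the execution path branches on model output, the number of LLM calls per request, and hence the end-to-end latency and GPU demand, varies from run to run.

```python
import random

# Hypothetical multi-agent workflow: a planner, a coder, and a critic.
# The critic's verdict decides whether an extra "fix" call is made, so
# the total number of LLM calls is data-dependent and unpredictable.
def run_workflow(task: str, call_llm) -> list[str]:
    trace = [call_llm("planner", task)]
    for step in range(3):
        result = call_llm("coder", f"step {step}")
        trace.append(result)
        if call_llm("critic", result) == "retry":  # data-dependent branch
            trace.append(call_llm("coder", "fix"))
    return trace

# Stub LLM whose critic sometimes requests a retry, mimicking path uncertainty.
def stub_llm(model: str, prompt: str) -> str:
    if model == "critic":
        return random.choice(["ok", "retry"])
    return f"{model}:{prompt}"

random.seed(0)
lengths = {len(run_workflow("demo", stub_llm)) for _ in range(20)}
print(sorted(lengths))  # typically several distinct trace lengths
```

The spread of trace lengths is exactly what makes per-workflow latency prediction hard, while every extra call adds load on whichever GPU hosts that model.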


Section 03

Core Insights and System Architecture: Scepsy's Design Approach

Scepsy's key insight: Although the end-to-end latency of a single workflow is hard to predict, the execution time proportion of each LLM is relatively stable. Based on this, two core abstractions are introduced:

  1. Aggregated LLM Pipeline: a lightweight latency/throughput predictor that quickly estimates performance under candidate resource configurations;
  2. Hierarchical Heuristic Scheduler: maps optimal configurations onto GPU clusters, minimizing resource fragmentation while meeting network constraints.

Deployment proceeds in three phases:

  • Performance Profiling: offline analysis of each LLM's performance characteristics at different parallelism levels;
  • Configuration Search: efficient search for optimal configurations in the three-dimensional space of fractional GPU shares, tensor parallelism degree, and replica count;
  • Cluster Placement: a hierarchical (node → rack) strategy maps configurations onto the physical cluster, balancing performance and resource efficiency.
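The configuration-search phase can be sketched as follows. This is a toy model under stated assumptions, not Scepsy's actual predictor or policy: the time proportions, the search grid, and the throughput formula are all invented for illustration. The key idea it demonstrates is that each model's stable share of execution time lets a simple bottleneck score rank whole-pipeline configurations.

```python
from itertools import product

# Assumed stable per-model share of end-to-end execution time (the insight).
TIME_PROPORTION = {"planner": 0.2, "coder": 0.6, "critic": 0.2}

GPU_SHARES = [0.25, 0.5, 1.0]   # fractional GPU per replica (illustrative grid)
TP_DEGREES = [1, 2]             # tensor parallelism degree
REPLICAS = [1, 2, 4]
GPU_BUDGET = 8.0

def predicted_throughput(share: float, tp: int, replicas: int) -> float:
    # Toy predictor: scales with replicas and share, sublinearly with TP.
    return replicas * share * (tp ** 0.8)

def search() -> dict:
    best, best_score = None, -1.0
    # One (share, tp, replicas) triple per model: a 3-D space per LLM.
    for combo in product(product(GPU_SHARES, TP_DEGREES, REPLICAS),
                         repeat=len(TIME_PROPORTION)):
        cfg = dict(zip(TIME_PROPORTION, combo))
        # GPUs used = fractional share x TP degree x replica count, per model.
        gpus = sum(s * tp * r for s, tp, r in cfg.values())
        if gpus > GPU_BUDGET:
            continue
        # Pipeline throughput is limited by the model whose capacity is
        # smallest relative to its share of the work (bottleneck score).
        score = min(predicted_throughput(*c) / TIME_PROPORTION[m]
                    for m, c in cfg.items())
        if score > best_score:
            best, best_score = cfg, score
    return best

best = search()
print(best)
```

Exhaustive enumeration is feasible here (a few thousand combinations); a real system would prune the space, but the bottleneck-scoring structure is the same.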

Section 04

Experimental Evidence: Performance Improvements in Real-World Scenarios

Evaluations in real-world agent workflow scenarios such as code generation, multi-turn dialogue, and tool calling show:

  • Compared to traditional methods that optimize each model independently, Scepsy achieves up to a 2.4x throughput increase (by identifying critical paths and allocating them more resources);
  • Compared to manually configured deployments, it achieves up to a 27x latency reduction (avoiding the guesswork of manual configuration);
  • It requires no workflow code changes and imposes no framework restrictions, ensuring generality.
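The "allocating more resources to the critical path" idea can be sketched with a simple proportional heuristic. This is a plausible illustration, not the paper's exact allocation policy: give each model a slice of the GPU budget proportional to its stable share of workflow execution time.

```python
# Proportional allocation sketch (illustrative heuristic, not Scepsy's
# actual algorithm): models on the critical path, i.e. those with the
# largest time share, receive the largest GPU allocation.
def allocate(time_proportion: dict[str, float], gpu_budget: float) -> dict[str, float]:
    total = sum(time_proportion.values())
    return {m: gpu_budget * p / total for m, p in time_proportion.items()}

alloc = allocate({"planner": 0.2, "coder": 0.6, "critic": 0.2}, 8.0)
print(alloc)  # the coder, dominating execution time, gets the largest share
```

A uniform split would give every model 2.67 GPUs and starve the bottleneck; the proportional split is why workload-aware allocation beats manual, per-model configuration.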

Section 05

Technical Significance and Industry Impact

Scepsy marks a shift in LLM serving systems from single-model optimization to multi-model collaborative optimization. Its workload-aware design philosophy (using workload characteristics to guide resource decisions) points a direction for the development of AI infrastructure. For developers and enterprises, there is no need to over-reserve GPU resources or tune manually: the system finds a near-optimal configuration automatically, letting teams focus on application logic and reducing deployment cost and complexity.


Section 06

Summary and Outlook

Scepsy solves the serving challenges of multi-LLM agent workflows through aggregated LLM pipelines and hierarchical scheduling. Its core contribution is using the stability of execution time proportions to transform end-to-end optimization into component-level optimization. Future directions include handling more complex workflows (dozens of collaborating LLMs) and adjusting configurations online to adapt to load changes.