# Scepsy: An Aggregated LLM Service System for Multi-Agent Workflows

> Scepsy optimizes GPU resource allocation by building an aggregated LLM pipeline and leveraging the stability of the execution time proportion of each model, achieving a 2.4x throughput increase and a 27x latency reduction in real-world multi-agent workflows.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-16T16:15:29.000Z
- Last activity: 2026-04-17T02:17:56.313Z
- Popularity: 126.0
- Keywords: agent workflows, LLM serving system, GPU scheduling, resource optimization, aggregated pipeline
- Page URL: https://www.zingnex.cn/en/forum/thread/scepsy-llm
- Canonical: https://www.zingnex.cn/forum/thread/scepsy-llm
- Markdown source: floors_fallback

---

## [Main Post/Introduction] Scepsy: Core Highlights of the Aggregated LLM Service System for Multi-Agent Workflows

Scepsy is an aggregated LLM service system for multi-agent workflows. Its core idea is to build an aggregated LLM pipeline and exploit the stability of each model's share of total execution time to optimize GPU resource allocation, achieving up to a 2.4x throughput increase and a 27x latency reduction in real-world multi-agent workflows.

## Background and Challenges: Three Core Difficulties in Deploying Agent Workflows

With the evolution of LLM capabilities, agent workflows have become the mainstream paradigm for handling complex tasks, but deploying them faces three major challenges:
1. Highly uncertain execution paths make end-to-end latency difficult to predict;
2. Multiple LLM calls lead to over-subscription of GPU resources;
3. Large semantic differences between agent frameworks (e.g., LangChain, AutoGPT) make it hard to design general scheduling strategies.
Existing systems mostly focus on single-model optimization or rely on manual configuration, and cannot cope with this dynamism and complexity.

## Core Insights and System Architecture: Scepsy's Design Approach

Scepsy's key insight: although the end-to-end latency of a single workflow is hard to predict, the proportion of execution time each LLM accounts for is relatively stable. Based on this, two core abstractions are introduced:
1. Aggregated LLM Pipeline: a lightweight latency/throughput predictor that quickly estimates performance under candidate resource configurations;
2. Hierarchical Heuristic Scheduler: maps the chosen configurations onto GPU clusters, minimizing resource fragmentation while satisfying network constraints.
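The stable-proportion insight behind the predictor can be illustrated with a minimal sketch (all model names, latencies, and shares below are hypothetical, not taken from the paper): if a model's profiled latency is t seconds and it stably accounts for share p of end-to-end time, total latency is roughly t / p, so per-model profiles alone suffice to estimate whole-workflow performance.

```python
# Hypothetical sketch of the stable-proportion insight.

# Offline-profiled per-model latencies (seconds per call) under one
# candidate resource configuration; values are illustrative only.
profiled_latency = {"planner": 0.8, "coder": 2.4, "reviewer": 0.8}

# Observed average share of end-to-end time each model accounts for
# (the proportions Scepsy assumes are stable across requests).
time_share = {"planner": 0.2, "coder": 0.6, "reviewer": 0.2}

def predict_e2e_latency(model: str) -> float:
    """Estimate end-to-end workflow latency from one model's profile:
    if a model takes t seconds and accounts for share p of the total,
    the total is roughly t / p."""
    return profiled_latency[model] / time_share[model]

# If the shares really are stable, every model's profile yields a
# consistent estimate of end-to-end latency.
for m in profiled_latency:
    print(m, predict_e2e_latency(m))
```

Here every model gives the same estimate (4.0 s), which is what lets component-level profiling stand in for end-to-end measurement.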
System deployment is divided into three phases:
- Performance Profiling: Offline analysis of performance characteristics of each LLM under different parallelism levels;
- Configuration Search: Efficiently search for optimal configurations in the three-dimensional space of fractional GPU shares, tensor parallelism, and replica count;
- Cluster Placement: A hierarchical strategy (node → rack) maps configurations onto the physical cluster, balancing performance and resource efficiency.
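The configuration-search phase can be sketched as a brute-force enumeration of the three-dimensional space of fractional GPU share, tensor-parallel degree, and replica count (the cost model, candidate values, and function names below are illustrative assumptions, not Scepsy's actual predictor):

```python
from itertools import product

def predicted_latency(gpu_share: float, tp: int, replicas: int) -> float:
    # Toy stand-in for the lightweight predictor: more GPU share and
    # tensor parallelism cut latency; extra replicas are modeled as
    # mild queueing relief (they mainly help throughput).
    base = 4.0
    return base / (gpu_share * tp) / (1 + 0.1 * (replicas - 1))

def gpus_used(gpu_share: float, tp: int, replicas: int) -> float:
    # Total GPU consumption of one model's configuration.
    return gpu_share * tp * replicas

def search(latency_slo: float, gpu_budget: float):
    """Return the cheapest (cost, share, tp, replicas) meeting the SLO,
    or None if no candidate fits within the GPU budget."""
    best = None
    for share, tp, reps in product((0.25, 0.5, 1.0), (1, 2, 4), (1, 2, 4)):
        cost = gpus_used(share, tp, reps)
        if cost > gpu_budget:
            continue
        if predicted_latency(share, tp, reps) > latency_slo:
            continue
        if best is None or cost < best[0]:
            best = (cost, share, tp, reps)
    return best

print(search(latency_slo=2.0, gpu_budget=8.0))
```

Minimizing GPU cost subject to a latency target mirrors the goal stated above: meet performance requirements while avoiding over-subscription; a real system would prune this space rather than enumerate it exhaustively.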

## Experimental Evidence: Performance Improvements in Real-World Scenarios

Evaluations on real-world agent workflows such as code generation, multi-turn dialogue, and tool calling show:
- Compared to traditional methods that optimize each model independently, Scepsy achieves up to a 2.4x throughput increase (by identifying the critical path and allocating it more resources);
- Compared to manually configured systems, it achieves up to a 27x latency reduction (avoiding the guesswork of manual configuration);
- It requires no workflow code changes and imposes no framework restrictions, ensuring generality.

## Technical Significance and Industry Impact

Scepsy marks the shift of LLM service systems from single-model optimization to multi-model collaborative optimization. Its workload-aware design philosophy (using workload characteristics to guide resource decisions) points a direction for AI infrastructure. For developers and enterprises, this means no need to reserve large amounts of GPU capacity or perform manual tuning: the system automatically finds a near-optimal configuration, letting teams focus on application logic while reducing deployment cost and complexity.

## Summary and Outlook

Scepsy addresses the serving challenges of multi-LLM agent workflows through aggregated LLM pipelines and hierarchical scheduling. Its core contribution is using the stability of execution-time proportions to turn end-to-end optimization into component-level optimization. Future directions include supporting more complex workflows (dozens of collaborating LLMs) and adjusting configurations online to adapt to load changes.
