# From Model Scaling to System Scaling: A New Paradigm of Harness Scaling for Agentic AI

> The paper proposes that the next bottleneck for Agentic AI lies in system scaling rather than model scaling. It defines six core components of Agent Harness via the CheetahClaws framework and calls for establishing Harness-level evaluation benchmarks that go beyond task success rates.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-25T17:59:36.000Z
- Last activity: 2026-05-26T04:54:44.787Z
- Popularity: 140.1
- Keywords: Agentic AI, Agent Harness, system scaling, context governance, trustworthy memory, skill routing, CheetahClaws, Agent evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/agentic-aiharness-scaling
- Canonical: https://www.zingnex.cn/forum/thread/agentic-aiharness-scaling

---

## [Introduction] From Model Scaling to System Scaling: A New Paradigm of Harness Scaling for Agentic AI

Core thesis of the paper: The next bottleneck for Agentic AI is system scaling rather than model scaling. It defines six core components of Agent Harness through the CheetahClaws framework and calls for establishing Harness-level evaluation benchmarks beyond task success rates.

Source Information:
- Author Team: SafeRL-Lab (CheetahClaws Development Team)
- Publication Date: May 25, 2026
- Original Link: http://arxiv.org/abs/2605.26112v1
- Source Platform: arXiv
- Original Title: From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

## Background: Evaluation Dilemmas of Agentic AI

In recent years, large models such as GPT-4 and Claude have driven rapid growth in AI Agent technology, but existing evaluation methods face fundamental dilemmas:

1. **Model-centric**: Evaluation focuses only on whether a task succeeds, ignoring key process details such as tool usage, memory management, and context utilization;
2. **Misattributed performance**: Agent performance arises from complex interactions between the model and system components, not from the underlying model's capabilities alone;
3. **Evaluation limitations**: Traditional methods cannot reveal the room for optimization at the system level.

## Core Concept: Six Components of Agent Harness

The paper defines **Agent Harness** as a structured execution layer built around the base model, responsible for transforming the model's native capabilities into actual Agent behaviors. It includes six core components:

1. **Base Model**: The "brain" of the Harness, responsible for reasoning and response generation;
2. **Memory Substrate**: Stores/retrieves cross-cycle information (working memory, long-term memory, etc.);
3. **Context Constructor**: Selects relevant information from memory to build model inputs;
4. **Skill Routing Layer**: Decides tool invocation timing, parameter passing, and result processing;
5. **Orchestration Loop**: The "heart" that coordinates component interactions and defines decision-making processes;
6. **Validation and Governance Layer**: Responsible for security checks, permission management, log auditing, etc.
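
To make the division of labor among the six components concrete, here is a minimal sketch of a harness in Python. It is not the CheetahClaws API: the class name `Harness`, the `CALL`/`FINAL` text protocol, and the stub model `toy_model` are all assumptions made for illustration, and each component is reduced to its simplest possible form.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    # 1. Base model: maps a prompt to a response (stubbed as a plain callable).
    model: Callable[[str], str]
    # 4. Skill routing layer: skill name -> callable tool (doubles as an allow-list).
    skills: Dict[str, Callable[[str], str]]
    # 2. Memory substrate: here just an append-only list of notes.
    memory: List[str] = field(default_factory=list)
    # 6. Validation and governance layer: an audit log of every decision.
    audit_log: List[str] = field(default_factory=list)
    max_steps: int = 5

    def build_context(self, task: str, k: int = 3) -> str:
        # 3. Context constructor: the task plus the k most recent memory entries.
        return "\n".join([f"TASK: {task}", *self.memory[-k:]])

    def run(self, task: str) -> str:
        # 5. Orchestration loop: model -> route -> tool -> memory, until FINAL.
        for step in range(self.max_steps):
            response = self.model(self.build_context(task))
            self.audit_log.append(f"step={step} model={response!r}")
            if response.startswith("FINAL "):
                return response[len("FINAL "):]
            parts = response.split(" ", 2)
            if parts[0] != "CALL" or len(parts) < 2:
                self.memory.append("ERROR: unparseable response")
                continue
            name = parts[1]
            arg = parts[2] if len(parts) > 2 else ""
            if name not in self.skills:
                # Governance: refuse skills that are not registered.
                self.audit_log.append(f"step={step} denied={name!r}")
                self.memory.append(f"ERROR: skill {name} is not permitted")
                continue
            self.memory.append(f"RESULT {name}({arg}) = {self.skills[name](arg)}")
        raise RuntimeError("step budget exhausted")


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: call the tool once, then answer.
    if "RESULT add" in prompt:
        return "FINAL " + prompt.rsplit("= ", 1)[1]
    return "CALL add 2,3"


def add(arg: str) -> str:
    a, b = arg.split(",")
    return str(int(a) + int(b))
```

Calling `Harness(model=toy_model, skills={"add": add}).run("sum two numbers")` takes two loop iterations: one tool call, then a final answer. Because each component sits behind its own field or method, any one of them can be swapped without touching the others.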

## Three Bottlenecks of Harness Scaling and CheetahClaws Reference Implementation

### Three Core Bottlenecks
1. **Context Governance**: Information filtering, priority management, and dynamic adjustment under limited windows;
2. **Trustworthy Memory**: Memory accuracy, consistency, traceability, and forgetting strategies;
3. **Dynamic Skill Routing**: Tool selection, parameter filling, error recovery, and combination optimization.
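
As an illustration of the first bottleneck, context governance under a limited window can be sketched as a selection problem: keep the highest-priority items that fit a token budget. The greedy strategy, the `ContextItem` type, and the word-count token estimate below are assumptions for illustration, not the method described in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContextItem:
    text: str
    priority: float  # higher means more important (hypothetical score)


def estimate_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def select_context(items: List[ContextItem], budget: int) -> List[ContextItem]:
    """Greedily keep the highest-priority items that fit within the budget."""
    chosen = []
    used = 0
    # sorted() is stable, so equal priorities keep their original order.
    for idx, item in sorted(enumerate(items), key=lambda pair: -pair[1].priority):
        cost = estimate_tokens(item.text)
        if used + cost <= budget:
            chosen.append(idx)
            used += cost
    # Restore the original order so the prompt still reads chronologically.
    return [items[i] for i in sorted(chosen)]
```

A real harness would replace `estimate_tokens` with the tokenizer of its base model and would recompute priorities as the task evolves, which is the "dynamic adjustment" part of the bottleneck.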

### CheetahClaws Reference Implementation
- Design principles: modular, auditable, persistent, and verifiable;
- Comparison with existing frameworks: the six components are clearly separated, full trajectories are recorded, and the framework is open source by design, in contrast to the closed-source Claude Code and the partially open-source OpenClaw.
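
One way to read "auditable, persistent, verifiable" together with "complete trajectory recording" is an append-only log in which every record commits to the one before it. The sketch below uses a SHA-256 hash chain stored as JSON Lines. This is an assumed design for illustration only; the summary does not say how CheetahClaws stores trajectories, and the function names here are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
    payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_event(log: List[Dict[str, Any]], event: Dict[str, Any]) -> None:
    """Append an event whose hash covers both the event and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited, removed, or reordered record breaks it."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["event"]):
            return False
        prev_hash = record["hash"]
    return True


def save(log: List[Dict[str, Any]], path: str) -> None:
    # Persist as JSON Lines: one record per line.
    with open(path, "w", encoding="utf-8") as fh:
        for record in log:
            fh.write(json.dumps(record, sort_keys=True) + "\n")


def load(path: str) -> List[Dict[str, Any]]:
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

The chain only detects tampering after the fact. It does not prevent someone from rewriting the whole file, so a production system would also need to anchor the latest hash somewhere outside the log.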

## Harness-Level Evaluation: A New Paradigm Beyond Task Success Rates

The paper calls for establishing **Harness-level evaluation benchmarks**, with new dimensions including:

1. Trajectory Quality (execution path efficiency);
2. Memory Hygiene (memory management quality);
3. Context Efficiency (window utilization optimization);
4. Communication Fidelity (tool interaction accuracy);
5. Validation Cost (behavior verification overhead);
6. Security Evolution (behavior predictability).

**Importance**: Distinguishes between Agents that "complete via trial and error" and those that "complete efficiently", supporting cost optimization, safety-critical applications, and long-term deployment needs.
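
The contrast between completing "via trial and error" and completing "efficiently" can be shown with a small trajectory summary. The three measures below (model call count, tool error rate, total prompt tokens) are illustrative proxies chosen for this sketch, not the paper's metric definitions, and the `Step` and `Report` types are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    kind: str               # "model" or "tool"
    ok: bool = True         # for tool steps: did the call succeed?
    prompt_tokens: int = 0  # for model steps: size of the prompt sent


@dataclass(frozen=True)
class Report:
    success: bool
    model_calls: int
    tool_error_rate: float
    prompt_tokens: int


def summarize(steps: List[Step], success: bool) -> Report:
    """Reduce a trajectory to a few process-level numbers."""
    model_steps = [s for s in steps if s.kind == "model"]
    tool_steps = [s for s in steps if s.kind == "tool"]
    failed = sum(1 for s in tool_steps if not s.ok)
    error_rate = failed / len(tool_steps) if tool_steps else 0.0
    total_tokens = sum(s.prompt_tokens for s in model_steps)
    return Report(success, len(model_steps), error_rate, total_tokens)


# Two hypothetical trajectories that both end in success.
efficient = [Step("model", prompt_tokens=200), Step("tool"),
             Step("model", prompt_tokens=250)]
trial_and_error = [Step("model", prompt_tokens=200), Step("tool", ok=False),
                   Step("model", prompt_tokens=300), Step("tool", ok=False),
                   Step("model", prompt_tokens=400), Step("tool"),
                   Step("model", prompt_tokens=500)]
```

Both trajectories score identically on task success, yet `summarize` reports 2 model calls and 450 prompt tokens for the first versus 4 calls and 1400 tokens for the second, which is the kind of difference a success-rate benchmark hides.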

## Technical Insights and Practical Recommendations

### Technical Insights
Progress in Agentic AI depends on balancing system design with model capability. The model is a necessary condition, but the Harness determines whether its potential is actually realized. System components such as context construction and memory management also have independent research value.

### Practical Recommendations
1. **Separation of Concerns**: Clarify component interfaces and responsibilities to support independent optimization;
2. **Invest in Observability**: Record logs of model calls, memory operations, tool sequences, etc.;
3. **Establish Evaluation Pipelines**: Measure metrics such as model call count, context efficiency, and memory accuracy;
4. **Consider Long-Term Characteristics**: Design strategies for memory growth management and context drift correction.
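
For the fourth recommendation, memory growth can be bounded with an explicit eviction policy. The sketch below uses least-recently-used eviction and records a provenance string on every entry. Both choices are assumptions made for illustration; the summary does not specify a forgetting strategy, and `BoundedMemory` is a hypothetical name.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryEntry:
    value: str
    source: str  # provenance: where this fact came from


class BoundedMemory:
    """Key-value memory with a fixed capacity and least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._entries: "OrderedDict[str, MemoryEntry]" = OrderedDict()
        self.evicted: List[str] = []  # keys dropped so far, kept for auditing

    def write(self, key: str, value: str, source: str) -> None:
        self._entries[key] = MemoryEntry(value, source)
        self._entries.move_to_end(key)  # mark as most recently used
        while len(self._entries) > self._capacity:
            old_key, _ = self._entries.popitem(last=False)  # drop the oldest
            self.evicted.append(old_key)

    def read(self, key: str) -> Optional[MemoryEntry]:
        entry = self._entries.get(key)
        if entry is not None:
            self._entries.move_to_end(key)  # reading refreshes recency
        return entry
```

Keeping the `evicted` list makes forgetting itself observable, so an evaluation pipeline can check whether the harness discarded something it later needed.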

## Limitations and Future Research Directions

### Current Limitations
1. CheetahClaws, as a prototype, has not been verified in large-scale production environments;
2. Specific metrics and test sets for Harness-level evaluation are still under development;
3. Generalization across domains (programming, dialogue, data analysis) remains insufficient.

### Future Directions
1. **Adaptive Harness**: Dynamically adjust configurations to match task characteristics;
2. **Multi-Agent Collaboration**: Design Harness coordination mechanisms across Agents;
3. **Human-Agent Collaboration Harness**: Design Agent systems that support human intervention.
