# AgentFloor: How Far Can Small Open-Source Models Go on the Tool-Use Capability Ladder?

> AgentFloor is a deterministic benchmark with a six-level capability ladder, evaluating the performance of 16 open-source models (0.27B-32B parameters) and GPT-5 in agent workflows. The study found that small and medium-sized open-source models are sufficient to handle most short-horizon structured tool-use tasks, while long-horizon planning remains a strength of cutting-edge models.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-01T01:25:56.000Z
- Last activity: 2026-05-04T02:55:17.925Z
- Popularity: 79.0
- Keywords: agent systems, tool use, model evaluation, open-source models, GPT-5, hierarchical routing, AI cost optimization, long-horizon planning
- Page link: https://www.zingnex.cn/en/forum/thread/agentfloor
- Canonical: https://www.zingnex.cn/forum/thread/agentfloor
- Markdown source: floors_fallback

---

## Key Findings of the AgentFloor Benchmark: Small Open-Source Models Can Handle Most Tool-Use Tasks; Long-Horizon Planning Still Requires Cutting-Edge Models

AgentFloor is a deterministic benchmark with a six-level capability ladder that evaluates 16 open-source models (0.27B-32B parameters) and GPT-5 in agent workflows. Key findings: small and medium-sized open-source models are sufficient for most short-horizon structured tool-use tasks; the strongest open-source model (32B parameters) matches GPT-5 in aggregate evaluation while being more cost-effective; long-horizon planning remains a strength of cutting-edge models, and even GPT-5 does not achieve strong reliability there. The study recommends a hierarchical routing strategy to optimize the cost of agent systems.

## Cost Dilemma of Agent Systems: Why Do We Need the AgentFloor Benchmark?

Production-grade agent systems provide automated services through multi-step tool calls, but routing every call to a large cutting-edge model (e.g., GPT-5) quickly drives costs beyond budget. Most calls are short, structured routine tasks (such as checking calendars or formatting outputs), which raises a key question: which tasks genuinely require a large model, and which can be handled by a small one? The AgentFloor benchmark was designed to answer this question.

## AgentFloor Benchmark Design: Six-Level Capability Ladder and Evaluation Methods

AgentFloor comprises 30 deterministic tasks, divided into six capability levels:

1. Instruction Following (basic instruction execution)
2. Basic Tool Use (single tool call)
3. Parameterized Tool Call (dynamic parameter construction)
4. Multi-Tool Coordination (collaboration of multiple tools)
5. Multi-Step Planning (complex goal planning)
6. Long-Horizon Constrained Planning (long-time-span planning under constraints)

Every task is scored deterministically (each has a clear, checkable answer). The benchmark covers 16 open-source models and GPT-5, for a total of 16,542 scored runs.
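As an illustration of what deterministic scoring can look like in practice, here is a minimal sketch in which a task is defined by an expected tool-call trace and a run passes only on an exact match. The dataclass fields, tool names, and the level-3 example are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    """A single tool invocation the model is expected to produce."""
    name: str
    args: dict

@dataclass
class Task:
    """Illustrative deterministic task: level, prompt, and the exact expected trace."""
    level: int                                   # 1..6 on the capability ladder
    prompt: str
    expected_calls: list[ToolCall] = field(default_factory=list)
    expected_answer: str | None = None

def score_run(task: Task, produced_calls: list[ToolCall], answer: str | None) -> bool:
    """Deterministic scoring: pass only if the tool-call trace and final answer
    match the expected values exactly."""
    if len(produced_calls) != len(task.expected_calls):
        return False
    for got, want in zip(produced_calls, task.expected_calls):
        if got.name != want.name or got.args != want.args:
            return False
    return task.expected_answer is None or answer == task.expected_answer

# Hypothetical level-3 task (parameterized tool call).
task = Task(
    level=3,
    prompt="Look up the weather for the city mentioned in the user's last message.",
    expected_calls=[ToolCall(name="get_weather", args={"city": "Oslo"})],
)
```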

## Key Finding: The Gap Between Small Model Performance and Long-Horizon Planning

Small models (0.27B-7B parameters) perform reliably on lower-level tasks (levels 1-4). The 32B open-source model matches GPT-5 while being cheaper and faster. In level 6 long-horizon planning tasks, however, cutting-edge models such as GPT-5 retain an advantage: these tasks require maintaining state, tracking constraints, and adjusting plans dynamically, and even GPT-5 does not achieve strong reliability on them.

## Model Scale and Capability: Nonlinear Relationship and Differences in Intervention Effects

Capability boundaries are not determined by scale alone; architecture, training data, and optimization objectives also shape what a model can do. The effects of intervention measures (Chain-of-Thought prompting, few-shot examples, tool description optimization) vary across models, so there is no one-size-fits-all strategy.
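To make "no one-size-fits-all" concrete, the sketch below keeps a per-model table of which interventions to enable when assembling a prompt. The model names and the particular on/off choices are hypothetical placeholders; only the intervention categories come from the article.

```python
# Hypothetical per-model intervention settings; in practice these would be
# chosen from measured per-model results, not hard-coded.
INTERVENTIONS = {
    "small-1.5b": {"chain_of_thought": False, "few_shot": True,  "rich_tool_docs": True},
    "medium-7b":  {"chain_of_thought": True,  "few_shot": True,  "rich_tool_docs": False},
    "large-32b":  {"chain_of_thought": True,  "few_shot": False, "rich_tool_docs": False},
}

def build_prompt(model: str, task_prompt: str, tool_docs: str, examples: list[str]) -> str:
    """Assemble a prompt using only the interventions enabled for this model."""
    cfg = INTERVENTIONS.get(model, {})
    parts = []
    if cfg.get("rich_tool_docs"):
        parts.append(f"Tools available:\n{tool_docs}")   # optimized tool descriptions
    if cfg.get("few_shot"):
        parts.extend(examples)                            # few-shot examples
    parts.append(task_prompt)
    if cfg.get("chain_of_thought"):
        parts.append("Think step by step before calling any tool.")
    return "\n\n".join(parts)
```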

## Implications for Agent System Design: Hierarchical Routing Strategy and Cost Optimization

A hierarchical routing strategy is recommended: small models handle tasks at levels 1-4, medium models handle level 5, and cutting-edge models handle level 6. The architecture includes a router, fast/standard/deep paths, and a fallback mechanism. Costs can be reduced to 20-30% of using GPT-5 exclusively, with comparable or higher success rates.
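Below is a minimal sketch of the tiered router described above, assuming each incoming task carries an estimated capability level (1-6). The tier names, model identifiers, and the `call_model` helper are assumptions for illustration; the article specifies the routing policy, not a concrete API.

```python
from typing import Callable

# Tier table: (path name, model, capability levels it serves). Levels 1-4 go to a
# small open-source model, level 5 to the strongest open-source model, level 6 to
# the frontier model. Model names are hypothetical.
TIERS = [
    ("fast",     "small-1.5b",    range(1, 5)),
    ("standard", "open-32b",      range(5, 6)),
    ("deep",     "gpt-5",         range(6, 7)),
]

def route(level: int) -> tuple[str, str]:
    """Pick the cheapest path whose level range covers the task."""
    for path, model, levels in TIERS:
        if level in levels:
            return path, model
    return "deep", "gpt-5"   # default to the strongest path for unknown levels

def run_task(task: str, level: int,
             call_model: Callable[[str, str], str | None]) -> str:
    """Execute a task on its routed path; escalate to the deep path on failure."""
    path, model = route(level)
    result = call_model(model, task)
    if result is None and path != "deep":   # fallback mechanism
        result = call_model("gpt-5", task)
    return result if result is not None else ""
```

The fallback keeps reliability comparable to an all-frontier setup while most traffic stays on the cheap paths, which is where the quoted 20-30% cost figure comes from.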

## Limitations, Future Directions, and Significance for the Open-Source Ecosystem

Limitations: the task scope is restricted to tool use, the evaluation is static, and model coverage is limited. Future directions include learned dynamic routing, multi-model collaboration, and capability prediction. For the open-source ecosystem, the benchmark provides a reusable evaluation resource, supports AI democratization, and shows that open-source models can compete with commercial models on most tool-use tasks.
