Zing Forum

AgentFloor: How Far Can Small Open-Source Models Go on the Tool-Use Capability Ladder?

AgentFloor is a deterministic benchmark built around a six-level capability ladder; it evaluates 16 open-source models (0.27B-32B parameters) and GPT-5 on agent workflows. The study finds that small and medium-sized open-source models are sufficient for most short-horizon structured tool-use tasks, while long-horizon planning remains a strength of cutting-edge models.

Agent systems · Tool use · Model evaluation · Open-source models · GPT-5 · Hierarchical routing · AI cost optimization · Long-horizon planning
Published 2026-05-01 09:25 · Recent activity 2026-05-04 10:55 · Estimated read 5 min

Section 01

Key Findings of the AgentFloor Benchmark: Small Open-Source Models Can Handle Most Tool-Use Tasks; Long-Horizon Planning Still Requires Cutting-Edge Models

AgentFloor is a deterministic benchmark built around a six-level capability ladder; it evaluates 16 open-source models (0.27B-32B parameters) and GPT-5 on agent workflows. Key findings: small and medium-sized open-source models suffice for most short-horizon structured tool-use tasks; the strongest open-source model (32B parameters) matches GPT-5 in the aggregate evaluation while being more cost-effective; long-horizon planning remains a strength of cutting-edge models, and even GPT-5 does not achieve strong reliability there. The study recommends a hierarchical routing strategy to optimize the cost of agent systems.


Section 02

Cost Dilemma of Agent Systems: Why Do We Need the AgentFloor Benchmark?

Production-grade agent systems deliver automated services through multi-step tool calls, but routing every call to a large cutting-edge model (e.g., GPT-5) can lead to cost overruns. Most calls are short, structured routine tasks (checking calendars, formatting outputs), which raises a key question: which tasks actually require a large model, and which can a small model handle? The AgentFloor benchmark was designed to answer this.


Section 03

AgentFloor Benchmark Design: Six-Level Capability Ladder and Evaluation Methods

AgentFloor comprises 30 deterministic tasks divided into six capability levels:

1. Instruction Following (basic instruction execution)
2. Basic Tool Use (single tool call)
3. Parameterized Tool Call (dynamic parameter construction)
4. Multi-Tool Coordination (collaboration across multiple tools)
5. Multi-Step Planning (planning toward complex goals)
6. Long-Horizon Constrained Planning (planning over long time spans)

Every task has a deterministic, unambiguous answer. The benchmark assesses 16 open-source models and GPT-5 over a total of 16,542 scored runs.
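The deterministic-evaluation idea above can be sketched as follows. This is an illustrative reconstruction, not the benchmark's actual code: the `Task` type, the level table, and `score_run` are hypothetical names, and the scoring rule (exact match against a single expected answer) is assumed from the "clear answers" description.

```python
from dataclasses import dataclass

# The six capability levels of the ladder, as listed in the text.
LEVELS = {
    1: "Instruction Following",
    2: "Basic Tool Use",
    3: "Parameterized Tool Call",
    4: "Multi-Tool Coordination",
    5: "Multi-Step Planning",
    6: "Long-Horizon Constrained Planning",
}

@dataclass(frozen=True)
class Task:
    task_id: str
    level: int      # 1..6 on the capability ladder
    expected: str   # the single deterministic answer

def score_run(task: Task, model_output: str) -> bool:
    """Deterministic pass/fail: exact match against the expected answer."""
    return model_output.strip() == task.expected.strip()

# A toy level-2 task: the model should return one tool-call result verbatim.
task = Task("calendar-lookup-01", level=2, expected="2026-05-01")
print(score_run(task, "2026-05-01 "))  # True (whitespace is normalized)
print(score_run(task, "May 1, 2026"))  # False (no partial credit)
```

Deterministic scoring of this kind makes runs reproducible and removes judge-model variance, which is what makes a 16,542-run grid feasible.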


Section 04

Key Finding: The Gap Between Small Model Performance and Long-Horizon Planning

Small models (0.27B-7B parameters) perform reliably on the lower-level tasks (levels 1-4), and the 32B open-source model matches GPT-5 while being cheaper and faster. On level-6 long-horizon planning tasks, however, cutting-edge models such as GPT-5 retain an advantage: these tasks require maintaining state, tracking constraints, and adjusting plans dynamically, and even GPT-5 does not achieve strong reliability.
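A minimal sketch of why level-6 tasks are unforgiving, assuming the "maintain state, track constraints" framing above (the harness and the budget example are illustrative, not from the study): every action mutates shared state, and every constraint must hold after every step, so one dropped constraint anywhere in a long sequence fails the whole run.

```python
# Illustrative long-horizon harness: apply steps to mutable state and
# re-check every constraint after each step.
def run_plan(steps, constraints, state):
    """Returns (ok, reason); aborts on the first constraint violation."""
    for step in steps:
        step(state)
        for name, check in constraints.items():
            if not check(state):
                return False, f"violated: {name}"
    return True, "ok"

# Toy example: a budget that must never go negative across the plan.
state = {"budget": 100}
steps = [
    lambda s: s.__setitem__("budget", s["budget"] - 40),  # spend 40 -> 60 left
    lambda s: s.__setitem__("budget", s["budget"] - 70),  # spend 70 -> -10, invalid
]
constraints = {"budget>=0": lambda s: s["budget"] >= 0}

ok, reason = run_plan(steps, constraints, state)
print(ok, reason)  # False violated: budget>=0
```

The failure mode mirrors the finding: success requires the model to anticipate the constraint several steps ahead, not just execute each tool call correctly in isolation.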


Section 05

Model Scale and Capability: Nonlinear Relationship and Differences in Intervention Effects

Capability boundaries are not determined solely by scale; architecture, training data, and optimization objectives also influence capabilities. The effects of intervention measures (Chain-of-Thought prompting, few-shot examples, tool description optimization) vary across models—there is no one-size-fits-all strategy.
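The three interventions named above can be treated as composable prompt transforms and swept per model, which is one way to make the "no one-size-fits-all strategy" finding concrete. This is a hypothetical sketch: the transform functions, the base prompt, and the variant grid are illustrative, not the study's methodology.

```python
# Each intervention is a plain function over the prompt string, so
# combinations can be enumerated and scored per model.
def with_cot(prompt: str) -> str:
    """Chain-of-Thought: ask the model to reason before answering."""
    return prompt + "\nThink step by step before answering."

def with_few_shot(prompt: str, examples: list) -> str:
    """Few-shot: prepend worked examples."""
    return "\n".join(examples) + "\n" + prompt

def with_tool_docs(prompt: str, tool_desc: str) -> str:
    """Tool-description optimization: clearer tool docs in context."""
    return f"Available tool: {tool_desc}\n{prompt}"

base = "Schedule the meeting for the next free slot."
variants = {
    "plain": base,
    "cot": with_cot(base),
    "few_shot+cot": with_cot(with_few_shot(base, ["Example: find a slot, then book it."])),
}
# Each variant would be run against each model and scored deterministically;
# per the study, the winning combination differs from model to model.
```

Because scoring is deterministic, the same variant grid yields a directly comparable per-model table of intervention effects.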


Section 06

Implications for Agent System Design: Hierarchical Routing Strategy and Cost Optimization

A hierarchical routing strategy is recommended: small models handle tasks at levels 1-4, medium models handle level 5, and cutting-edge models handle level 6. The architecture comprises a router, fast/standard/deep paths, and a fallback mechanism. Costs can be reduced to 20-30% of an all-GPT-5 deployment, with comparable or higher success rates.
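The router plus fallback described above can be sketched as follows. This is a minimal sketch under stated assumptions: it assumes the router can already estimate a task's ladder level (level estimation itself is the hard part and is out of scope here), and the tier names are placeholders, not model recommendations.

```python
# Tiers ordered cheapest-first; fallback escalates upward on failure.
TIERS = ["small-open-model", "medium-open-model", "frontier-model"]

def route(level: int) -> str:
    """Map an estimated ladder level to a tier (fast/standard/deep path)."""
    if level <= 4:
        return "small-open-model"   # fast path: levels 1-4
    if level == 5:
        return "medium-open-model"  # standard path: level 5
    return "frontier-model"         # deep path: level 6

def run_with_fallback(level: int, attempt) -> str:
    """Try the routed tier; escalate one tier at a time if it fails.

    `attempt(model) -> bool` stands in for running the task and scoring it.
    """
    start = TIERS.index(route(level))
    for model in TIERS[start:]:
        if attempt(model):
            return model
    return "failed"

# Example: a level-3 task that the small model happens to fail gets
# escalated to the medium tier.
chosen = run_with_fallback(3, lambda m: m != "small-open-model")
print(chosen)  # medium-open-model
```

The cost saving comes from the traffic mix: if most calls sit at levels 1-4, the frontier model is only paid for on the rare level-6 tasks and on escalations.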


Section 07

Limitations, Future Directions, and Significance for the Open-Source Ecosystem

Limitations: the task scope is restricted to tool use, the evaluation is static, and model coverage is limited. Future directions: learned dynamic routing, multi-model collaboration, and capability prediction. Significance for the open-source ecosystem: the benchmark provides a shared resource, promotes AI democratization, and shows that open-source models can compete with commercial ones.