Section 01
Key Findings of the AgentFloor Benchmark: Small Open-Source Models Can Handle Most Tool-Use Tasks; Long-Horizon Planning Still Requires Cutting-Edge Models
AgentFloor is a deterministic benchmark built around a six-level capability ladder, evaluating 16 open-source models (0.27B-32B parameters) and GPT-5 on agent workflows. Key findings: small and medium-sized open-source models are sufficient for most short-horizon structured tool-use tasks; the strongest open-source model (32B parameters) matches GPT-5 in the aggregate evaluation while being more cost-effective; long-horizon planning remains the domain of cutting-edge models, and even GPT-5 falls short of strong reliability there. The study recommends a hierarchical routing strategy to optimize the cost of agent systems.
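To make the routing recommendation concrete, the following is a minimal sketch of what a hierarchical (tiered) routing policy might look like. The tier names, thresholds, and `Task` fields are hypothetical illustrations, not values taken from the AgentFloor report.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Tier(Enum):
    SMALL_OPEN = auto()   # e.g. a small open-source model for cheap tool calls
    LARGE_OPEN = auto()   # e.g. a 32B-class open-source model
    FRONTIER = auto()     # e.g. a cutting-edge model such as GPT-5


@dataclass
class Task:
    description: str
    expected_steps: int          # rough estimate of plan length (hypothetical feature)
    structured_tool_use: bool    # whether the task is a structured tool call


def route(task: Task) -> Tier:
    """Pick the cheapest tier expected to handle the task reliably.

    Thresholds below are illustrative assumptions, not benchmark results.
    """
    if task.expected_steps <= 3 and task.structured_tool_use:
        return Tier.SMALL_OPEN    # short-horizon structured tool use
    if task.expected_steps <= 8:
        return Tier.LARGE_OPEN    # mid-horizon workflows
    return Tier.FRONTIER          # long-horizon planning


if __name__ == "__main__":
    examples = [
        Task("look up weather and emit JSON", 2, True),
        Task("multi-step data pipeline with retries", 6, True),
        Task("plan and execute a 20-step research workflow", 20, False),
    ]
    for t in examples:
        print(f"{t.description!r} -> {route(t).name}")
```

In practice the routing signal could come from a lightweight classifier or from the orchestrator's own plan-length estimate; the point of the sketch is only that escalation to a frontier model is reserved for long-horizon planning.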