Section 01
[Introduction] Harness-Bench: A Diagnostic Benchmark for Evaluating the Impact of System-Level Configurations on LLM Agent Workflow Performance
Harness-Bench is a diagnostic benchmark that measures how system-level (harness) configurations affect the performance of large language model (LLM) agents on real-world workflows. Using 106 offline sandboxed tasks, it shows that the combination of model and system-level configuration significantly affects completion rate, process quality, efficiency, and failure behavior. The benchmark addresses a gap in evaluating system-level configurations and emphasizes that agent capability is a joint function of the model and its system-level configuration.