
Claw-Eval: A New Benchmark for Building a Trustworthy Autonomous Agent Evaluation System

Claw-Eval is an end-to-end autonomous agent evaluation suite that addresses three key flaws in existing benchmarks through trajectory-aware scoring, fine-grained safety and robustness testing, and multi-modal task coverage.

Tags: autonomous agents · LLM evaluation · benchmarking · safety evaluation · multi-modal · trajectory-aware · robustness testing
Published 2026-04-08 01:43 · Recent activity 2026-04-08 11:18 · Estimated read 5 min

Section 01

[Introduction] Claw-Eval: A New Benchmark for Building Trustworthy Autonomous Agent Evaluation

Claw-Eval is an end-to-end evaluation suite designed to address three key flaws in existing autonomous agent benchmarks: opaque trajectory scoring mechanisms, unclear definitions for safety and robustness evaluation, and limited modal coverage. Through trajectory-aware scoring, fine-grained safety and robustness testing, and multi-modal task coverage, it provides a new evaluation framework for building trustworthy autonomous agents.


Section 02

Systematic Flaws in Existing Agent Evaluation

Current mainstream evaluations rely on opaque trajectory scoring, which misses up to 44% of safety violations and 13% of robustness failures; safety and robustness evaluations lack standardized definitions, making results hard to compare and reproduce; and task coverage is confined to single modalities or narrow interaction scenarios, failing to reflect the complexity of real-world use.
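
The gap between outcome-only and trajectory-aware scoring can be shown with a minimal sketch; the action names and unsafe-action policy below are hypothetical illustrations, not Claw-Eval's actual schema:

```python
# Hypothetical run: the agent reads a sensitive file mid-task but still
# produces a correct final answer. Action names are illustrative.
trajectory = [
    {"action": "read_file", "arg": "/etc/passwd", "ok": True},  # unsafe step
    {"action": "summarize", "arg": "report.txt", "ok": True},   # correct result
]

# Assumed safety policy, for illustration only.
UNSAFE_ACTIONS = {("read_file", "/etc/passwd")}

def outcome_only_score(traj):
    """Opaque scoring: judge only whether the final step succeeded."""
    return traj[-1]["ok"]

def trajectory_aware_score(traj):
    """Trajectory-aware scoring: any unsafe intermediate step fails the run."""
    for step in traj:
        if (step["action"], step["arg"]) in UNSAFE_ACTIONS:
            return False
    return traj[-1]["ok"]
```

Here `outcome_only_score` returns True while `trajectory_aware_score` returns False, which is exactly the class of mid-run violation an outcome-only grader cannot see.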


Section 03

Core Design Philosophy of Claw-Eval

Claw-Eval is built around task diversity, evidence integrity, and scoring refinement: it includes 300 human-validated tasks (9 categories, 3 groups: general service orchestration, multi-modal perception and generation, multi-turn professional dialogue); records agent actions through three evidence channels: execution trajectories, audit logs, and environment snapshots; and establishes a quantitative system with 2159 fine-grained scoring items.
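
A task record with the three evidence channels might be modeled as below; the field names, category name, and rubric items are assumptions for illustration, not Claw-Eval's actual data format:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceBundle:
    """The three evidence channels recorded per run."""
    trajectory: list = field(default_factory=list)     # ordered agent actions
    audit_log: list = field(default_factory=list)      # tool calls, privileged ops
    env_snapshots: list = field(default_factory=list)  # environment state over time

@dataclass
class Task:
    task_id: str
    category: str  # one of the 9 task categories
    group: str     # one of the 3 groups, e.g. "general service orchestration"
    rubric: list   # fine-grained scoring items (2159 in total across the suite)

# Hypothetical example instance.
task = Task(
    task_id="svc-001",
    category="service_orchestration",
    group="general service orchestration",
    rubric=["goal reached", "no unsafe tool call", "recovers from injected error"],
)
evidence = EvidenceBundle()
evidence.trajectory.append({"action": "call_api", "ok": True})
```

Keeping the three channels separate lets a grader score completion from the trajectory, safety from the audit log, and robustness from environment snapshots independently.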


Section 04

Multi-dimensional Scoring Protocol of Claw-Eval

The scoring protocol evaluates along three dimensions: completion (task goal achievement), safety (compliance with safety norms), and robustness (stability under disturbances). It reports multiple statistical indicators: the average score reflects overall level, Pass@k (success on at least one of k attempts) measures peak capability, and Pass^k (success on all k attempts) reflects consistency and reliability.
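
Under the standard readings of these indicators, the aggregation can be sketched as follows; the task names and outcomes are made up for illustration:

```python
def pass_at_k(attempts, k):
    """Pass@k: solved if ANY of the first k attempts succeeds (peak capability)."""
    return any(attempts[:k])

def pass_hat_k(attempts, k):
    """Pass^k: solved only if ALL of the first k attempts succeed (consistency)."""
    return all(attempts[:k])

# Per-task outcomes over three independent runs; illustrative data.
results = {
    "book_flight":   [True, True, True],     # consistent success
    "parse_invoice": [True, False, True],    # flaky: lifts Pass@3, hurts Pass^3
    "video_qa":      [False, False, False],  # consistent failure
}

avg_pass_at_3 = sum(pass_at_k(r, 3) for r in results.values()) / len(results)
avg_pass_hat_3 = sum(pass_hat_k(r, 3) for r in results.values()) / len(results)
# avg_pass_at_3 ≈ 0.67 while avg_pass_hat_3 ≈ 0.33: a flaky model looks
# capable under Pass@k but unreliable under Pass^k.
```

The spread between the two numbers is itself a useful signal: a large Pass@k/Pass^k gap indicates inconsistency rather than a hard capability ceiling.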


Section 05

Key Findings from Claw-Eval Experiments

Testing 14 cutting-edge models revealed that traditional opaque trajectory evaluation misses nearly half of safety violations and over 10% of robustness issues; error injection mainly degrades consistency (Pass^3 drops by up to 24%) rather than peak capability (Pass@3 remains stable); and multi-modal performance varies significantly: most models perform worse on video understanding than on document and image tasks, and no model dominates all modalities.


Section 06

Practical Guidance from Claw-Eval for Agent Development

Developers need to focus on capability, safety, and reliability simultaneously: establish process monitoring mechanisms instead of just result verification; consider robustness (adversarial testing, disturbance injection) during the design phase; optimize multi-modal processing capabilities for specific scenarios; use fine-grained scoring to identify weaknesses (e.g., strengthen alignment training for low safety scores, improve reasoning consistency for robustness fluctuations).
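
Disturbance injection plus process monitoring of the kind recommended above can be prototyped with a tiny harness; the `tool_result` field and the simulated failure mode are assumptions for illustration, not part of any specific framework:

```python
import random

def inject_disturbance(observation, rng, p=0.3):
    """With probability p, corrupt the observation the agent sees
    (here: simulate a tool failure) to probe robustness."""
    if rng.random() < p:
        return {**observation, "tool_result": None}  # simulated tool failure
    return observation

def run_with_monitoring(agent_step, observations, rng, p=0.3):
    """Process monitoring: log every (possibly disturbed) step,
    not just the final answer."""
    log, state = [], None
    for obs in observations:
        obs = inject_disturbance(obs, rng, p)
        state = agent_step(state, obs)
        log.append({"obs": obs, "state": state})
    return state, log
```

Sweeping `p` upward and comparing Pass^k at each level separates agents that merely reach the goal once from agents that reach it reliably under noise, which is the distinction the error-injection findings above highlight.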


Section 07

Significance and Outlook of Claw-Eval

Claw-Eval is an important advancement in the field of autonomous agent evaluation. It provides a more trustworthy and comprehensive evaluation framework, helps understand the current capability boundaries of agents, and points the way for building more trustworthy and reliable autonomous agents in the future.