Section 01
[Introduction] Claw-Eval: A New Benchmark for Trustworthy Evaluation of Autonomous Agents
Claw-Eval is an end-to-end evaluation suite designed to address three flaws in existing autonomous agent benchmarks: opaque trajectory scoring, poorly defined safety and robustness evaluation, and limited modality coverage. It responds to each in turn with trajectory-aware scoring, fine-grained safety and robustness testing, and multimodal task coverage, offering an evaluation framework for building trustworthy autonomous agents.