windmill-bench: An Execution-Level Benchmark for AI Agent Workflow Generation

The first public benchmark for AI agents that generate Windmill workflows. It achieves execution-level scoring by running generated workflows in a real workflow engine and comparing their outputs.

Tags: AI agents · benchmarking · workflow generation · Windmill · code generation · execution-level scoring · AgentClash
Published 2026-05-09 03:14 · Recent activity 2026-05-09 03:18 · Estimated read: 5 min

Section 01

Introduction: windmill-bench—An Execution-Level Benchmark for AI Agent Workflow Generation

windmill-bench is the first public benchmark for AI agents that generate Windmill workflows. Its core feature is execution-level scoring: generated workflows are run in a real workflow engine and their outputs are compared against reference outputs. The goal is to address the limitations of existing code-generation benchmarks and to measure more accurately how well AI agents actually perform at workflow generation.


Section 02

Background: Dilemmas in Existing Code Generation Evaluation

As large language models have improved at code generation, the limitations of existing benchmarks have become apparent: HumanEval and similar suites focus on single-function generation; AppWorld and its peers test API/UI task completion; WorFBench and related benchmarks emphasize the downstream performance of operator graphs. Moreover, generated code is rarely executed in a real workflow runtime; evaluation relies instead on static analysis or LLM-as-judge, which fails to capture actual runtime behavior.


Section 03

Core Concept: Execution-Level Scoring and Selection of Windmill Engine

windmill-bench proposes an execution-level scoring scheme: AI-generated code is executed in a real workflow engine and its outputs are compared against references. The Windmill engine was chosen because it is an open-source, production-grade platform with a complete feature set (workspace state management, typed resources, key/secret management, a third-party module Hub, multi-language script execution), which makes it a strong substrate for evaluating AI workflow-generation capability.
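
To make the idea concrete, below is a minimal sketch of the execute-and-compare step against a running Windmill instance. It assumes a local server, a workspace named `bench`, and an API token; the endpoint paths and payload shapes are modeled on Windmill's public REST API (flow preview run plus job-result polling) but are assumptions in this sketch and should be checked against the Windmill documentation rather than read as the benchmark's actual scorer code.

```python
import time
import requests

WINDMILL_URL = "http://localhost:8000"  # assumed local Windmill instance
WORKSPACE = "bench"                     # hypothetical benchmark workspace
TOKEN = "..."                           # Windmill API token (placeholder)


def run_flow_preview(flow_value: dict, args: dict) -> str:
    """Submit a generated flow for execution and return its job id.

    The endpoint path mirrors Windmill's preview-flow API but is an
    assumption in this sketch, not a verified contract.
    """
    resp = requests.post(
        f"{WINDMILL_URL}/api/w/{WORKSPACE}/jobs/run/preview_flow",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"value": flow_value, "args": args},
    )
    resp.raise_for_status()
    return resp.text.strip().strip('"')  # job uuid returned as plain text


def wait_for_result(job_id: str, timeout_s: float = 120.0) -> dict:
    """Poll until the job completes, then return its result payload."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(
            f"{WINDMILL_URL}/api/w/{WORKSPACE}/jobs_u/completed/get_result_maybe/{job_id}",
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("completed"):
            return body
        time.sleep(1.0)
    raise TimeoutError(f"job {job_id} did not finish within {timeout_s}s")


def execution_level_score(flow_value: dict, args: dict, reference_output) -> float:
    """Run the generated flow and compare its output with the reference output."""
    job_id = run_flow_preview(flow_value, args)
    outcome = wait_for_result(job_id)
    # `success` / `result` field names are assumptions about the response shape.
    if not outcome.get("success", True):
        return 0.0
    return 1.0 if outcome.get("result") == reference_output else 0.0
```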


Section 04

Benchmark Design: Three-Level Difficulty Task System

windmill-bench divides its test tasks into three difficulty levels: Simple (a 2-step linear flow, testing basic flow generation and syntax correctness); Medium (a 3-step flow, introducing branch logic and Hub script lookup); Hard (a 4-step flow, including parallel/loop structures and typed resource inputs).
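
For a rough sense of what a Simple-level target looks like, here is a two-step linear flow expressed as a Python dict. It loosely follows Windmill's OpenFlow schema (a module list, `rawscript` steps, `input_transforms`), but the exact field names are illustrative and not guaranteed to match what windmill-bench ships.

```python
# Hypothetical Simple-level reference flow: step "a" adds 1 to the input,
# step "b" doubles step a's result. Loosely modeled on Windmill's OpenFlow
# schema; field names may not match the real schema exactly.
simple_flow = {
    "summary": "Add one, then double",
    "value": {
        "modules": [
            {
                "id": "a",
                "value": {
                    "type": "rawscript",
                    "language": "python3",
                    "content": "def main(n: int):\n    return n + 1\n",
                    "input_transforms": {
                        # take `n` from the flow's top-level input
                        "n": {"type": "javascript", "expr": "flow_input.n"},
                    },
                },
            },
            {
                "id": "b",
                "value": {
                    "type": "rawscript",
                    "language": "python3",
                    "content": "def main(x: int):\n    return x * 2\n",
                    "input_transforms": {
                        # feed step a's result into step b
                        "x": {"type": "javascript", "expr": "results.a"},
                    },
                },
            },
        ],
    },
    "schema": {
        "type": "object",
        "properties": {"n": {"type": "integer"}},
        "required": ["n"],
    },
}
```

Medium and Hard tasks would extend this shape with branch modules, loop/parallel modules, Hub script references, and typed resource inputs.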


Section 05

Scoring Dimensions: Comprehensive Multi-Dimensional Evaluation

The project defines a multi-dimensional scoring system: Parsing Validity (the code can be correctly parsed by Windmill); Grounding / No Hallucination (no fabricated resources or scripts are referenced); Execution Success Rate (the workflow executes without errors); Output Match (execution results are consistent with the reference outputs).
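
The post does not spell out how the four dimensions are combined, so the following is only a hypothetical sketch of a scorer; the weights, field names, and the gating order (parse before execute) are assumptions, not the project's actual scoring logic.

```python
from dataclasses import dataclass


@dataclass
class ScoreReport:
    parsing_valid: bool   # flow definition accepted by Windmill's parser/validator
    grounded: bool        # no references to nonexistent resources or Hub scripts
    executed_ok: bool     # flow ran to completion without errors
    output_match: float   # similarity of execution output to the reference, in [0, 1]

    def total(self) -> float:
        """Hypothetical weighting: hard gates first, then output quality."""
        if not self.parsing_valid:
            return 0.0
        score = 0.25                  # parses correctly
        if self.grounded:
            score += 0.25             # every referenced resource/script exists
        if self.executed_ok:
            score += 0.25             # execution finished without error
            score += 0.25 * self.output_match
        return score


# Example: a flow that parses, is grounded, executes, and matches the reference exactly.
report = ScoreReport(parsing_valid=True, grounded=True, executed_ok=True, output_match=1.0)
assert report.total() == 1.0
```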


Section 06

Technical Implementation: Deep Integration with AgentClash Platform

windmill-bench is integrated into the AgentClash platform as a challenge package. Its core components are: an E2B Sandbox Template (a pre-configured Docker image that provides a reproducible, isolated environment); Task Definitions (each with a natural-language description, a reference workflow definition, and validation data); a Scorer (the scoring logic for the four dimensions above); and a Runner (glue code that drives the model through the tasks).
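
As an illustration of how a task definition and the runner could fit together, here is a hedged sketch; `Task`, `run_challenge`, and the callables passed into it are placeholder names, since the post does not describe the actual AgentClash challenge interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Task:
    """One benchmark task, mirroring the components listed above (field names are illustrative)."""
    description: str        # natural-language instruction given to the agent
    reference_flow: dict    # reference workflow definition
    inputs: dict            # validation inputs the flow is executed with
    expected_output: Any    # reference output for those inputs
    difficulty: str         # "simple" | "medium" | "hard"


def run_challenge(
    task: Task,
    generate_flow: Callable[[str], dict],       # model-driving half of the runner
    score_flow: Callable[[dict, Task], float],  # scorer covering the four dimensions
) -> float:
    """Glue code: prompt the model for a flow definition, then score it."""
    candidate = generate_flow(task.description)
    return score_flow(candidate, task)


# Usage sketch with trivial stand-ins for the model and the scorer.
if __name__ == "__main__":
    dummy = Task(
        description="Build a two-step flow that adds 1 to n, then doubles it.",
        reference_flow={}, inputs={"n": 3}, expected_output=8, difficulty="simple",
    )
    score = run_challenge(
        dummy,
        generate_flow=lambda prompt: {"value": {"modules": []}},  # placeholder "model"
        score_flow=lambda flow, task: 0.0,                        # placeholder scorer
    )
    print(score)
```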


Section 07

Project Status and Roadmap

The project is currently in a pre-v0 phase (architecture design and scaffolding, not yet runnable). Roadmap: v1 focuses on single-shot generation (no multi-round debugging); v2 may add multi-round refinement driven by execution error feedback. Differences from Windmill's official tests: windmill-bench is fully public, harder, scored at the execution level, and aimed at a public benchmark leaderboard rather than CI gating.