# windmill-bench: An Execution-Level Benchmark for AI Agent Workflow Generation

> The first public benchmark for AI agents generating Windmill workflows, scored at the execution level: generated results are run in a real workflow engine and their outputs compared against references.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-08T19:14:56.000Z
- Last activity: 2026-05-08T19:18:06.998Z
- Heat: 148.9
- Keywords: AI agents, benchmarking, workflow generation, Windmill, code generation, execution-level scoring, AgentClash
- Page URL: https://www.zingnex.cn/en/forum/thread/windmill-bench-ai
- Canonical: https://www.zingnex.cn/forum/thread/windmill-bench-ai
- Markdown source: floors_fallback

---

## Introduction: windmill-bench—An Execution-Level Benchmark for AI Agent Workflow Generation

windmill-bench is the first public benchmark for AI agents generating Windmill workflows. Its core feature is execution-level scoring: generated workflows are executed in a real workflow engine and their outputs compared against references. It aims to address the limitations of existing code generation benchmarks and to assess more accurately how AI agents actually perform at workflow generation.

## Background: Dilemmas in Existing Code Generation Evaluation

As large language models have improved at code generation, the limitations of existing benchmarks have become apparent: HumanEval and similar suites focus on single-function generation; AppWorld and related benchmarks test API/UI task completion; WorFBench and its kin emphasize the downstream performance of operator graphs. Moreover, generated code is rarely executed in a real workflow runtime; evaluation instead relies on static analysis or LLM-as-judge, which fails to capture actual runtime behavior.

## Core Concept: Execution-Level Scoring and Selection of Windmill Engine

windmill-bench proposes an execution-level scoring scheme: AI-generated code is executed in a real workflow engine and its outputs are compared against references. The Windmill engine was chosen because it is an open-source, production-grade platform with a complete feature set (workspace state management, typed resources, secret management, a third-party script Hub, and multi-language script execution), making it a strong testbed for evaluating AI workflow generation.
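The scheme above can be sketched as a small harness. This is an illustrative assumption, not windmill-bench's actual code: `run_flow` is a stand-in for whatever call submits a flow to the Windmill engine, and the comparison is a simple structural JSON match.

```python
import json


def outputs_match(actual, expected) -> bool:
    """Compare engine outputs structurally, ignoring dict key order."""
    return json.dumps(actual, sort_keys=True) == json.dumps(expected, sort_keys=True)


def score_execution(run_flow, generated_flow: dict, test_input: dict, expected: dict) -> dict:
    """Execution-level scoring sketch: run the generated flow in a real
    engine (via the caller-supplied `run_flow`) and compare its output
    to the reference. Names and signatures here are hypothetical."""
    try:
        result = run_flow(generated_flow, test_input)  # real engine call
    except Exception as exc:
        # A crashing flow fails both execution and output matching.
        return {"executed": False, "output_match": False, "error": str(exc)}
    return {"executed": True, "output_match": outputs_match(result, expected), "error": None}
```

The key design point is that "output matching" is only meaningful after a real execution, which is exactly what static analysis and LLM-as-judge approaches skip.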

## Benchmark Design: Three-Level Difficulty Task System

windmill-bench divides test tasks into three difficulty levels:

- **Simple**: 2-step linear flow; tests basic flow generation and syntactic correctness.
- **Medium**: 3-step flow; introduces branch logic and Hub script lookup.
- **Hard**: 4-step flow; includes parallel/loop structures and typed resource inputs.
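For concreteness, a Simple-tier reference flow might look like the sketch below, expressed as a Python dict loosely modeled on Windmill's OpenFlow shape (`modules` with inline `rawscript` steps). The field names and step contents are illustrative assumptions, not an actual windmill-bench fixture.

```python
# Hedged sketch of a Simple-tier (2-step linear) flow definition.
simple_flow = {
    "summary": "Double a number, then format it",
    "value": {
        "modules": [
            {
                "id": "a",
                "value": {
                    "type": "rawscript",
                    "language": "python3",
                    # Step a: double the numeric input.
                    "content": "def main(n: int):\n    return n * 2",
                },
            },
            {
                "id": "b",
                "value": {
                    "type": "rawscript",
                    "language": "python3",
                    # Step b: format step a's output as a string.
                    "content": "def main(x: int):\n    return f'result={x}'",
                },
            },
        ]
    },
}

# A Simple task only exercises linear chaining: step b consumes step a's output.
assert len(simple_flow["value"]["modules"]) == 2
```

Medium and Hard tiers would add branch nodes, Hub script references, and parallel/loop modules on top of this basic shape.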

## Scoring Dimensions: Comprehensive Multi-Dimensional Evaluation

The project designs a multi-dimensional scoring system:

- **Parsing Validity**: the generated code is parsed correctly by Windmill.
- **No-Hallucination Grounding**: no fictional resources or scripts are referenced.
- **Execution Success**: the workflow executes without errors.
- **Output Matching**: execution results are consistent with the reference outputs.
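One plausible way to combine the four dimensions is to gate later ones on earlier ones: an unparseable flow cannot meaningfully be scored on execution or output matching. The weights and field names below are illustrative assumptions; the source does not specify how windmill-bench aggregates its dimensions.

```python
from dataclasses import dataclass


@dataclass
class DimensionScores:
    parsing_valid: bool   # flow parsed by Windmill without errors
    grounded: bool        # no hallucinated resources or Hub scripts
    executed: bool        # workflow ran to completion
    output_match: float   # 0.0-1.0 similarity to the reference output


def aggregate(s: DimensionScores) -> float:
    """Gated aggregation sketch: each dimension contributes 0.25,
    but execution and output matching require a parseable flow,
    and output matching requires successful execution."""
    if not s.parsing_valid:
        return 0.0
    score = 0.25
    if s.grounded:
        score += 0.25
    if s.executed:
        score += 0.25
        score += 0.25 * s.output_match
    return score
```

Under this scheme a fully correct submission scores 1.0, while a syntactically invalid one scores 0.0 regardless of the other dimensions.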

## Technical Implementation: Deep Integration with AgentClash Platform

windmill-bench is packaged as a challenge for the AgentClash platform. Its core components are:

- **E2B Sandbox Template**: a pre-configured Docker image that ensures repeatable, isolated environments.
- **Task Definition**: natural-language description, reference workflow definition, and validation data.
- **Scorer**: scoring logic for the four dimensions.
- **Runner**: glue code that drives the model through the task.

## Project Status and Roadmap

The project is currently pre-v0 (architecture design and scaffolding; not yet runnable). Roadmap: v1 focuses on single-shot generation (no multi-round debugging); v2 may add multi-round refinement driven by execution error feedback. Compared with Windmill's official tests, windmill-bench is publicly open, harder, scored at the execution level, and aims at a public benchmark leaderboard rather than CI gating.
