Section 01
Introduction: windmill-bench—An Execution-Level Benchmark for AI Agent Workflow Generation
windmill-bench is the first public benchmark for AI agents that generate Windmill workflows. Its core feature is execution-level scoring: each generated workflow is run in a real workflow engine and scored by comparing its outputs. The benchmark aims to address the limitations of existing code-generation benchmarks and to assess more accurately how AI agents actually perform at workflow generation.
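
To make "execution-level scoring" concrete, here is a minimal Python sketch of the general idea: run the generated artifact, then compare what it produced against a reference. It is not windmill-bench's actual harness and does not call any Windmill API. The names `Case`, `score`, and `generated_workflow` are hypothetical, and a plain Python callable stands in for "a workflow executed by the engine". The exact-equality comparison and the pass-rate metric are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Case:
    """One hypothetical benchmark case: workflow inputs plus a reference output."""
    inputs: dict[str, Any]
    expected: Any


def score(run_workflow: Callable[[dict[str, Any]], Any], cases: list[Case]) -> float:
    """Return the fraction of cases whose executed output equals the reference."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        try:
            # Stand-in for submitting the workflow to a real engine.
            actual = run_workflow(case.inputs)
        except Exception:
            # A workflow that fails to execute earns nothing for this case.
            continue
        if actual == case.expected:
            passed += 1
    return passed / len(cases)


# Hypothetical "generated workflow": correct on the first case, wrong on the second.
def generated_workflow(inputs: dict[str, Any]) -> dict[str, Any]:
    return {"total": inputs["a"] + inputs["b"]}


cases = [
    Case({"a": 1, "b": 2}, {"total": 3}),
    Case({"a": 2, "b": 2}, {"total": 5}),  # reference deliberately differs from a + b
]
print(score(generated_workflow, cases))  # prints 0.5
```

The point of the sketch is the contrast with text-level metrics: a workflow only earns credit if it actually runs and yields the right result, no matter how plausible its source looks.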