Section 01
GeoBuildBench: A New Benchmark for Evaluating Executable Geometric Construction from Natural Language
GeoBuildBench is a benchmark that assesses how well large language models (LLMs) and multimodal agents convert natural-language geometry problems into executable domain-specific language (DSL) programs. Existing benchmarks focus on answer correctness or static image understanding; GeoBuildBench instead targets the interactive, constructive nature of geometry. It comprises 489 Chinese textbook-style problems and reveals key limitations of current models, such as structural hallucinations and constraint satisfaction failures.
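To make the idea of "executable construction" concrete, here is a minimal sketch in Python. It is not GeoBuildBench's actual DSL or evaluator: the tuple-based statement format, the `point` and `midpoint` operations, and the helper names `run`, `dist`, and `is_right_angle` are all hypothetical, chosen only for illustration. The example problem ("right triangle ABC with the right angle at A, AB = 4, AC = 3, M the midpoint of BC") is likewise invented. The sketch executes a small construction program and then checks the stated constraints against the resulting coordinates; a failed check would loosely correspond to what the text calls a constraint satisfaction failure.

```python
import math

# Hypothetical mini-DSL (an assumption for illustration, NOT the benchmark's DSL).
# Each statement is a tuple: (operation, new_point_name, *arguments).
program = [
    ("point", "A", 0.0, 0.0),
    ("point", "B", 4.0, 0.0),
    ("point", "C", 0.0, 3.0),
    ("midpoint", "M", "B", "C"),
]


def run(program):
    """Execute the construction and return a dict mapping name -> (x, y)."""
    points = {}
    for stmt in program:
        op, name = stmt[0], stmt[1]
        if op == "point":
            points[name] = (stmt[2], stmt[3])
        elif op == "midpoint":
            (x1, y1), (x2, y2) = points[stmt[2]], points[stmt[3]]
            points[name] = ((x1 + x2) / 2, (y1 + y2) / 2)
        else:
            raise ValueError(f"unknown operation: {op}")
    return points


def dist(points, p, q):
    """Euclidean distance between two named points."""
    (x1, y1), (x2, y2) = points[p], points[q]
    return math.hypot(x2 - x1, y2 - y1)


def is_right_angle(points, vertex, p, q, tol=1e-9):
    """True if the angle p-vertex-q is 90 degrees (dot product near zero)."""
    vx, vy = points[vertex]
    ux, uy = points[p][0] - vx, points[p][1] - vy
    wx, wy = points[q][0] - vx, points[q][1] - vy
    return abs(ux * wx + uy * wy) <= tol


points = run(program)

# Constraints taken from the (invented) problem statement.
checks = {
    "AB = 4": math.isclose(dist(points, "A", "B"), 4.0),
    "AC = 3": math.isclose(dist(points, "A", "C"), 3.0),
    "right angle at A": is_right_angle(points, "A", "B", "C"),
    "M equidistant from B and C": math.isclose(
        dist(points, "M", "B"), dist(points, "M", "C")
    ),
}

for label, ok in checks.items():
    print(f"{label}: {'satisfied' if ok else 'VIOLATED'}")
```

Checking constraints on the executed construction, rather than comparing a final numeric answer, is the design point this sketch is meant to illustrate: a program can run without error and still place a point where the problem statement does not allow it.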