Section 01
TurtleAI Benchmark Evaluation Reveals Significant Limitations of Multimodal Models in Educational Visual Programming
TurtleAI is presented as the first benchmark to systematically evaluate vision-language models (VLMs) on education-oriented Turtle graphics programming tasks. Even leading models such as GPT-4o achieve a success rate below 30%, and the main bottlenecks are spatial reasoning and precise visual reproduction. The study also proposes a data augmentation strategy that can significantly improve model performance, and its findings offer insights for educational AI applications.
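To make "precise visual reproduction" concrete, here is a minimal Python sketch of what such a task and a strict success check might look like. The benchmark's actual task format and scoring procedure are not described in this section, so everything here is an assumption for illustration: the command tuples, the helper names `trace` and `same_drawing`, and the rule that a drawing counts as correct only if it produces exactly the same set of line segments. The sketch simulates a small subset of Turtle commands without opening a window.

```python
import math

def trace(commands):
    """Simulate a tiny subset of Turtle commands and return the drawn segments.

    Hypothetical helper: starts at the origin facing east (the turtle default)
    and supports only "forward", "left", and "right".
    """
    x, y, heading = 0.0, 0.0, 0.0
    segments = []
    for name, value in commands:
        if name == "forward":
            nx = x + value * math.cos(math.radians(heading))
            ny = y + value * math.sin(math.radians(heading))
            # Round endpoints so tiny floating-point errors do not break comparison.
            segments.append(((round(x, 6), round(y, 6)),
                             (round(nx, 6), round(ny, 6))))
            x, y = nx, ny
        elif name == "left":
            heading = (heading + value) % 360
        elif name == "right":
            heading = (heading - value) % 360
        else:
            raise ValueError(f"unsupported command: {name}")
    return segments

def same_drawing(a, b):
    """Assumed strict check: both traces contain the same undirected segments."""
    def normalize(segs):
        return {tuple(sorted(s)) for s in segs}
    return normalize(a) == normalize(b)

# Target: a 100-unit square drawn counterclockwise from the origin.
square = [("forward", 100), ("left", 90)] * 4

# Different code, same picture: the same square drawn clockwise.
clockwise = [("left", 90)] + [("forward", 100), ("right", 90)] * 4

# Plausible-looking but wrong: turning the other way mirrors the square
# below the x-axis, a typical spatial-reasoning slip.
mirrored = [("forward", 100), ("right", 90)] * 4

print(same_drawing(trace(square), trace(clockwise)))  # True
print(same_drawing(trace(square), trace(mirrored)))   # False
```

Under this assumed check, a program passes only when its output matches the target drawing exactly, not when the code resembles the reference. That is why a single wrong turn direction, as in `mirrored`, counts as a failure even though the shape is still a square.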