Section 01
[Introduction] Milo-Bench: A Frozen, Deterministic Longitudinal Evaluation Framework for Fair LLM Comparison
Milo-Bench is an evaluation suite for large language models (LLMs) that addresses three weaknesses of traditional evaluations: unstable test sets, subjective scoring, and the lack of historical tracking. Its core mechanisms are frozen test cases (never modified once locked), deterministic scoring (each case graded against objective check items), and persistent storage of results in SQLite for tracking history. Together these enable fair longitudinal comparisons across models and versions, and give developers and researchers a reproducible basis for performance evaluation.
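The three mechanisms above can be sketched together in a few lines. This is an illustrative sketch, not the actual Milo-Bench API: the names `TestCase`, `score_case`, and `record_result`, as well as the `results` table schema, are assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: a case cannot be mutated once created
class TestCase:
    case_id: str
    prompt: str
    checks: tuple  # objective check items: (name, predicate) pairs

def score_case(case: TestCase, model_output: str) -> float:
    """Deterministic score: the fraction of check items the output passes."""
    passed = sum(1 for _, predicate in case.checks if predicate(model_output))
    return passed / len(case.checks)

def record_result(db: sqlite3.Connection, model: str,
                  case: TestCase, score: float) -> None:
    """Append one run to a history table for longitudinal comparison."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(model TEXT, case_id TEXT, score REAL, "
        " ts DATETIME DEFAULT CURRENT_TIMESTAMP)"
    )
    db.execute(
        "INSERT INTO results (model, case_id, score) VALUES (?, ?, ?)",
        (model, case.case_id, score),
    )
    db.commit()

# Hypothetical frozen test case with two objective checks.
case = TestCase(
    case_id="arith-001",
    prompt="What is 2 + 2?",
    checks=(
        ("mentions_four", lambda out: "4" in out),
        ("no_hedging", lambda out: "maybe" not in out.lower()),
    ),
)

db = sqlite3.connect(":memory:")  # a real suite would use a file on disk
score = score_case(case, "The answer is 4.")
record_result(db, "model-a-v1", case, score)
row = db.execute("SELECT model, case_id, score FROM results").fetchone()
print(row)  # → ('model-a-v1', 'arith-001', 1.0)
```

Because scoring is a pure function of the frozen case and the model output, re-running the same model on the same locked cases always yields the same scores, and the SQLite history lets two runs recorded months apart be compared directly.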