Zing Forum

Milo-Bench: A Frozen, Deterministic Longitudinal Evaluation Framework for LLMs

Introducing Milo-Bench—a benchmark suite for fair longitudinal comparison of large language models (LLMs) via frozen test cases, deterministic scoring, and SQLite persistent storage.

Tags: LLM evaluation · benchmarking · deterministic scoring · longitudinal comparison · tool calling · multi-step reasoning · SQLite · open-source tooling
Published 2026-04-13 04:42 · Recent activity 2026-04-13 04:49 · Estimated read: 7 min

Section 01

[Introduction] Milo-Bench: A Frozen, Deterministic Longitudinal Evaluation Framework for Fair LLM Comparison

Milo-Bench is an evaluation suite for large language models (LLMs) that targets three chronic problems in traditional evaluation: unstable test sets, subjective scoring, and the absence of historical tracking. Its core mechanisms are frozen test cases (never modified once locked), deterministic scoring (based on objective check items), and SQLite-backed persistent storage of historical results. Together these enable fair longitudinal comparisons across models and versions, and give developers and researchers a reproducible basis for performance evaluation.


Section 02

[Background] Three Core Pain Points in the LLM Evaluation Field

The current LLM evaluation ecosystem suffers from three significant issues:

  1. Unstable test sets: Most benchmarks continuously update their questions, making results from different points in time incomparable;
  2. Subjective scoring: Manual scoring is costly, and standards are hard to unify;
  3. No historical data: Most tools report only single-run results and cannot track how a model evolves over time.

These problems stem from the tension between two goals: keeping test sets up to date and keeping comparisons fair.


Section 03

[Design Philosophy] Four Core Principles of Milo-Bench

The project design revolves around four keywords:

  1. Freeze: Test cases are never modified once locked; an updated case gets a new ID;
  2. Determinism: No manual scoring; each score is computed as (checks passed / total checks) from pure-function check items that return true/false;
  3. Longitudinal: Results (timestamps, model versions, scores, etc.) are stored in SQLite to support performance-trend tracking;
  4. Self-contained: No reliance on external resources; long texts are generated by deterministic algorithms, and code runs in an isolated environment.
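The determinism principle can be illustrated with a minimal sketch: checks are pure functions on the model's raw output that return a boolean, and the score is simply the pass fraction. The names and check shapes below are illustrative assumptions, not Milo-Bench's actual internals.

```python
from typing import Callable

# A check is a pure function on the model's output, returning True/False.
Check = Callable[[str], bool]

def score(output: str, checks: list[Check]) -> float:
    """Deterministic score: fraction of checks that pass."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(output))
    return passed / len(checks)

# Example checks (hypothetical, for illustration only):
checks = [
    lambda out: "result" in out,            # substring check
    lambda out: out.strip().endswith("}"),  # crude format check
]
print(score('{"result": 42}', checks))  # → 1.0
```

Because each check is a pure function with no randomness or external state, running the same output through the same checks always yields the same score.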

Section 04

[Technical Architecture] Evaluation System Covering Seven Capability Dimensions

Milo-Bench's evaluation system includes seven core categories:

  • Tool calling: Test tool selection, parameter passing, and ability to use tools appropriately;
  • Multi-step reasoning: Simulate workflows and check state consistency (e.g., configuration reading/conversion/writing);
  • Structured output: Generate content that meets format requirements, such as JSON and cron summaries;
  • Long context: Locate key information in massive text (e.g., needle in haystack);
  • Code ability: Verify code quality through programming tasks (e.g., LRU cache, IP parsing);
  • Cost efficiency: Evaluate resource usage such as token consumption and number of tool calls;
  • Agent workflow: Simulate end-to-end complex scenarios (6-15 tool calls).
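The freeze principle pairs naturally with how a test case in any of these categories might be represented. As a sketch (field names are assumptions, not Milo-Bench's real schema), a frozen dataclass makes the "never modify a locked case" rule enforceable in code:

```python
from dataclasses import dataclass

# Hypothetical test-case record; frozen=True rejects any mutation after
# construction, mirroring the rule that a locked case gets a new ID
# instead of an in-place edit.
@dataclass(frozen=True)
class TestCase:
    case_id: str        # stable ID; an updated case gets a new ID
    category: str       # one of the seven capability dimensions
    prompt: str
    checks: tuple = ()  # immutable tuple of check specs

case = TestCase("json-001", "structured_output", "Emit a JSON object")
try:
    case.prompt = "changed"   # mutation is rejected at runtime
except Exception as e:
    print(type(e).__name__)   # → FrozenInstanceError
```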

Section 05

[Implementation Details] Check Mechanism for Deterministic Scoring

Milo-Bench achieves deterministic scoring through a variety of check types:

  • Tool call check: Verify whether a tool is called and whether parameters match (exact/regular expression/substring);
  • Output content check: String inclusion, regular expression matching;
  • JSON validation: Validity, field value/type, array length;
  • Code execution check: Run code and verify test cases;
  • Efficiency check: Monitor token counts and the number of key points covered.

The multi-step test loop runs as: the model calls a tool → the executor returns a mocked response → the loop repeats until completion or timeout, so no real external resources are needed.
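A tool-call check with the three parameter-match modes mentioned above (exact, regex, substring) could be sketched as follows; the function names and call-record shape are assumptions for illustration, not Milo-Bench internals:

```python
import re

def param_matches(actual: str, expected: str, mode: str = "exact") -> bool:
    """Compare a recorded parameter value against the expected pattern."""
    if mode == "exact":
        return actual == expected
    if mode == "regex":
        return re.search(expected, actual) is not None
    if mode == "substring":
        return expected in actual
    raise ValueError(f"unknown match mode: {mode}")

def tool_call_check(calls, tool_name, param, expected, mode="exact"):
    """True if any recorded call to tool_name has a matching parameter."""
    return any(
        c["tool"] == tool_name
        and param_matches(str(c["args"].get(param, "")), expected, mode)
        for c in calls
    )

calls = [{"tool": "read_file", "args": {"path": "/etc/app.conf"}}]
print(tool_call_check(calls, "read_file", "path", r"\.conf$", mode="regex"))  # → True
```

Since every branch is a deterministic comparison, the same transcript of tool calls always produces the same check result.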

Section 06

[Usage Guide] Execution and Report Generation

Milo-Bench provides a flexible command-line interface:

  • Run all evaluations: python bench.py --models all; specify version with --model-version;
  • Grouped execution: Support grouping by local/fast/heavy/cloud models, or filter by category;
  • Historical analysis: Use --compare to view model score trends, --leaderboard to generate rankings;
  • Report generation: HTML format includes visualizations such as rankings, trend charts, bar charts, and latency comparisons.
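The historical analysis behind `--compare` rests on the SQLite store. A minimal sketch of how a score trend could be queried is shown below; the table and column names are assumptions, not Milo-Bench's actual schema:

```python
import sqlite3

# Hypothetical results table: one row per evaluation run.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results (
    model TEXT, model_version TEXT, run_ts TEXT, score REAL)""")
rows = [
    ("demo-7b", "1.0", "2026-03-01T00:00:00", 0.62),
    ("demo-7b", "1.1", "2026-04-01T00:00:00", 0.71),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Score trend for one model, oldest run first:
trend = conn.execute(
    "SELECT model_version, score FROM results "
    "WHERE model = ? ORDER BY run_ts", ("demo-7b",)
).fetchall()
print(trend)  # → [('1.0', 0.62), ('1.1', 0.71)]
```

Because every run is appended with its timestamp and model version, trend charts and leaderboards reduce to ordinary ORDER BY / GROUP BY queries over this table.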

Section 07

[Insights and Recommendations] Value to the LLM Evaluation Ecosystem

Milo-Bench brings insights to the evaluation field:

  1. Stability first: Prioritize high quality and stability of core tests over comprehensiveness;
  2. Reproducibility engineering: Achieve systematic reproducibility through frozen tests and deterministic scoring;
  3. Versioned management: A dual version mechanism (suite_version and spec_version) balances stability and extensibility.

Recommendation for teams: borrow its design ideas (freeze, determinism, persistence, self-containment) to build evaluation solutions suited to your own needs.