Zing Forum

Milo-Bench: A Frozen, Deterministic Longitudinal Evaluation Framework for LLMs

Introducing Milo-Bench—a benchmark suite for fair longitudinal comparison of large language models (LLMs) via frozen test cases, deterministic scoring, and SQLite persistent storage.

Tags: LLM evaluation · benchmarking · deterministic scoring · longitudinal comparison · tool calling · multi-step reasoning · SQLite · open-source tooling
Published 2026-04-13 04:42 · Recent activity 2026-04-13 04:49 · Estimated read: 7 min

Section 01

[Introduction] Milo-Bench: A Frozen, Deterministic Longitudinal Evaluation Framework for Fair LLM Comparison

Milo-Bench is an evaluation suite for large language models (LLMs) that targets three chronic problems in traditional evaluation: unstable test sets, subjective scoring, and the absence of historical tracking. Its core mechanisms are frozen test cases (never modified once locked), deterministic scoring (based on objective check items), and SQLite-backed persistent storage of historical results. Together these enable fair longitudinal comparisons across models and versions, and give developers and researchers a reproducible basis for performance evaluation.


Section 02

[Background] Three Core Pain Points in the LLM Evaluation Field

The current LLM evaluation ecosystem suffers from three significant issues:

  1. Unstable test sets: Most benchmarks continuously update their questions, making results from different points in time incomparable;
  2. Subjective scoring: Manual scoring is costly, and standards are hard to unify;
  3. No historical data: Most tools report only single-run results and cannot track how a model evolves over time.

These problems stem from the tension between two goals: keeping test sets up to date and keeping comparisons fair.


Section 03

[Design Philosophy] Four Core Principles of Milo-Bench

The project design revolves around four keywords:

  1. Freeze: Test cases are never modified once locked; an updated case gets a new ID;
  2. Determinism: No manual scoring; each score is computed as (checks passed / total checks) from pure-function check items that return true/false;
  3. Longitudinal: Results (timestamps, model versions, scores, etc.) are stored in SQLite to support performance-trend tracking;
  4. Self-contained: No reliance on external resources; long texts are generated by deterministic algorithms, and code runs in an isolated environment.
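The determinism principle can be illustrated with a minimal sketch: checks are pure functions on the model's raw output that return a boolean, and the score is simply the pass fraction. The names and check shapes below are illustrative assumptions, not Milo-Bench's actual internals.

```python
from typing import Callable

# A check is a pure function on the model's output, returning True/False.
Check = Callable[[str], bool]

def score(output: str, checks: list[Check]) -> float:
    """Deterministic score: fraction of checks that pass."""
    if not checks:
        return 0.0
    passed = sum(1 for check in checks if check(output))
    return passed / len(checks)

# Example checks (hypothetical, for illustration only):
checks = [
    lambda out: "result" in out,            # substring check
    lambda out: out.strip().endswith("}"),  # crude format check
]
print(score('{"result": 42}', checks))  # → 1.0
```

Because each check is a pure function with no randomness or external state, running the same output through the same checks always yields the same score.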

Section 04

[Technical Architecture] Evaluation System Covering Seven Capability Dimensions

Milo-Bench's evaluation system includes seven core categories:

  • Tool calling: Test tool selection, parameter passing, and ability to use tools appropriately;
  • Multi-step reasoning: Simulate workflows and check state consistency (e.g., configuration reading/conversion/writing);
  • Structured output: Generate content that meets format requirements, such as JSON and cron summaries;
  • Long context: Locate key information in massive text (e.g., needle in haystack);
  • Code ability: Verify code quality through programming tasks (e.g., LRU cache, IP parsing);
  • Cost efficiency: Evaluate resource usage such as token consumption and number of tool calls;
  • Agent workflow: Simulate end-to-end complex scenarios (6-15 tool calls).
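The freeze principle pairs naturally with how a test case in any of these categories might be represented. As a sketch (field names are assumptions, not Milo-Bench's real schema), a frozen dataclass makes the "never modify a locked case" rule enforceable in code:

```python
from dataclasses import dataclass

# Hypothetical test-case record; frozen=True rejects any mutation after
# construction, mirroring the rule that a locked case gets a new ID
# instead of an in-place edit.
@dataclass(frozen=True)
class TestCase:
    case_id: str        # stable ID; an updated case gets a new ID
    category: str       # one of the seven capability dimensions
    prompt: str
    checks: tuple = ()  # immutable tuple of check specs

case = TestCase("json-001", "structured_output", "Emit a JSON object")
try:
    case.prompt = "changed"   # mutation is rejected at runtime
except Exception as e:
    print(type(e).__name__)   # → FrozenInstanceError
```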

Section 05

[Implementation Details] Check Mechanism for Deterministic Scoring

Milo-Bench achieves deterministic scoring through a variety of check types:

  • Tool call check: Verify whether a tool is called and whether parameters match (exact/regular expression/substring);
  • Output content check: String inclusion, regular expression matching;
  • JSON validation: Validity, field value/type, array length;
  • Code execution check: Run code and verify test cases;
  • Efficiency check: Monitor token counts and the number of key points covered.

The multi-step test loop runs as: the model calls a tool → the executor returns a mocked response → the loop repeats until completion or timeout, so no real external resources are needed.
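A tool-call check with the three parameter-match modes mentioned above (exact, regex, substring) could be sketched as follows; the function names and call-record shape are assumptions for illustration, not Milo-Bench internals:

```python
import re

def param_matches(actual: str, expected: str, mode: str = "exact") -> bool:
    """Compare a recorded parameter value against the expected pattern."""
    if mode == "exact":
        return actual == expected
    if mode == "regex":
        return re.search(expected, actual) is not None
    if mode == "substring":
        return expected in actual
    raise ValueError(f"unknown match mode: {mode}")

def tool_call_check(calls, tool_name, param, expected, mode="exact"):
    """True if any recorded call to tool_name has a matching parameter."""
    return any(
        c["tool"] == tool_name
        and param_matches(str(c["args"].get(param, "")), expected, mode)
        for c in calls
    )

calls = [{"tool": "read_file", "args": {"path": "/etc/app.conf"}}]
print(tool_call_check(calls, "read_file", "path", r"\.conf$", mode="regex"))  # → True
```

Since every branch is a deterministic comparison, the same transcript of tool calls always produces the same check result.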

Section 06

[Usage Guide] Execution and Report Generation

Milo-Bench provides a flexible command-line interface:

  • Run all evaluations: python bench.py --models all; specify version with --model-version;
  • Grouped execution: Support grouping by local/fast/heavy/cloud models, or filter by category;
  • Historical analysis: Use --compare to view model score trends, --leaderboard to generate rankings;
  • Report generation: HTML format includes visualizations such as rankings, trend charts, bar charts, and latency comparisons.
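The historical analysis behind `--compare` rests on the SQLite store. A minimal sketch of how a score trend could be queried is shown below; the table and column names are assumptions, not Milo-Bench's actual schema:

```python
import sqlite3

# Hypothetical results table: one row per evaluation run.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE results (
    model TEXT, model_version TEXT, run_ts TEXT, score REAL)""")
rows = [
    ("demo-7b", "1.0", "2026-03-01T00:00:00", 0.62),
    ("demo-7b", "1.1", "2026-04-01T00:00:00", 0.71),
]
conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)

# Score trend for one model, oldest run first:
trend = conn.execute(
    "SELECT model_version, score FROM results "
    "WHERE model = ? ORDER BY run_ts", ("demo-7b",)
).fetchall()
print(trend)  # → [('1.0', 0.62), ('1.1', 0.71)]
```

Because every run is appended with its timestamp and model version, trend charts and leaderboards reduce to ordinary ORDER BY / GROUP BY queries over this table.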

Section 07

[Insights and Recommendations] Value to the LLM Evaluation Ecosystem

Milo-Bench brings insights to the evaluation field:

  1. Stability first: Prioritize high quality and stability of core tests over comprehensiveness;
  2. Reproducibility engineering: Achieve systematic reproducibility through frozen tests and deterministic scoring;
  3. Versioned management: A dual version mechanism (suite_version and spec_version) balances stability and extensibility.

Recommendation for teams: borrow its design ideas (freeze, determinism, persistence, self-containment) to build evaluation solutions suited to your own needs.