# MBE Protocol: Establishing a Standardized Evaluation System for KV Cache Compression in Large Models

> Matched-Budget Evaluation (MBE) is a standardized fixed-budget reporting protocol and open-source evaluation framework for KV cache compression methods in large language models. It aims to make evaluation results comparable across academia and industry.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T23:44:15.000Z
- Last activity: 2026-06-11T23:49:44.400Z
- Popularity: 146.9
- Keywords: KV cache compression, large language models, evaluation protocol, LLM inference optimization, open-source framework, standardized evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/mbe-kv
- Canonical: https://www.zingnex.cn/forum/thread/mbe-kv

---

## MBE Protocol: Introduction to the Standardized Evaluation System for KV Cache Compression in Large Models

The Matched-Budget Evaluation (MBE) protocol pairs a fixed-budget reporting standard with an open-source evaluation framework for KV cache compression methods in large language models. It targets the fragmentation of the field, where results from different studies are rarely comparable. Its core idea is to compare methods under the same reserved KV memory budget. With fixed budget tiers and a multi-dimensional evaluation matrix, results from different studies can be compared directly.

## Background: The Fragmented Dilemma of KV Cache Compression Evaluation

In LLM inference, the KV cache is a major source of memory consumption, and because it grows linearly with sequence length, it becomes a bottleneck for long contexts. Many compression methods exist, including quantization and pruning, but studies use different models, tasks, and metrics, and some lack systematic measurement altogether. As a result, published numbers cannot be compared directly, and researchers and engineers struggle to choose an appropriate method.
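
To make the linear growth concrete, here is a minimal sketch of the usual back-of-the-envelope estimate for a decoder-only transformer. The function name `kv_cache_bytes` and the model shape used below (32 layers, 8 KV heads, head dimension 128, fp16) are illustrative assumptions, not values taken from MBE.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for all layers."""
    # The leading factor of 2 accounts for the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative GQA-style shape (an assumption): 32 layers, 8 KV heads,
# head_dim 128, fp16 storage (2 bytes per element).
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
```

For this shape each token costs 128 KiB, so the cache is 0.50 GiB at 4,096 tokens and 16 GiB at 131,072 tokens, per sequence in the batch.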

## MBE Core Idea and Standardized Budget Tiers

The core of MBE is to compare methods under the same reserved KV memory budget. It is not a new benchmark but a lightweight reporting layer compatible with existing task suites such as LongBench and GSM8K. It defines fixed budget tiers: B50 (50%), B25 (25%), B12 (12.5%), and the optional B06 (6.25%). Reporting at every tier shows how quality changes as compression becomes more aggressive.
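
The tiers can be read as fractions of a baseline cache size. The sketch below assumes the percentage is taken relative to the uncompressed 16-bit cache and ignores metadata overhead such as quantization scales; both are assumptions, and the helper names `budget_bytes` and `tokens_kept` are hypothetical.

```python
# Tier names and fractions as listed in the protocol description.
BUDGET_TIERS = {"B50": 0.50, "B25": 0.25, "B12": 0.125, "B06": 0.0625}


def budget_bytes(full_cache_bytes, tier):
    """Bytes available at a tier, relative to the uncompressed cache (assumed)."""
    return int(full_cache_bytes * BUDGET_TIERS[tier])


def tokens_kept(seq_len, tier, bits_per_elem=16, baseline_bits=16):
    """Tokens that fit the tier when each cached element uses bits_per_elem bits."""
    budget_bits = seq_len * baseline_bits * BUDGET_TIERS[tier]
    return min(seq_len, int(budget_bits // bits_per_elem))


# At B25, an eviction method storing fp16 entries keeps a quarter of the tokens,
# while a 4-bit quantizer can keep every token inside the same budget.
print(tokens_kept(32_768, "B25"))                   # 8192
print(tokens_kept(32_768, "B25", bits_per_elem=4))  # 32768
```

This is why a matched budget matters: eviction and quantization spend the same bytes in very different ways, so comparing them by "tokens kept" or "bit width" alone is not meaningful.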

## MBE's Comprehensive Evaluation Dimension Matrix

MBE requires reporting multi-dimensional metrics at each budget point:
- Model dimension: Covers 7-8B GQA, 7-14B, and ≥70B models
- Task dimension: Retrieval, aggregation/tracking, instruction following, reasoning, agent/multi-turn tasks
- System dimension: Peak memory, throughput, time to first token (TTFT), maximum batch size, hardware tier
- Method dimension: Deployment prerequisites (training-free/calibration/pretraining), composability
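
One way to picture a single report entry is as a flat record with one field per metric in the matrix above. The `EvaluationCard` class and all of its field names are hypothetical, chosen for illustration; the actual card format may differ, and the values below are placeholders rather than measurements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvaluationCard:
    # Method dimension
    method: str
    prerequisite: str        # "training-free", "calibration", or "pretraining"
    composable: bool
    # Model dimension and budget point
    model: str
    budget_tier: str         # "B50", "B25", "B12", or "B06"
    # Task dimension
    task_category: str
    score: float
    # System dimension
    peak_memory_gib: float
    throughput_tok_s: float
    ttft_ms: float
    max_batch_size: int
    hardware: str


# Placeholder values only; these are not real results.
card = EvaluationCard(
    method="example-method", prerequisite="training-free", composable=True,
    model="example-8b-gqa", budget_tier="B25",
    task_category="retrieval", score=0.0,
    peak_memory_gib=0.0, throughput_tok_s=0.0, ttft_ms=0.0,
    max_batch_size=1, hardware="example-gpu",
)
print(json.dumps(asdict(card), indent=2))
```

Keeping one record per (method, model, budget tier, task) combination makes it straightforward to plot accuracy against budget or to filter by hardware.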

## MBE Open-Source Evaluation Framework Design

MBE provides an adapter-based open-source framework. Researchers only need to implement the `KVCompressor` interface; the framework then handles budget sweeps, task execution, and metric collection automatically. Built-in reference adapters include KIVI (2-bit quantization), H2O (dynamic eviction), SnapKV, StreamingLLM, and PyramidKV, which lowers the barrier to running an evaluation.
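
The adapter pattern might look roughly like the sketch below. Only the name `KVCompressor` comes from the description above; the `compress` method, its signature, and the toy `RecentWindowCompressor` are assumptions made for illustration and are not the framework's actual API or one of its built-in adapters.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple

Vector = Sequence[float]


class KVCompressor(ABC):
    """Hypothetical adapter interface; the real method names may differ."""

    @abstractmethod
    def compress(self, keys: List[Vector], values: List[Vector],
                 budget: float) -> Tuple[List[Vector], List[Vector]]:
        """Return keys and values that fit within `budget` (a fraction in (0, 1])."""


class RecentWindowCompressor(KVCompressor):
    """Toy eviction policy: keep only the most recent tokens that fit the budget."""

    def compress(self, keys, values, budget):
        if not 0.0 < budget <= 1.0:
            raise ValueError("budget must be in (0, 1]")
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        keep = max(1, int(len(keys) * budget))  # always keep at least one token
        return keys[-keep:], values[-keep:]


# Eight cached tokens with 2-dimensional toy vectors.
keys = [[float(i), 0.0] for i in range(8)]
values = [[0.0, float(i)] for i in range(8)]
k, v = RecentWindowCompressor().compress(keys, values, budget=0.25)
print(len(k), k[0])  # 2 [6.0, 0.0]
```

With an interface of this kind, the harness can call every method the same way at every budget tier, which is what makes the sweep automatic.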

## MBE Community Contribution and Quick Start

MBE adopts an open contribution model. Researchers submit evaluation cards through pull requests, and CI updates the leaderboard automatically. Quick start steps:
1. Configure the methods and run parameters in a YAML file
2. Run `run_mbe.py` to generate evaluation cards
3. Render the cards and submit a PR
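
Conceptually, a run configuration expands into one evaluation per combination of method, budget tier, and task. The sketch below shows that expansion with a plain Python dict standing in for the YAML file; the keys, the placeholder names, and the `expand_runs` helper are all hypothetical and do not reflect the real schema or command-line options of `run_mbe.py`.

```python
from itertools import product

# Stand-in for a YAML config; every key and value here is an assumption.
config = {
    "model": "example-8b-gqa",
    "methods": ["kivi", "h2o", "snapkv"],
    "budgets": ["B50", "B25", "B12"],
    "tasks": ["retrieval", "reasoning"],
}


def expand_runs(cfg):
    """List every (method, budget, task) combination as one run description."""
    return [
        {"model": cfg["model"], "method": m, "budget": b, "task": t}
        for m, b, t in product(cfg["methods"], cfg["budgets"], cfg["tasks"])
    ]


runs = expand_runs(config)
print(len(runs))  # 18
print(runs[0])
```

Each run would yield one evaluation card, so three methods at three tiers on two tasks produce 18 cards for a single model.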

## MBE's Significance and Future Outlook

MBE addresses the fragmentation of KV cache compression evaluation and also offers a model for collaborative, community-maintained research. Industry teams can select methods on objective grounds, and academic groups face a lower barrier to running evaluations. As LLM context windows expand, KV compression becomes more important, and MBE is expected to become shared infrastructure for the field, promoting more comparable and reproducible research.
