# OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models

> The first verbatim memorization detection benchmark for LLMs that covers all three training stages (pre-training, mid-training, and post-training), spanning 9 domains, models of multiple sizes, and a comparison of 12 detection methods.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-11T11:07:23.000Z
- Last activity: 2026-06-11T11:19:41.958Z
- Popularity: 161.8
- Keywords: LLM, data contamination, data leakage, benchmark, OLMo, memorization detection, RAG, model evaluation, AI safety
- Page link: https://www.zingnex.cn/en/forum/thread/olmo-detect
- Canonical: https://www.zingnex.cn/forum/thread/olmo-detect

---

## Original Authors and Source

- **Original Author/Maintainer**: LuckyDaydreamer
- **Source Platform**: GitHub
- **Original Title**: OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models
- **Original Link**: https://github.com/LuckyDaydreamer/OLMo-Detect
- **Publication Date**: June 11, 2026
- **Related Paper**: Submitted to EMNLP 2026

---

## Research Background and Problem Definition

Data contamination (also called data leakage) has long been a core challenge in evaluating Large Language Models (LLMs). When training data overlaps with test data, a model may simply recite answers it has memorized rather than solve the problem, which inflates its measured performance. The effect is particularly severe in academic benchmarks and code generation evaluation.

However, existing contamination detection studies have several key limitations:

1. **Incomplete stage coverage**: Most studies focus only on pre-training and ignore contamination introduced during mid-training and post-training.
2. **Single domain**: Systematic cross-domain evaluation is lacking.
3. **Inconsistent evaluation criteria**: Different studies use different contamination definitions and detection thresholds, making results difficult to compare.

OLMo-Detect is a comprehensive benchmark suite designed to address these issues.

---

## Full Coverage of Three Stages

OLMo-Detect is built on the OLMo 2 training pipeline and covers all three stages of modern LLM training:

#### Pre-training Stage
Includes four core data sources:
- **DCLM-Baseline**: Large-scale web text corpus
- **peS2o**: Academic paper dataset
- **OpenWebMath**: Math-specific corpus
- **StarCoder**: Code dataset (Assembly and Java subsets)

#### Mid-training Stage
- **GSM8K**: Math reasoning dataset
- **StackExchange**: Q&A community data

#### Post-training Stage
- **SFT (Supervised Fine-Tuning)**: A mix of Aya and WildChat data
- **DPO (Direct Preference Optimization)**: Preference data
- **RLVR (Reinforcement Learning with Verifiable Rewards)**: Tasks with verifiable rewards (GSM, MATH, IFEval)
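
The stage layout above can be written down as a small lookup table. This is only an illustrative sketch: the names `STAGE_SOURCES` and `stage_of` are hypothetical and are not taken from the OLMo-Detect repository.

```python
# Hypothetical mapping of training stage -> data sources, copied from the lists above.
STAGE_SOURCES = {
    "pre-training": ["DCLM-Baseline", "peS2o", "OpenWebMath", "StarCoder"],
    "mid-training": ["GSM8K", "StackExchange"],
    "post-training": ["SFT", "DPO", "RLVR"],
}


def stage_of(source: str) -> str:
    """Return the training stage that a given data source belongs to."""
    for stage, sources in STAGE_SOURCES.items():
        if source in sources:
            return stage
    raise KeyError(f"unknown data source: {source}")


if __name__ == "__main__":
    for stage, sources in STAGE_SOURCES.items():
        print(f"{stage}: {', '.join(sources)}")
```

Counting each source once gives nine entries, which lines up with the nine domains mentioned in the summary, assuming each source counts as one domain.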

## Data Alignment Strategies

The benchmark provides two data splitting methods:

**Matched**: Contaminated and non-contaminated data are explicitly aligned along three dimensions: text quality, time range, and lexical similarity. This keeps the evaluation of detection methods fair by removing interference from differences in data distribution.

**Shifted**: Contaminated data is sampled without aligning its distribution to the non-contaminated data, which tests how robust detection methods are under distribution shift.
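
To make the lexical-similarity part of the Matched split more concrete, here is a minimal sketch of one way such alignment could be done: greedily pairing each contaminated text with the most similar unused clean text by Jaccard overlap of word sets. The source does not describe the actual matching procedure, so the functions `jaccard` and `match_controls` and the choice of Jaccard similarity are assumptions for illustration only. Quality and time-range alignment are not covered here.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def match_controls(contaminated: list[str], clean_pool: list[str]) -> list[tuple[str, str]]:
    """Greedily pair each contaminated text with its most similar unused clean text."""
    remaining = list(clean_pool)
    pairs = []
    for text in contaminated:
        if not remaining:
            break  # ran out of clean candidates
        best = max(remaining, key=lambda candidate: jaccard(text, candidate))
        remaining.remove(best)
        pairs.append((text, best))
    return pairs


if __name__ == "__main__":
    seen = ["the cat sat on the mat"]
    pool = ["stock prices fell sharply", "the dog sat on the mat"]
    print(match_controls(seen, pool))
```

A Shifted split, by contrast, would skip this pairing step and sample the two sets independently.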

---

## Core Detection Methods

OLMo-Detect implements 12 contamination detection methods spanning several families of techniques:

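#### Likelihood-Based Methods

- **Perplexity**: Low perplexity suggests the model is familiar with the text, which may indicate it was seen during training
- **Zlib Compression Ratio**: Calibrates the model's likelihood of a text against the text's zlib-compressed size, so that text which is merely repetitive or easy to compress is less likely to be flagged
- **Lowercase Perplexity**: Compares the perplexity of the original text with that of its lowercased version
- **Min-K% / Min-K%++**: Scores a text by the average log-likelihood of its k% least likely tokens; Min-K%++ additionally normalizes each token's log-likelihood against the model's next-token distribution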
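
The likelihood-based scores above can all be computed from per-token log-probabilities. The sketch below follows the commonly published formulations of these scores and is not taken from the OLMo-Detect code; the function names are hypothetical, and the inputs (`token_logprobs`, plus the per-position mean and standard deviation of log-probabilities over the vocabulary for Min-K%++) are assumed to have been extracted from a language model beforehand.

```python
import math
import zlib


def perplexity(token_logprobs):
    """Exponentiated mean negative log-likelihood; lower suggests a more familiar text."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_ratio(token_logprobs, text):
    """Total negative log-likelihood divided by the zlib-compressed size in bytes."""
    compressed_size = len(zlib.compress(text.encode("utf-8")))
    return -sum(token_logprobs) / compressed_size


def lowercase_ratio(token_logprobs, lowercased_logprobs):
    """Mean NLL of the original text divided by mean NLL of its lowercased version.

    One possible formulation (an assumption); both inputs come from the same model.
    """
    original = -sum(token_logprobs) / len(token_logprobs)
    lowercased = -sum(lowercased_logprobs) / len(lowercased_logprobs)
    return original / lowercased


def min_k(token_logprobs, k=0.2):
    """Mean log-probability of the k fraction of least likely tokens; higher suggests familiarity."""
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n


def min_k_plus_plus(token_logprobs, vocab_means, vocab_stds, k=0.2):
    """Min-K% applied to z-scored log-probs: (log p - mean) / std at each position."""
    z_scores = [
        (lp - mu) / sd
        for lp, mu, sd in zip(token_logprobs, vocab_means, vocab_stds)
    ]
    return min_k(z_scores, k)


if __name__ == "__main__":
    logprobs = [-0.1, -0.2, -3.0, -0.1, -0.6]  # made-up values for illustration
    print(perplexity(logprobs), min_k(logprobs, k=0.4))
```

Each function returns a raw score; turning it into a contaminated/clean decision still requires a threshold or a ranking-based metric, which is outside this sketch.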

#### Retrieval-Based Methods

- **Recall**: Based on n-gram recall rate
- **Neighborhood Attack**: Compares the model's loss on a text with its loss on slightly perturbed neighbor texts
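
The neighborhood idea can be sketched as a difference of losses: if the model assigns a much lower loss to the original text than to close variants of it, the original is more likely to have been seen in training. This is a minimal sketch under assumptions: `loss_fn` is a hypothetical callable that returns the model's loss for a string, and the neighbor texts are assumed to be supplied by the caller (the published attack generates them with word substitutions). It is not the OLMo-Detect implementation.

```python
from statistics import mean
from typing import Callable, Sequence


def neighborhood_score(
    text: str,
    neighbors: Sequence[str],
    loss_fn: Callable[[str], float],
) -> float:
    """Loss on the original text minus the mean loss on its neighbors.

    More negative values mean the original is unusually easy for the model
    compared with similar texts, which hints at memorization.
    """
    return loss_fn(text) - mean(loss_fn(n) for n in neighbors)


if __name__ == "__main__":
    # Toy losses standing in for a real model, for illustration only.
    toy_losses = {"a b c": 1.0, "a x c": 3.0, "a b y": 2.0}
    print(neighborhood_score("a b c", ["a x c", "a b y"], toy_losses.__getitem__))
```

Using a difference rather than the raw loss controls for how intrinsically easy the text is, since the neighbors share most of its content.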
