
OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models

Presented as the first verbatim memorization detection benchmark for LLMs to cover all three training stages (pre-training, mid-training, and post-training). It spans 9 domains, evaluates models of multiple sizes, and compares 12 detection methods.

Tags: LLM, data contamination, data leakage, benchmark, OLMo, memorization detection, RAG, model evaluation, AI safety

Published 2026-06-11 19:07 | Recent activity 2026-06-11 19:19 | Estimated read 6 min

Section 01

Introduction: OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models

Section 02

Original Authors and Source

Original Author/Maintainer: LuckyDaydreamer
Source Platform: GitHub
Original Title: OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models
Original Link: https://github.com/LuckyDaydreamer/OLMo-Detect
Publication Date: June 11, 2026
Related Paper: Submitted to EMNLP 2026

Section 03

Research Background and Problem Definition

"Data contamination", also called "data leakage", has long been a core challenge in evaluating Large Language Models (LLMs). When training data overlaps with test data, a model may not truly "understand" a problem but merely "recite" answers it has memorized. The problem is particularly severe in academic benchmarks and code generation evaluation.

However, existing contamination detection studies have several key limitations:

Incomplete stage coverage: Most studies only focus on the pre-training stage, ignoring data contamination in mid-training and post-training stages.
Single domain: Lack of systematic cross-domain evaluation.
Inconsistent evaluation criteria: Different studies use different contamination definitions and detection thresholds, making results difficult to compare.

OLMo-Detect is a comprehensive benchmark suite designed to address these issues.

Section 04

Full Coverage of Three Stages

OLMo-Detect is built on the OLMo 2 training pipeline and covers all three stages of modern LLM training:

Pre-training Stage

Includes four core data sources:

DCLM-Baseline: Large-scale web text corpus
peS2o: Academic paper dataset
OpenWebMath: Math-specific corpus
StarCoder: Code dataset (Assembly and Java subsets)

Mid-training Stage

GSM8K: Math reasoning dataset
StackExchange: Q&A community data

Post-training Stage

SFT (Supervised Fine-Tuning): A mix of Aya and WildChat data
DPO (Direct Preference Optimization): Preference optimization data
RLVR (Reinforcement Learning with Verifiable Rewards): Data with verifiable rewards (GSM, MATH, IFEval)

Section 05

Data Alignment Strategies

The benchmark provides two data splitting methods:

Matched: Contaminated and non-contaminated data are explicitly aligned along three dimensions: text quality, time range, and lexical similarity. This keeps the comparison of detection methods fair, because a method cannot score well simply by picking up on distribution differences between the two sets.

Shifted: Contaminated data is sampled without aligning its distribution to the non-contaminated data. This split tests how robust detection methods are under distribution shift.
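As a rough illustration of the lexical-similarity dimension of the Matched split, the sketch below greedily pairs each contaminated text with the most similar unused non-contaminated text. It is an assumption-laden toy, not the benchmark's procedure: the source does not describe how alignment is done, and the names `jaccard` and `match_pairs`, the word-level Jaccard measure, and the greedy strategy are all hypothetical. Text quality and time range are not modeled here.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercased word sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def match_pairs(contaminated, clean_pool):
    """Greedily pair each contaminated text with its most similar unused clean text.

    Hypothetical helper: only lexical similarity is considered.
    """
    remaining = list(clean_pool)
    pairs = []
    for text in contaminated:
        if not remaining:
            break  # ran out of clean candidates
        best = max(remaining, key=lambda c: jaccard(text, c))
        remaining.remove(best)  # each clean text is used at most once
        pairs.append((text, best))
    return pairs


contaminated = ["solve the equation for x", "sort a list in java"]
clean_pool = [
    "sort an array in java",
    "solve the inequality for y",
    "weather report for today",
]
pairs = match_pairs(contaminated, clean_pool)
```

A Shifted-style split would skip this pairing step and draw the two sets independently.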

Section 06

Core Detection Methods

OLMo-Detect implements 12 contamination detection methods spanning several families of techniques:

Section 07

Likelihood-Based Methods

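Perplexity: Low perplexity suggests the model is familiar with the text
Zlib Compression Ratio: Compares the model's likelihood of a text with its zlib-compressed size, so that merely repetitive text is not mistaken for memorized text
Lowercase Perplexity: Compares perplexity before and after lowercasing the text to detect memorization
Min-K% / Min-K%++: Scores a text using only the k% of tokens with the lowest likelihood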
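The likelihood-based scores above can all be computed from per-token log-probabilities. The following is a minimal sketch under stated assumptions: `token_logprobs` is a hypothetical list of natural-log probabilities that the model under test assigns to each token, the function names are illustrative, and the exact formulas used in the OLMo-Detect repository are not given in the source. Min-K%++ is omitted because it also needs the model's full next-token distribution.

```python
import math
import zlib


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (lower = more familiar)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_ratio(text, token_logprobs):
    """Total negative log-likelihood divided by the zlib-compressed size in bytes."""
    nll = -sum(token_logprobs)
    return nll / len(zlib.compress(text.encode("utf-8")))


def lowercase_ratio(token_logprobs, lowercased_logprobs):
    """Perplexity of the original text relative to its lowercased version.

    `lowercased_logprobs` is assumed to come from scoring text.lower()
    with the same model.
    """
    return perplexity(token_logprobs) / perplexity(lowercased_logprobs)


def min_k_percent(token_logprobs, k=0.2):
    """Mean log-probability of the fraction k of tokens with the lowest likelihood."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Made-up log-probabilities for a five-token text, for illustration only.
logprobs = [-0.1, -0.2, -3.0, -0.1, -2.5]
ppl = perplexity(logprobs)
mink = min_k_percent(logprobs, k=0.4)  # averages the two least likely tokens
```

In this sketch a lower perplexity or zlib ratio, or a higher (less negative) Min-K% score, points toward the text having been seen in training. Turning a score into a decision still requires a threshold, which the source does not specify.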

Section 08

Retrieval-Based Methods

Recall: Based on n-gram recall rate
Neighborhood Attack: Compares the model's loss on a text with its loss on slightly perturbed neighbor texts
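The source does not define how the n-gram recall rate is computed. One plausible reading, sketched below, is the fraction of a candidate text's distinct n-grams that also occur in a set of reference texts. The names `ngrams` and `ngram_recall`, the whitespace tokenization, and the default `n=3` are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_recall(candidate, reference_texts, n=3):
    """Fraction of the candidate's distinct n-grams found in any reference text."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0  # candidate is shorter than n tokens
    ref = set()
    for text in reference_texts:
        ref |= ngrams(text.split(), n)
    return len(cand & ref) / len(cand)


recall = ngram_recall(
    "the quick brown fox jumps",
    ["the quick brown fox sleeps"],
    n=3,
)
```

Here two of the candidate's three trigrams appear in the reference, so the recall is 2/3. A value near 1 would indicate near-verbatim overlap.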