Zing Forum

OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models

The first verbatim memorization detection benchmark for LLMs to cover all three training stages (pre-training, mid-training, and post-training), spanning 9 domains, models of multiple sizes, and a comparison of 12 detection methods.

Tags: LLM, data contamination, data leakage, benchmark, OLMo, memorization detection, RAG, model evaluation, AI safety
Published 2026-06-11 19:07 | Recent activity 2026-06-11 19:19 | Estimated read 6 min
1

Section 01

Introduction

OLMo-Detect is presented as the first verbatim memorization detection benchmark for LLMs to cover all three training stages (pre-training, mid-training, and post-training). It spans 9 domains, evaluates models of multiple sizes, and compares 12 detection methods.

2

Section 02

Original Authors and Source

  • Original Author/Maintainer: LuckyDaydreamer
  • Source Platform: GitHub
  • Original Title: OLMo-Detect: A Multi-Stage Benchmark for Verbatim Contamination Detection in Large Language Models
  • Original Link: https://github.com/LuckyDaydreamer/OLMo-Detect
  • Publication Date: June 11, 2026
  • Related Paper: Submitted to EMNLP 2026

3

Section 03

Research Background and Problem Definition

Data contamination (also called data leakage) has long been a core challenge in evaluating Large Language Models (LLMs). When training data overlaps with test data, a model may score well by reciting memorized answers rather than by actually solving the problem. The effect is especially severe for academic benchmarks and code generation evaluations.

However, existing contamination detection studies have several key limitations:

  1. Incomplete stage coverage: Most studies focus only on the pre-training stage and ignore contamination introduced during mid-training and post-training.
  2. Single domain: Evaluations rarely compare domains systematically.
  3. Inconsistent evaluation criteria: Studies use different definitions of contamination and different detection thresholds, which makes their results hard to compare.

OLMo-Detect is a comprehensive benchmark suite designed to address these issues.


4

Section 04

Full Coverage of Three Stages

OLMo-Detect is built on the OLMo 2 training pipeline and covers all three stages of modern LLM training:

Pre-training Stage

Includes four core data sources:

  • DCLM-Baseline: Large-scale web text corpus
  • peS2o: Academic paper dataset
  • OpenWebMath: Math-specific corpus
  • StarCoder: Code dataset (Assembly and Java subsets)

Mid-training Stage

  • GSM8K: Math reasoning dataset
  • StackExchange: Q&A community data

Post-training Stage

  • SFT (Supervised Fine-Tuning): Fine-tuning data (a mix of Aya and WildChat)
  • DPO (Direct Preference Optimization): Preference optimization data
  • RLVR (Reinforcement Learning with Verifiable Rewards): Reinforcement learning data drawn from GSM, MATH, and IFEval
5

Section 05

Data Alignment Strategies

The benchmark provides two data splitting methods:

Matched: Contaminated and non-contaminated data are explicitly aligned along three dimensions: text quality, time range, and lexical similarity. Because the two sets then differ mainly in whether the model saw them during training, a detector cannot score well just by picking up on distribution differences, which keeps the comparison between detection methods fair.

Shifted: Contaminated data is sampled without aligning its distribution to the non-contaminated data. This split tests how robust detection methods are under distribution shift.


6

Section 06

Core Detection Methods

OLMo-Detect implements 12 contamination detection methods spanning several families of techniques, including those listed in the following sections.

7

Section 07

Likelihood-Based Methods

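  • Perplexity: Low perplexity suggests the model has seen the text before
  • Zlib Compression Ratio: Normalizes the model's perplexity by the text's zlib-compressed size, so that text which is merely repetitive or easy to compress is not mistaken for memorized text
  • Lowercase Perplexity: Compares perplexity on the original text with perplexity on its lowercased version; a large gap suggests the exact original casing was memorized
  • Min-K% / Min-K%++: Scores a text by the average log-likelihood of its k% least likely tokens; Min-K%++ additionally normalizes each token's log-likelihood against the model's predicted distribution at that position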
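
The likelihood-based scores above can be sketched in a few lines. This is a minimal illustration, not OLMo-Detect's actual code: it assumes the per-token natural-log likelihoods have already been obtained from a model, and the function names (`perplexity`, `zlib_ratio`, `lowercase_ratio`, `min_k_percent`) are hypothetical. The exact normalization used in the benchmark may differ.

```python
import math
import zlib


def perplexity(token_logprobs):
    """exp of the negative mean token log-likelihood (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def zlib_ratio(text, token_logprobs):
    """Log-perplexity divided by the zlib-compressed size of the text in bytes.

    Assumption: one common formulation; lower values suggest memorization.
    """
    log_ppl = -sum(token_logprobs) / len(token_logprobs)
    return log_ppl / len(zlib.compress(text.encode("utf-8")))


def lowercase_ratio(token_logprobs, lowercased_token_logprobs):
    """Log-perplexity of the original text over that of its lowercased version.

    Both inputs come from scoring the same model on the two text variants.
    """
    original = -sum(token_logprobs) / len(token_logprobs)
    lowered = -sum(lowercased_token_logprobs) / len(lowercased_token_logprobs)
    return original / lowered


def min_k_percent(token_logprobs, k=0.2):
    """Mean log-likelihood of the fraction k of least likely tokens.

    Higher (less negative) values suggest the model has seen the text.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n
```

All four functions take plain lists of floats, so they can be applied to log-likelihoods from any model. In each case a threshold on the score, chosen on held-out data, would turn the score into a contaminated / not-contaminated decision.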
8

Section 08

Retrieval-Based Methods

  • Recall: Based on n-gram recall rate
  • Neighborhood Attack: Compares the model's loss on the text with its loss on slightly perturbed "neighbor" versions of the text
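
The neighborhood idea can be sketched as a single score: the model's loss on the target text minus its average loss on neighbor texts. This is a hedged illustration under stated assumptions, not the benchmark's implementation: `neighborhood_score` and `loss_fn` are hypothetical names, `loss_fn` stands in for whatever function returns the model's loss on a string, and how neighbors are generated (for example, by word substitution) is left to the caller.

```python
from statistics import mean
from typing import Callable, Sequence


def neighborhood_score(
    text: str,
    neighbors: Sequence[str],
    loss_fn: Callable[[str], float],
) -> float:
    """Loss on the target text minus the mean loss on its neighbors.

    A strongly negative score means the model finds the exact text much
    easier than near-identical variants, which suggests memorization.
    """
    if not neighbors:
        raise ValueError("at least one neighbor text is required")
    return loss_fn(text) - mean(loss_fn(n) for n in neighbors)
```

Comparing against neighbors rather than using the raw loss helps separate text that is memorized from text that is simply easy to predict, since easy text should be about as easy in its perturbed forms.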