Section 01
OLMo-Detect: Guide to the Multi-Stage LLM Verbatim Contamination Detection Benchmark
OLMo-Detect is a multi-stage benchmark framework for detecting verbatim memorization contamination in large language models (LLMs). It is designed to identify whether a model over-relies on specific text fragments from its training data. The framework proceeds in three progressive stages: coarse-grained screening, fine-grained verification, and contextual analysis. It addresses problems caused by data contamination, such as distorted evaluation results and copyright and privacy concerns, and it supports both model evaluation and data cleaning.
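To make the staged idea concrete, here is a minimal sketch of how a coarse screen followed by a fine-grained check might look. It is not OLMo-Detect's implementation. The function names (`coarse_screen`, `longest_verbatim_span`, `detect`), the whitespace tokenization, the n-gram size, and both thresholds are assumptions chosen for illustration. The contextual-analysis stage is left out because this overview does not describe how it works.

```python
from difflib import SequenceMatcher


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def coarse_screen(output, reference, n=5, threshold=0.2):
    """Hypothetical stage 1: cheap n-gram overlap filter.

    Returns True when at least `threshold` of the output's n-grams
    also occur in the reference text. Values are illustrative only.
    """
    out_grams = ngrams(output.split(), n)
    if not out_grams:
        return False
    ref_grams = ngrams(reference.split(), n)
    return len(out_grams & ref_grams) / len(out_grams) >= threshold


def longest_verbatim_span(output, reference):
    """Hypothetical stage 2: longest contiguous token run shared verbatim."""
    out_tokens = output.split()
    ref_tokens = reference.split()
    matcher = SequenceMatcher(None, out_tokens, ref_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(out_tokens), 0, len(ref_tokens))
    return out_tokens[match.a:match.a + match.size]


def detect(output, reference, min_span=8):
    """Run the screen first; verify only the candidates that pass it."""
    if not coarse_screen(output, reference):
        return {"flagged": False, "span": []}
    span = longest_verbatim_span(output, reference)
    return {"flagged": len(span) >= min_span, "span": span}
```

The ordering reflects the usual reason for staging: the set-based overlap check is cheap and discards most pairs, so the more expensive sequence alignment runs only on the candidates that survive.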