# ArxivRoll: Using Large Models to Evaluate Large Models—How to Identify Inflated Scores Caused by "Data Contamination"?

> ArxivRoll, an open-source project from an AAAI 2026 paper, proposes a dynamic benchmarking framework. It continuously scrapes newly posted arXiv papers to build private SCP tasks, uses them to detect "cheating" by large language models (LLMs) on public benchmarks, and estimates how much of an evaluation score reflects real ability and how much reflects data contamination.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-18T12:13:08.000Z
- Last activity: 2026-05-18T12:18:43.707Z
- Popularity: 150.9
- Keywords: large language models, benchmarking, data contamination, arXiv, machine learning evaluation, AAAI 2026, dynamic benchmarks, model capability evaluation
- Page link: https://www.zingnex.cn/en/forum/thread/arxivroll
- Canonical: https://www.zingnex.cn/forum/thread/arxivroll

---

## ArxivRoll Project Guide: Dynamic Benchmark Framework Solves Data Contamination Issues in LLM Evaluation

ArxivRoll is an open-source project accepted at AAAI 2026 that proposes a dynamic benchmarking framework for data contamination in large language model (LLM) evaluation. It builds private SCP tasks from newly posted arXiv papers, detects models "cheating" on public benchmarks, and estimates how much of a score comes from real ability and how much from contamination. The goal is to restore the reliability of evaluation by ensuring that tests are based on fresh content that models "could not have seen".

## Background: Data Contamination Erodes Benchmark Reliability

LLM capability evaluation relies on benchmarks such as GLUE and MMLU. Data contamination, where the training corpus contains test set content, inflates scores: a model may perform well because it "memorized the answers" rather than because it has the ability being tested. Traditional countermeasures, such as creating new test sets or maintaining dynamic question banks, treat the symptoms but not the root cause, and they cannot quantify how much of a score is due to contamination. This is the core problem ArxivRoll aims to solve.

## Core Methods: Dynamic Private SCP Task Framework and Round Mechanism

ArxivRoll is a dynamic benchmark pipeline that builds private tasks from new arXiv papers, which models could not have seen, and follows a "one-time use" philosophy so that tasks do not leak. Its core is the SCP task framework:

1. **Sorting Task (S)**: Shuffle text fragments and ask the model to restore their order, testing its understanding of logical structure;
2. **Cloze Task (C)**: Mask a sentence and ask the model to select the correct option, testing contextual inference;
3. **Prediction Task (P)**: Ask the model to choose the content that follows, testing its grasp of writing patterns.

The technical process consists of paper scraping and preprocessing, task construction, and evaluation aggregation. Rounds are organized by time window (e.g., 2024b is completed and 2025a is ongoing), cover 8 subject areas, and track how model capability changes over time.
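The three SCP task types can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the function names, the `[MASK]` placeholder, the dictionary layout, and the way distractors are supplied are all assumptions made for the example.

```python
import random

def make_sorting_task(fragments, rng):
    """Shuffle fragments; `answer` lists where each original fragment ended up."""
    order = list(range(len(fragments)))
    rng.shuffle(order)
    shuffled = [fragments[i] for i in order]
    # answer[k] is the index in `shuffled` that holds the k-th original fragment
    answer = [order.index(k) for k in range(len(fragments))]
    return {"shuffled": shuffled, "answer": answer}

def make_cloze_task(sentences, distractors, rng, mask_index):
    """Mask one sentence; options are the true sentence plus distractors."""
    target = sentences[mask_index]
    context = sentences[:mask_index] + ["[MASK]"] + sentences[mask_index + 1:]
    options = [target] + list(distractors)
    rng.shuffle(options)
    return {"context": context, "options": options, "answer": options.index(target)}

def make_prediction_task(sentences, distractors, rng):
    """Hide the last sentence; options are the true continuation plus distractors."""
    target = sentences[-1]
    options = [target] + list(distractors)
    rng.shuffle(options)
    return {"context": sentences[:-1], "options": options, "answer": options.index(target)}

# Hypothetical input: sentences taken from one freshly scraped paper.
sentences = [
    "We study benchmark contamination.",
    "We build tasks from new papers.",
    "Models score lower on the new tasks.",
    "We discuss the limitations.",
]
# Hypothetical distractors, e.g. sentences drawn from other papers.
distractors = ["We train a vision model.", "The dataset contains images."]

rng = random.Random(0)  # fixed seed so the example is reproducible
sort_task = make_sorting_task(sentences, rng)
cloze_task = make_cloze_task(sentences, distractors, rng, mask_index=1)
pred_task = make_prediction_task(sentences, distractors, rng)
```

Each task stores its own answer key, so scoring reduces to comparing the model's choice with `answer`. How the real pipeline picks fragments and distractors is not described in this post.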

## Research Findings: Quantifying Inflated Scores Caused by Data Contamination

Comparing a model's performance on public benchmarks with its performance on ArxivRoll's private tasks makes it possible to estimate how much of a score is inflated by data contamination. For example, if a model scores 90% on MMLU but 60% on ArxivRoll, the 30-point gap may reflect the impact of contamination. The framework also serves as a continuous monitoring mechanism: new test rounds are generated as new papers are published, so evaluations always rest on fresh content.
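The arithmetic behind the 90% versus 60% example can be written out as a small helper. This is only a sketch of the simplest possible reading, a plain difference between the two scores; the function name is hypothetical, and the paper's own decomposition of real ability and contamination may be defined differently.

```python
def contamination_gap(public_score, private_score):
    """Return (gap in points, gap as a share of the public score).

    Assumption: both scores are percentages on a comparable 0-100 scale.
    """
    if not (0 < public_score <= 100 and 0 <= private_score <= 100):
        raise ValueError("scores must be percentages in the range 0-100")
    gap = public_score - private_score
    return gap, gap / public_score

gap, share = contamination_gap(90.0, 60.0)
print(f"gap = {gap:.1f} points, {share:.1%} of the public score")
# prints "gap = 30.0 points, 33.3% of the public score"
```

A gap like this is an upper-level signal rather than proof: the two benchmarks measure different tasks, so part of the difference can come from task difficulty and not from contamination.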

## Usage Guide: Environment Setup and Running Steps

The project provides a complete reproduction environment:
- Environment setup: conda (`conda env create -f robench.yaml`) or pip (`pip install -r re.txt`);
- Clone the evaluation framework: `git clone https://github.com/liangzid/harness-4-arxivrollbench`;
- Running process: Scrape papers → Construct tasks → Evaluate models → Aggregate results to generate leaderboards.
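The last step of the running process, aggregating results into a leaderboard, might look like the sketch below. The input layout, the model names, and the accuracy values are made-up placeholders for illustration, and the unweighted mean over the S, C, and P tasks is an assumption, not the project's documented aggregation rule.

```python
from statistics import mean

def build_leaderboard(results):
    """Turn {model: {task: accuracy}} into [(model, mean accuracy)], best first."""
    rows = [(model, mean(list(scores.values()))) for model, scores in results.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Placeholder numbers, not real evaluation results.
results = {
    "model-b": {"S": 0.40, "C": 0.50, "P": 0.45},
    "model-a": {"S": 0.50, "C": 0.70, "P": 0.60},
}

leaderboard = build_leaderboard(results)
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {score:.2%}")
```

Keeping the per-task accuracies next to the aggregate makes it easier to see whether a model's rank is driven by one task type.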

## Limitations and Future Improvement Directions

**Limitations**:

1. Subject bias towards STEM fields, with less coverage of the humanities and social sciences;
2. Single task type (focusing on text understanding and reasoning);
3. English-centric, which is unfair to non-English models.

**Future Directions**: Expand data sources (SSRN, PubMed Central), add multilingual support, develop new tasks such as chart question answering, and refine the capability decomposition framework.

## Conclusion: Rebuilding Evaluation Trust and Paradigm Shift

ArxivRoll is not just a tool. It also promotes a paradigm shift in evaluation thinking, from "preventing models from seeing the test set" to "ensuring the test set cannot have been seen". With LLMs developing rapidly, benchmark scores need to be read with care; what truly matters is a model's real understanding and reasoning ability when it faces unknown content. The project gives researchers practical tooling and points the AI community towards a more trustworthy evaluation system.
