ArxivRoll: Using Large Models to Evaluate Large Models—How to Identify Inflated Scores Caused by "Data Contamination"?

ArxivRoll is an open-source project from an AAAI 2026 paper that proposes a dynamic benchmarking framework. It scrapes newly posted papers from arXiv, builds private SCP tasks from them, and uses the results to detect "cheating" by large language models (LLMs) on public benchmarks and to estimate how much of an evaluation score reflects real ability and how much reflects data contamination.

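Tags: Large language models, Benchmarking, Data contamination, arXiv, Machine learning evaluation, AAAI 2026, Dynamic benchmarks, Model capability evaluation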

Published 2026-05-18 20:13 | Recent activity 2026-05-18 20:18 | Estimated read 6 min

Section 01

ArxivRoll Project Guide: Dynamic Benchmark Framework Solves Data Contamination Issues in LLM Evaluation

ArxivRoll, an open-source project from a paper accepted at AAAI 2026, targets data contamination in large language model (LLM) evaluation. It builds private SCP tasks from newly posted arXiv papers, uses them to detect model "cheating" on public benchmarks, and estimates how much of a score comes from real ability and how much from contamination. The goal is to restore the reliability of evaluation by testing only on fresh content that models "could not have seen".

Section 02

Background: Data Contamination Erodes Benchmark Reliability

LLM capability evaluation relies on benchmarks such as GLUE and MMLU. When training corpora contain test set content (data contamination), scores become inflated: a model may perform well because it "memorized the answers" rather than because it has the underlying ability. Traditional countermeasures, such as creating new test sets or dynamic question banks, treat the symptoms rather than the root cause, and they cannot quantify how much of a score is due to contamination. This is the core problem ArxivRoll aims to solve.

Section 03

Core Methods: Dynamic Private SCP Task Framework and Round Mechanism

ArxivRoll is a dynamic benchmark pipeline that uses new arXiv papers to construct private tasks (which models could not have seen) and adopts a "one-time use" philosophy to avoid task leakage. The core is the SCP task framework:

Sorting Task (S): Shuffle text fragments and ask the model to restore their order, testing its understanding of logical structure;
Cloze Task (C): Mask a sentence and ask the model to select the correct option, testing contextual inference;
Prediction Task (P): Ask the model to choose the content that follows a passage, testing its grasp of writing patterns.

The technical process consists of paper scraping and preprocessing, task construction, and evaluation aggregation. The round mechanism organizes evaluations by time window (e.g., 2024b completed, 2025a ongoing), covers 8 subject areas, and tracks changes in model capability over time.
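The three SCP task types described above can be illustrated with a minimal Python sketch. This is not the project's implementation: the function names, the dictionary layout, the "[MASK]" placeholder, and the choice of four options are all assumptions made for illustration, and the distractors are assumed to be sentences taken from elsewhere that do not repeat the correct answer.

```python
import random


def _with_distractors(target, distractors, rng, n_options):
    """Mix the correct item with sampled distractors; return options and answer index."""
    options = rng.sample(distractors, n_options - 1) + [target]
    rng.shuffle(options)
    return options, options.index(target)


def make_sorting_task(fragments, rng):
    """Sorting (S): shuffle fragments; the answer restores the original order."""
    order = list(range(len(fragments)))
    rng.shuffle(order)
    shuffled = [fragments[i] for i in order]
    # answer[k] is the position in `shuffled` of the k-th original fragment
    answer = [order.index(k) for k in range(len(fragments))]
    return {"type": "sorting", "shuffled": shuffled, "answer": answer}


def make_cloze_task(sentences, distractors, rng, n_options=4):
    """Cloze (C): mask one sentence and hide it among distractor options."""
    idx = rng.randrange(len(sentences))
    context = ["[MASK]" if i == idx else s for i, s in enumerate(sentences)]
    options, answer = _with_distractors(sentences[idx], distractors, rng, n_options)
    return {"type": "cloze", "context": context, "options": options, "answer": answer}


def make_prediction_task(sentences, distractors, rng, n_options=4):
    """Prediction (P): show all but the last sentence; pick the true continuation."""
    prefix, target = sentences[:-1], sentences[-1]
    options, answer = _with_distractors(target, distractors, rng, n_options)
    return {"type": "prediction", "prefix": prefix, "options": options, "answer": answer}
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps task generation reproducible, which matters if a round's tasks need to be regenerated for auditing.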

Section 04

Research Findings: Quantifying Inflated Scores Caused by Data Contamination

Comparing a model's performance on public benchmarks with its performance on ArxivRoll's private tasks gives an estimate of how much of the public score is inflated by data contamination. For example, if a model scores 90% on MMLU but only 60% on ArxivRoll, the 30-point gap may reflect the impact of contamination. The framework also works as a continuous monitoring mechanism: new test rounds are generated as new papers are published, so evaluations always rest on fresh content.
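The arithmetic behind the 90% versus 60% example can be written out as a short Python sketch. The function below is a naive illustration and not the metric defined in the paper: `score_gap` is a hypothetical name, and because the public and private tasks differ in format and difficulty, the gap is best read as a rough indicator rather than an exact contamination measurement.

```python
def score_gap(public_score, private_score):
    """Return (absolute gap, gap as a share of the public score).

    Both scores are accuracies expressed as fractions in [0, 1].
    """
    for score in (public_score, private_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be fractions in [0, 1]")
    gap = public_score - private_score
    share = gap / public_score if public_score > 0 else 0.0
    return gap, share


# Illustrative numbers from the text: 90% on MMLU, 60% on ArxivRoll.
gap, share = score_gap(0.90, 0.60)
print(f"gap: {gap:.2f}, share of public score: {share:.1%}")
```

With these illustrative numbers the gap is 30 points, or about a third of the public score.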

Section 05

Usage Guide: Environment Setup and Running Steps

The project provides a complete reproduction environment:

Environment setup: conda (conda env create -f robench.yaml) or pip (pip install -r re.txt);
Clone the evaluation framework: git clone https://github.com/liangzid/harness-4-arxivrollbench;
Running process: Scrape papers → Construct tasks → Evaluate models → Aggregate results to generate leaderboards.
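The final step of the running process, aggregating results into a leaderboard, might look like the following Python sketch. It assumes per-task results are available as `(model, task_type, is_correct)` tuples and ranks models by the unweighted mean of their per-task-type accuracies; the tuple layout, the `aggregate` function name, and the ranking rule are assumptions for illustration, not the project's actual output format.

```python
from collections import defaultdict


def aggregate(results):
    """Build leaderboard rows from (model, task_type, is_correct) tuples.

    Rows are sorted by mean accuracy across task types, highest first.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, task_type) -> [correct, total]
    for model, task_type, is_correct in results:
        cell = counts[(model, task_type)]
        cell[0] += int(is_correct)
        cell[1] += 1

    per_model = defaultdict(dict)  # model -> {task_type: accuracy}
    for (model, task_type), (correct, total) in counts.items():
        per_model[model][task_type] = correct / total

    rows = [
        {"model": model, "by_task": acc, "mean": sum(acc.values()) / len(acc)}
        for model, acc in per_model.items()
    ]
    return sorted(rows, key=lambda row: row["mean"], reverse=True)


# Hypothetical results for two placeholder models.
sample_results = [
    ("model-a", "sorting", True),
    ("model-a", "cloze", False),
    ("model-b", "sorting", True),
    ("model-b", "cloze", True),
]
leaderboard = aggregate(sample_results)
for row in leaderboard:
    print(row["model"], f"{row['mean']:.2f}", row["by_task"])
```

Averaging per task type first, rather than pooling all items, keeps a task type with many items from dominating the ranking.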

Section 06

Limitations and Future Improvement Directions

Limitations:

Subject bias towards STEM fields, with less coverage of humanities and social sciences;
Single task type (focusing on text understanding and reasoning);
English-centric, unfair to non-English models.

Future Directions:

Expand data sources (SSRN, PubMed Central);
Add multilingual support;
Develop new tasks such as chart question answering;
Refine the capability decomposition framework.

Section 07

Conclusion: Rebuilding Evaluation Trust and Paradigm Shift

ArxivRoll is more than a tool: it promotes a shift in evaluation thinking, from "preventing models from seeing the test set" to "ensuring the test set has absolutely not been seen". With LLMs developing so rapidly, benchmark scores should be read with care; what truly matters is a model's real understanding and reasoning ability when it faces unseen content. The project gives researchers practical tooling and points the AI community toward a more reliable evaluation system.
