Section 01
ArxivRoll Project Guide: A Dynamic Benchmark Framework That Addresses Data Contamination in LLM Evaluation
ArxivRoll is an open-source project, accepted at AAAI 2026, that proposes a dynamic benchmarking framework for large language model (LLM) evaluation. To address data contamination, it scrapes newly published arXiv papers in real time and builds private SCP tasks from them, detects models that "cheat" on public benchmarks, and quantifies how much of a score reflects real ability and how much reflects contamination. The project aims to restore the reliability of evaluation by ensuring that tests are built from fresh content that models "could not have seen".
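
To make the idea of splitting a score into "real ability" and "contamination" concrete, here is a minimal sketch. It is not ArxivRoll's actual metric: the function name `split_score`, the field names, and the simple difference-based formula are all hypothetical, and the sketch assumes that the public and private scores are accuracies on the same 0 to 1 scale measured on tasks of comparable difficulty.

```python
def split_score(public_score: float, private_score: float) -> dict:
    """Split a public-benchmark score into an estimated real part and an inflated part.

    Hypothetical illustration only. Assumes both scores are accuracies in [0, 1]
    and that the private (fresh-content) benchmark is about as hard as the public one.
    """
    for name, value in (("public_score", public_score), ("private_score", private_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Whatever the model gains on the public benchmark over fresh, unseen
    # content is treated as the part possibly explained by contamination.
    inflation = max(0.0, public_score - private_score)
    real_ability = public_score - inflation

    # Share of the public score attributed to inflation (0 if the score is 0).
    inflation_share = inflation / public_score if public_score > 0 else 0.0

    return {
        "real_ability": real_ability,
        "inflation": inflation,
        "inflation_share": inflation_share,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration, not results from the project.
    result = split_score(public_score=0.8, private_score=0.6)
    for key, value in result.items():
        print(f"{key}: {value:.2f}")
```

A model that scores the same or higher on the fresh tasks gets an inflation of zero under this sketch, which reflects the intuition that only an unexplained advantage on public data is suspicious.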