Section 01
ArxivRoll: A Dynamic Benchmark Tool to Address Data Contamination in Large Model Evaluation
ArxivRoll is a dynamic benchmark pipeline that targets data contamination in large language model evaluation. It uses a one-time-pad-style framework, in which each test set is used for a single evaluation, together with the SCP (Sequence Sorting, Cloze Test, Passage Prediction) tasks to convert recent arXiv papers into private evaluation tasks. The data is released only after the evaluation is complete, which lets the pipeline detect and quantify how much model evaluation scores are inflated.
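
To make the three SCP task types concrete, here is a minimal Python sketch of how a passage, already split into sentences, might be turned into one sequencing item, one cloze item, and one prediction item. This is an illustration only, not ArxivRoll's actual implementation: the function names, the dictionary layout, the `[MASK]` token, and the sentence-level granularity are all assumptions made for this example.

```python
import random
import string


def make_sequencing(sentences, rng):
    """Shuffle the sentences; the answer is the label order that restores the original."""
    order = list(range(len(sentences)))
    rng.shuffle(order)
    labels = string.ascii_uppercase[: len(sentences)]  # sketch: at most 26 sentences
    # Each label shows one original sentence, in shuffled order.
    options = {labels[i]: sentences[src] for i, src in enumerate(order)}
    # Map each original position to the label it is displayed under.
    label_of = {src: labels[i] for i, src in enumerate(order)}
    answer = "".join(label_of[j] for j in range(len(sentences)))
    return {"kind": "sequencing", "options": options, "answer": answer}


def make_cloze(sentences, rng):
    """Hide one randomly chosen sentence; the answer is the hidden sentence."""
    idx = rng.randrange(len(sentences))
    context = [s if i != idx else "[MASK]" for i, s in enumerate(sentences)]
    return {"kind": "cloze", "context": context, "answer": sentences[idx]}


def make_prediction(sentences):
    """Show everything except the last sentence; the answer is the last sentence."""
    return {"kind": "prediction", "context": sentences[:-1], "answer": sentences[-1]}


def build_scp_tasks(sentences, seed=0):
    """Build one task of each type from a single passage (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so the private test set is reproducible
    return [
        make_sequencing(sentences, rng),
        make_cloze(sentences, rng),
        make_prediction(sentences),
    ]
```

Calling `build_scp_tasks(sentences)` on a list of at least two sentences returns three dictionaries whose `answer` fields stay private until the evaluation round is over. Because the items are derived mechanically from papers published after a model's training cutoff, a model is unlikely to have seen them during training.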