Section 01
[Main Floor] Multi-Level Annotator Modeling: Core Method to Improve the Reproducibility of AI Evaluation
The widespread use of generative AI models has made the reproducibility of evaluation a key concern. To address annotator variation in AI evaluation, this study proposes a multi-level bootstrap sampling method that models annotator behavior, and it analyzes the trade-off between the number of items N and the number of annotations per item K. The result is methodological guidance for evaluating generative AI models reliably and for reaching statistical significance, with the aim of addressing the reproducibility crisis in the AI field.
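To make the idea of a multi-level bootstrap over N items and K annotations per item concrete, here is a minimal sketch. It is not the study's actual procedure: the function names (`make_synthetic_ratings`, `multilevel_bootstrap_means`, `standard_error`), the mean-score metric, and the synthetic ratings are all assumptions made for illustration. The sketch resamples items with replacement, then resamples annotations within each drawn item, and uses the spread of the resulting means as an estimate of how much an evaluation score would vary across repeated studies.

```python
import random
import statistics


def make_synthetic_ratings(n_items, k_per_item, seed=0):
    """Build made-up ratings: ratings[i] lists the annotator scores for item i.

    Purely illustrative data: each item has a latent quality, and each
    annotator adds independent noise. Not taken from the study.
    """
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_items):
        quality = rng.gauss(0.6, 0.15)  # item-level variation
        ratings.append([quality + rng.gauss(0.0, 0.2) for _ in range(k_per_item)])
    return ratings


def multilevel_bootstrap_means(ratings, n_items, k_per_item, n_boot=1000, seed=0):
    """Return n_boot bootstrap means for a simulated study with N items and K annotations."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Level 1: resample items with replacement.
        sampled_items = rng.choices(ratings, k=n_items)
        # Level 2: resample annotations within each sampled item, then average.
        item_means = [
            statistics.fmean(rng.choices(item, k=k_per_item)) for item in sampled_items
        ]
        means.append(statistics.fmean(item_means))
    return means


def standard_error(ratings, n_items, k_per_item, **kwargs):
    """Bootstrap standard error of the mean score for a given (N, K) design."""
    return statistics.stdev(
        multilevel_bootstrap_means(ratings, n_items, k_per_item, **kwargs)
    )


if __name__ == "__main__":
    pool = make_synthetic_ratings(n_items=50, k_per_item=5, seed=1)
    # Compare a few hypothetical (N, K) designs drawn from the same pool.
    for n, k in [(10, 1), (10, 5), (200, 1), (200, 5)]:
        print(f"N={n:>3} K={k}  SE={standard_error(pool, n, k):.4f}")
```

Running the script prints one standard error per (N, K) design, which is one way to explore the N versus K trade-off on a pilot annotation pool before committing an annotation budget. Comparing two models would need a paired variant that resamples the same items and annotators for both systems.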