Zing Forum

CoCoReviewBench: An Evaluation Benchmark for AI Reviewers Focused on Completeness and Correctness

This article introduces CoCoReviewBench, a new evaluation benchmark for AI review systems. By focusing on completeness and correctness rather than simple text overlap with human reviews, it addresses core weaknesses in current AI-review assessment and builds a reliable evaluation suite from 3,900 papers drawn from ICLR and NeurIPS.

Tags: AI review · evaluation benchmark · completeness · correctness · academic peer review · hallucination
Published 2026-05-08 23:44 · Recent activity 2026-05-11 12:21 · Estimated read 6 min

Section 01

CoCoReviewBench: Introduction to the New AI Reviewer Evaluation Benchmark

CoCoReviewBench shifts AI-review evaluation away from measuring text overlap with human reviews and toward assessing completeness and correctness directly. Built on 3,900 papers from ICLR and NeurIPS, it addresses the core weaknesses of current AI-review assessment and offers a new evaluation paradigm for the development of AI review technology.


Section 02

Evaluation Dilemmas of AI Review Systems

As large language models improve, AI-assisted paper review has become a hot topic, but evaluating AI reviews scientifically remains a challenge. Existing metrics mostly measure text overlap between AI reviews and human reviews, which is fundamentally flawed: a human review may cover only some of the key issues, or may contain incorrect judgments, so an AI that merely imitates surface features masks its own limitations and hinders the healthy development of the technology.
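As a toy illustration of this flaw, the sketch below uses Jaccard word overlap as a crude stand-in for surface-similarity metrics such as ROUGE (the article does not name the exact metrics it critiques). It rewards a review that echoes the human reviewer's wording with a perfect score, while a differently worded review scores near zero regardless of merit:

```python
def token_overlap(review_a: str, review_b: str) -> float:
    """Jaccard overlap of word sets: a crude stand-in for
    surface-similarity metrics such as ROUGE."""
    a = set(review_a.lower().split())
    b = set(review_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


human = "the baselines are weak and the ablation study is missing"
echo = "the baselines are weak and the ablation study is missing"
substantive = "results lack significance tests; claimed speedup not reproduced"

# Pure imitation of the human review earns a perfect overlap score,
# while a differently worded (possibly valid) critique scores zero.
print(token_overlap(human, echo))         # 1.0
print(token_overlap(human, substantive))  # 0.0
```

This is exactly the failure mode the benchmark targets: the metric cannot tell imitation from substance.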


Section 03

Completeness and Correctness: Design of Dual Evaluation Dimensions

The core innovation of CoCoReviewBench is a pair of independent evaluation dimensions: completeness and correctness. Completeness asks whether the AI covers all of a paper's key issues; by constructing category-specific subsets, the benchmark avoids counting a human reviewer's omissions as AI errors. Correctness asks whether the issues the AI raises are real and reasonable, using expert annotations drawn from the reviewer-author-meta-review discussion chain to filter out unreliable content.
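The two dimensions can be sketched as simple coverage and validity ratios. This is a minimal illustration only: the exact-string matching and the `key_issues`/`valid_issues` sets are hypothetical stand-ins for the benchmark's expert-annotated subsets and discussion-chain verification, which the article does not specify in detail:

```python
def completeness(ai_issues: list[str], key_issues: list[str]) -> float:
    """Fraction of annotated key issues the AI review covers.

    `key_issues` stands in for a category-specific subset of
    expert-verified issues; human omissions are simply absent from it,
    so they are never charged against the AI.
    """
    if not key_issues:
        return 1.0
    return sum(1 for issue in key_issues if issue in ai_issues) / len(key_issues)


def correctness(ai_issues: list[str], valid_issues: set[str]) -> float:
    """Fraction of AI-raised issues judged real and reasonable.

    `valid_issues` plays the role of issues confirmed through the
    reviewer-author-meta-review discussion chain; anything outside it
    counts as a potential hallucination.
    """
    if not ai_issues:
        return 1.0
    return sum(1 for issue in ai_issues if issue in valid_issues) / len(ai_issues)


# Toy example: the AI covers 2 of 3 key issues, and 2 of its 3 claims are valid.
ai = ["weak baselines", "missing ablation", "nonexistent flaw"]
key = ["weak baselines", "missing ablation", "limited datasets"]
valid = {"weak baselines", "missing ablation"}
print(completeness(ai, key))   # 2 of 3 key issues covered
print(correctness(ai, valid))  # 2 of 3 raised issues are valid
```

Keeping the two scores separate is the point: an AI could raise only safe, valid issues (high correctness, low completeness) or flood the review with claims (high completeness, low correctness), and neither failure is visible from a single overlap score.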


Section 04

Dataset Construction and Scale of CoCoReviewBench

The benchmark integrates 3,900 papers and their associated review data from two top conferences, ICLR and NeurIPS, making it one of the largest benchmarks of its kind. Dataset construction accounts for domain diversity and screens for review quality, ensuring the reliability and generalizability of evaluation results.


Section 05

Key Findings: Current Status and Limitations of AI Reviews

Analysis based on CoCoReviewBench reveals that current AI review systems have clear limitations in correctness and are prone to hallucination (pointing out flaws that do not exist), especially on complex technical papers. Reasoning models produce higher-quality reviews than models that generate reviews directly, suggesting that strengthening reasoning ability is a key path to better review quality.


Section 06

Implications for Academic Publishing

The release of CoCoReviewBench gives AI review technology a reliable evaluation tool and establishes a new paradigm: from imitating human reviews to pursuing substantive quality. This can accelerate the practical deployment of AI-assisted review systems as tools that reduce reviewers' workload and improve review quality.


Section 07

Open Source and Community Contributions

The research team has open-sourced the benchmark dataset and evaluation models, providing resources for follow-up research in academia and industry. This openness fosters a healthy technical ecosystem and helps move AI review technology from the laboratory into practice; the community can build on it for algorithm improvement, model comparison, and methodology research.


Section 08

Future Research Directions

Building on these preliminary results, future research can explore several questions: how to reduce hallucination in AI reviews; how to design more effective reasoning mechanisms that deepen a model's understanding of complex technical content; and how to balance the completeness and correctness of reviews. CoCoReviewBench provides a solid starting point for these studies.