Zing Forum


Large-scale Scientist Assessment Reveals: Modern AI Lacks Imagination and Critical Negation Capability in Scientific Innovation

A large-scale assessment that invited the authors of 121,640 preprints and drew responses from 6,749 scientists, the largest of its kind, identified three key limitations of current AI in scientific hypothesis generation: non-reasoning models fall into "groupthink", no model spontaneously proposes a null hypothesis, and automatic evaluation agrees only weakly with human experts' judgments.

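AI for Science, Scientific Discovery, Hypothesis Generation, Null Hypothesis, Human Feedback, Interdisciplinary Evaluation, LLM Limitations, Scientific Reasoning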
Published 2026-06-07 00:39 | Recent activity 2026-06-09 10:21 | Estimated read 5 min

Section 01

Introduction: Large-scale Scientist Assessment Finds Three Core Limitations of Modern AI in Scientific Innovation

A large-scale assessment that invited the authors of 121,640 preprints and drew responses from 6,749 scientists found three core limitations of current AI in scientific hypothesis generation: non-reasoning models fall into "groupthink", no model spontaneously proposes a null hypothesis, and automatic evaluation agrees only weakly with human experts' judgments. The study also proposes a reward model trained on human feedback that improves accuracy by 27%, approaching the level of agreement seen in peer review.


Section 02

Research Background and Motivation

Optimistic predictions in recent years that AI will accelerate scientific discovery have lacked empirical support. This study addresses that gap with the largest "scientist-in-the-loop" assessment to date. The research team invited the authors of 121,640 recent preprints in biology, medicine, chemistry, and the social sciences. In the end, 6,749 scientists returned 25,139 sets of ratings, evaluating AI-generated follow-up research ideas on four dimensions: novelty, empirical feasibility, probability of being true, and willingness to adopt.


Section 03

Key Findings: Three Limitations of AI's Scientific Thinking

  1. Homogenized Thinking and Lack of Null Hypotheses: Non-reasoning LLMs tend to fall into "groupthink", and no model spontaneously proposes a null hypothesis (the baseline hypothesis against which scientific claims are tested).
  2. Disciplinary Differences and Scientists' Preferences: Social scientists are more tolerant of risk, senior scholars are stricter with AI-generated ideas, and scientists generally prefer ideas close to their own views.
  3. Crisis in Automatic Evaluation Reliability: Current automatic evaluation methods agree only weakly with human experts' judgments, and retrieval-augmented generation (RAG) and scientist-persona prompts bring only marginal benefits.
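
The summary does not say which statistic the study used to measure consistency between automatic evaluation and human ratings. As a rough illustration only, one common choice is Spearman rank correlation, sketched below in plain Python. The function names and the sample numbers are made up for this example and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical numbers for illustration, NOT data from the study:
# one human rating and one automatic score per AI-generated idea.
human_ratings = [4, 2, 5, 1, 3]
auto_scores = [0.7, 0.6, 0.8, 0.65, 0.5]
rho = spearman(human_ratings, auto_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would mean the automatic scorer orders ideas much as the scientists do; a value near 0 corresponds to the kind of weak consistency the finding describes.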

Section 04

Breakthrough: Reward Model Based on Human Feedback

The research team proposed a reward model post-trained on human ratings. They trained Qwen3-14B on the 25,139 sets of human ratings. Compared with SOTA models, accuracy increased by 27%, a level comparable to the agreement between independent peer reviewers, and the model effectively captured differences in evaluation standards across disciplines.
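
The summary does not define how "accuracy" was computed. One convention often used for reward models is pairwise accuracy: the share of idea pairs that the model ranks in the same order as the human raters. The sketch below assumes that definition purely for illustration; `pairwise_accuracy` is a hypothetical helper, and the numbers are made up rather than drawn from the study.

```python
from itertools import combinations


def pairwise_accuracy(human, predicted):
    """Share of idea pairs that the model orders the same way humans did.

    Pairs the humans rated equally are skipped. A predicted tie on a pair
    that the humans did separate counts as a miss.
    """
    if len(human) != len(predicted):
        raise ValueError("human and predicted must have the same length")
    agree = total = 0
    for i, j in combinations(range(len(human)), 2):
        if human[i] == human[j]:
            continue  # no human preference to compare against
        total += 1
        # Same sign of the two differences means the same ordering.
        if (human[i] - human[j]) * (predicted[i] - predicted[j]) > 0:
            agree += 1
    if total == 0:
        raise ValueError("no pairs with distinct human ratings")
    return agree / total


# Hypothetical numbers for illustration, NOT data from the study.
human_scores = [4, 2, 5, 1, 3]
model_scores = [0.7, 0.6, 0.8, 0.65, 0.5]
accuracy = pairwise_accuracy(human_scores, model_scores)
print(f"pairwise accuracy = {accuracy:.2f}")
```

Under this assumed definition, the same calculation applied to two independent human reviewers gives a natural ceiling to compare a reward model against.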


Section 05

Practical Implications and Future Directions

Implications:

  1. AI is a collaborator that needs human guidance, not a replacement for scientists.
  2. Be alert to over-reliance on automatic evaluation metrics.
  3. Pay attention to differences in AI's performance across disciplines.

Improvement Directions: Cultivate AI's capacity for critical negation (proposing null hypotheses), systematically integrate human feedback into training and evaluation, and develop flexible systems that adapt across domains.


Section 06

Conclusion: AI-Human Collaboration is the Future of Scientific Innovation

Current AI lacks the ability to propose disruptive hypotheses and to engage in critical negation; its ideas stay confined to known paths. The most valuable scientific discoveries will still require deep collaboration between humans and AI, and human insight remains central to posing transformative scientific questions.