# Large-Scale Scientist Assessment Finds Modern AI Lacks Imagination and Critical Negation in Scientific Innovation

> A large-scale assessment that invited the authors of 121,640 preprints and collected ratings from 6,749 scientists, the largest of its kind, identified three key limitations of current AI in scientific hypothesis generation: non-reasoning models fall into "groupthink", no model spontaneously proposes null hypotheses, and automatic evaluation agrees only weakly with human expert judgment.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-06T16:39:28.000Z
- Last activity: 2026-06-09T02:21:19.466Z
- Popularity: 84.3
- Keywords: AI for Science, Scientific Discovery, Hypothesis Generation, Null Hypothesis, Human Feedback, Cross-disciplinary Evaluation, LLM Limitations, Scientific Reasoning
- Page link: https://www.zingnex.cn/en/forum/thread/ai-c90a8d49
- Canonical: https://www.zingnex.cn/forum/thread/ai-c90a8d49
- Markdown source: floors_fallback

---

## Introduction: Three Core Limitations of Modern AI in Scientific Innovation

An assessment that invited the authors of 121,640 preprints and gathered ratings from 6,749 scientists found three core limitations of current AI in scientific hypothesis generation: non-reasoning models fall into "groupthink", no model spontaneously proposes null hypotheses, and automatic evaluation agrees only weakly with human expert judgment. The study also proposes a reward model trained on human feedback that improves accuracy by 27%, approaching the level of consistency seen in peer review.

## Research Background and Motivation

Optimistic predictions that AI will accelerate scientific discovery have so far lacked empirical support. This study addresses that gap with the largest "scientist-in-the-loop" assessment to date. The research team invited the authors of 121,640 recent preprints in biology, medicine, chemistry, and the social sciences. In the end, 6,749 scientists returned 25,139 sets of ratings, scoring AI-generated follow-up research ideas on four dimensions: novelty, empirical feasibility, probability of being true, and willingness to adopt.

## Key Findings: Three Limitations of AI's Scientific Thinking

1. **Homogenized Thinking and Lack of Null Hypotheses**: Non-reasoning LLMs tend to fall into "groupthink", and no model spontaneously proposes a null hypothesis (the core benchmark hypothesis in scientific research).
2. **Disciplinary Differences and Scientists' Preferences**: Social scientists are more tolerant of risk, senior scholars judge AI-generated ideas more strictly, and scientists generally prefer ideas that resemble their own views.
3. **Crisis in Automatic Evaluation Reliability**: Current automatic evaluation methods agree only weakly with human expert judgment, and retrieval-augmented generation (RAG) and scientist persona prompts bring only marginal benefits.
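The post does not say which statistic the study uses to measure agreement between automatic evaluation and human experts. As one plausible illustration, the sketch below computes a Spearman rank correlation between human ratings and automatic scores for the same ideas. The function names and the sample numbers are hypothetical and are not taken from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over every item tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: novelty ratings from scientists and scores from an
# automatic evaluator for the same six ideas (made-up numbers).
human_ratings = [4, 2, 5, 3, 1, 4]
auto_scores = [0.61, 0.58, 0.66, 0.70, 0.40, 0.52]
print(f"Spearman rho: {spearman(human_ratings, auto_scores):.3f}")
```

A value near 1 would mean the automatic evaluator ranks ideas much as the scientists do, and a value near 0 would correspond to the weak agreement the study reports.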

## Breakthrough: Reward Model Based on Human Feedback

The research team proposed a post-training reward model built on human ratings. A Qwen3-14B model trained on the 25,139 sets of human ratings improved accuracy by 27% over state-of-the-art (SOTA) models, reached the level of consistency seen between independent peer reviewers, and captured differences in evaluation standards across disciplines.
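The post gives no detail on the training objective or on how accuracy is defined. A common setup for reward models, assumed here purely for illustration, trains on pairs of items with a Bradley-Terry loss and reports the fraction of pairs where the model scores the human-preferred item higher. The function names and scores below are hypothetical.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the human-preferred idea wins.

    Equals -log(sigmoid(d)) with d = score_preferred - score_rejected,
    written in a numerically stable form.
    """
    d = score_preferred - score_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def pairwise_accuracy(pairs):
    """Fraction of pairs in which the preferred idea receives the higher score."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)


# Hypothetical reward-model scores: (idea scientists rated higher, idea rated lower).
scored_pairs = [(2.0, 1.0), (0.5, 0.7), (1.0, 0.0), (3.0, 2.5)]

mean_loss = sum(bradley_terry_loss(p, r) for p, r in scored_pairs) / len(scored_pairs)
print(f"mean loss: {mean_loss:.3f}")
print(f"pairwise accuracy: {pairwise_accuracy(scored_pairs):.2f}")
```

Under this assumed setup, lowering the loss pushes the preferred idea's score above the rejected one, which is what the accuracy figure then measures.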

## Practical Implications and Future Directions

**Implications**:

1. AI is a collaborator that needs human guidance rather than a replacement.
2. Be alert to over-reliance on automatic evaluation metrics.
3. Pay attention to AI's performance differences across disciplines.

**Improvement Directions**:

- Cultivate AI's critical negation thinking (proposing null hypotheses).
- Systematically integrate human feedback into training and evaluation.
- Develop flexible systems that adapt across domains.

## Conclusion: AI-Human Collaboration Is the Future of Scientific Innovation

Current AI cannot yet propose disruptive hypotheses or engage in critical negation, and its ideas stay confined to known paths. The most valuable scientific discoveries will still require deep collaboration between humans and AI, with human insight remaining central to posing transformative scientific questions.
