# LLM Automated Reproducibility Assessment: A New Paradigm for Verifying Social Science Research

> This study demonstrates how to use Large Language Models (LLMs) to automate reproducibility assessments in social and behavioral sciences. In an analysis of 76 published studies, LLMs achieved 96% consistency in qualitative conclusions, surpassing the 74% of human re-analysts, providing a scalable new tool for systematic auditing of empirical results.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-11T17:58:36.000Z
- Last activity: 2026-06-12T03:54:27.477Z
- Popularity: 145.1
- Keywords: reproducibility, large language models, social science, behavioral science, research verification, effect size, automated assessment, scientific research, statistical analysis, research auditing
- Page link: https://www.zingnex.cn/en/forum/thread/llm-3c065fea
- Canonical: https://www.zingnex.cn/forum/thread/llm-3c065fea
- Markdown source: floors_fallback

---

## Introduction: LLM Automated Reproducibility Assessment as a New Paradigm for Verifying Social Science Research

This post summarizes the paper 'Automated reproducibility assessments in the social and behavioral sciences using large language models', published on arXiv in June 2026. The paper explores the use of Large Language Models (LLMs) to automate reproducibility assessments in the social and behavioral sciences. In an analysis of 76 published studies, LLMs achieved 96% consistency in qualitative conclusions, compared with 74% for human re-analysts, which points to a scalable new tool for systematic auditing of empirical results.

## Background: Reproducibility Crisis in Social Sciences and Dilemmas of Traditional Assessment

Over the past decade, the scientific community has faced a reproducibility crisis: many published results are difficult to replicate. The problem is particularly prominent in the social and behavioral sciences, where complex statistical methods and subjective data coding are common. Traditional assessments rely on human re-analysts, an approach that is resource-intensive, slow, and difficult to scale, which has spurred the need for more efficient assessment methods.

## Research Design and Methods

Seventy-six social and behavioral science studies with explicit hypothesis statements were selected. The assessment process has four steps:

1. Obtain the dataset and analysis code of the original study.
2. Build an automated pipeline in which LLMs re-analyze the data and calculate effect sizes.
3. Hire professional statisticians to conduct independent re-analyses.
4. Compare the LLM results, the human results, and the original findings.

Two evaluation metrics were used:

- Quantitative: effect size recovery rate, with a tolerance of Cohen's d ±0.05.
- Qualitative: conclusion consistency, a binary judgment on whether the original hypothesis is supported.
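The two metrics can be illustrated with a minimal Python sketch. It assumes a two-group design and the pooled-standard-deviation form of Cohen's d. The paper's actual pipeline is not described in this summary, so the function names `cohens_d`, `effect_recovered`, and `conclusion_consistent` are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


def effect_recovered(original_d, reanalysis_d, tolerance=0.05):
    """Quantitative metric: is the re-analyzed d within the tolerance of the original?"""
    return abs(reanalysis_d - original_d) <= tolerance


def conclusion_consistent(original_supported, reanalysis_supported):
    """Qualitative metric: do both analyses agree on whether the hypothesis is supported?"""
    return original_supported == reanalysis_supported


# Toy numbers for illustration only, not data from the paper.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d, effect_recovered(0.45, 0.42), conclusion_consistent(True, False))
```

Under this reading, a re-analysis can miss the ±0.05 tolerance and still reach the same qualitative conclusion as the original study.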

## Research Results: LLMs Outperform Human Analysts Across the Board

Among the 69 studies with valid effect size estimates, LLMs recovered the original effect size in 41% of cases versus 34% for human analysts. On qualitative conclusion consistency, LLMs reached 96% while humans reached only 74%, a gap of 22 percentage points. The low effect size recovery rates for both groups reflect non-standard effect size reporting in social science research rather than flaws in the tool.

## Core Reasons for LLMs' Superior Performance

1. Reduced human error (code transcription, parameter setting, etc.).
2. A standardized analysis process (unified steps that avoid deviations).
3. No influence from cognitive biases (such as confirmation bias or the anchoring effect).
4. Unlimited patience and consistency (no fluctuations due to fatigue).

## Limitations of the Current Method

1. LLMs could not generate valid effect sizes for 9% of the studies (due to complex data, unclear method descriptions, etc.).
2. Dependence on the quality of the original data and code.
3. The black box problem (an opaque decision-making process).
4. Lack of deep domain expertise.
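The 9% figure can be cross-checked against the study counts given in the results. The short calculation below assumes that the 69 studies with valid effect size estimates are exactly the studies for which the LLM pipeline succeeded. The summary does not state this explicitly.

```python
TOTAL_STUDIES = 76           # studies selected for the analysis
WITH_VALID_EFFECT_SIZE = 69  # studies with valid effect size estimates

# Assumption: the remaining studies are the ones where no valid effect size was produced.
without_estimate = TOTAL_STUDIES - WITH_VALID_EFFECT_SIZE
share = without_estimate / TOTAL_STUDIES

print(f"{without_estimate} of {TOTAL_STUDIES} studies ({share:.1%})")
# prints "7 of 76 studies (9.2%)"
```

Seven of 76 studies is about 9.2%, which matches the reported 9% after rounding.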

## Implications for the Scientific Community and Future Outlook

Implications:

- Democratization of reproducibility assessment through reduced costs.
- Systematic auditing of published results becomes feasible.
- Standardization of research practices (data, method, and code norms) is promoted.
- A new model of human-machine collaboration emerges: LLM screening followed by in-depth human judgment.

Outlook:

- Expand to more disciplines.
- Handle complex experimental designs.
- Establish assessment standards.
- Integrate into journal publishing processes.
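The human-machine collaboration model (LLM screening plus in-depth human judgment) could look roughly like the triage sketch below. This is an illustration under stated assumptions, not the paper's workflow. The `ScreeningResult` record, its fields, and the escalation rules in `needs_human_review` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScreeningResult:
    """Hypothetical record of one LLM re-analysis (not a structure from the paper)."""
    study_id: str
    original_supported: bool       # did the original study support its hypothesis?
    llm_d: Optional[float]         # None when no valid effect size was produced
    llm_supported: Optional[bool]  # None when no qualitative conclusion was reached


def needs_human_review(result: ScreeningResult) -> List[str]:
    """Return the reasons a study should be escalated to a human re-analyst."""
    reasons = []
    if result.llm_d is None:
        reasons.append("no valid effect size")
    if result.llm_supported is None:
        reasons.append("no qualitative conclusion")
    elif result.llm_supported != result.original_supported:
        reasons.append("conclusion differs from original")
    return reasons


# Toy records for illustration only.
queue = [
    ScreeningResult("study-a", True, 0.42, True),
    ScreeningResult("study-b", True, None, None),
    ScreeningResult("study-c", True, 0.10, False),
]
escalated = {r.study_id: needs_human_review(r) for r in queue if needs_human_review(r)}
print(escalated)
```

In this sketch, studies the LLM handles cleanly pass through without escalation, and human effort goes to the cases where screening failed or disagreed with the original finding.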
