Zing Forum


Revealing Reproducibility Illusions in Large Language Model APIs: Same Prompt, Different Answers

A study submitted to Nature Machine Intelligence systematically exposes a reproducibility problem in mainstream large language model (LLM) APIs: the same prompt can yield inconsistent outputs across calls.

Large Language Models · Reproducibility · API Reliability · AI Research Methodology · Model Evaluation · Scientific Experimentation
Published 2026-05-11 09:18 · Recent activity 2026-05-11 10:27 · Estimated read 6 min

Section 01

[Introduction] Reproducibility Illusions in LLM APIs: Why Do Same Prompts Yield Different Outputs?

A study submitted to Nature Machine Intelligence systematically reveals the reproducibility problem of mainstream Large Language Model (LLM) APIs: the same prompt can produce inconsistent outputs across calls. This affects more than user experience; it strikes at reproducibility, the foundation of scientific research and practical application. The genai-reproducibility-protocol project quantifies this overlooked "reproducibility illusion" and proposes standardized solutions.


Section 02

Background: Reproducibility Crisis Undermines the Foundation of AI Research

Reproducibility is the cornerstone of scientific research. However, in the LLM field, even when controlling variables like prompts and model versions, API calls still produce different outputs, eroding the reliability of academic research. Worse still, many researchers do not fully recognize or report this issue, only presenting "representative" outputs, which may mislead judgments about model capabilities.


Section 03

Project and Methodology: Standardized Measurement of Reproducibility Issues

The genai-reproducibility-protocol project has been submitted to Nature Machine Intelligence (2026), with the core goal of establishing a standardized protocol to measure LLM API reproducibility. Key contributions include: standardized testing protocols, multi-model comparative analysis, quantification of influencing factors, and best practice recommendations. The measurement framework uses multiple calls (100+ times), with indicators covering response consistency rate, semantic similarity distribution, key information variation, confidence calibration, etc.
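The paper's exact metric definitions are not public; as an illustration only, a minimal sketch of two of the listed indicators, exact-match consistency rate and a crude surface-level similarity standing in for semantic similarity, might look like this (all function names here are hypothetical, not from the protocol):

```python
from collections import Counter
from difflib import SequenceMatcher

def consistency_rate(outputs):
    """Fraction of runs whose output exactly matches the modal (most common) output."""
    if not outputs:
        return 0.0
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)

def pairwise_similarity(outputs):
    """Mean surface similarity over all pairs of outputs (SequenceMatcher ratio).

    A real protocol would likely use embedding-based semantic similarity instead.
    """
    pairs = [(a, b) for i, a in enumerate(outputs) for b in outputs[i + 1:]]
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

# Example: 5 hypothetical responses to the same prompt
runs = ["42", "42", "The answer is 42.", "42", "41"]
print(consistency_rate(runs))  # 0.6 — 3 of 5 runs match the modal output
```

In practice each prompt would be called 100+ times, as the protocol specifies, and both metrics reported as distributions rather than single numbers.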


Section 04

Technical Roots: Four Major Causes of Output Differences Under the Same Prompt

The roots of inconsistent LLM API outputs fall into four categories:

1. Randomness mechanisms: sampling strategies introduce variation by design, and even at temperature 0 residual nondeterminism can remain.
2. Hardware and parallel computing: GPU scheduling changes the order of floating-point operations, and the tiny rounding differences compound through the network into different outputs.
3. API opacity: commercial APIs are black boxes; users cannot inspect the hardware, weights, or serving parameters behind a call.
4. Model update drift: providers may silently update model weights in the background without disclosure.
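Cause 2 can be demonstrated without any GPU: floating-point addition is not associative, so summing the same numbers in a different order, as a parallel reduction may do from run to run, can yield slightly different results. A minimal pure-Python sketch:

```python
import random

# A parallel reduction on a GPU may sum the same numbers in a different
# order on each run. Floating-point addition is not associative, so the
# result (e.g. an attention logit) can differ in its last bits, and those
# differences can flip a sampled token downstream.
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)
shuffled = values[:]
random.shuffle(shuffled)
reordered = sum(shuffled)

print(forward == reordered)      # often False: same numbers, different order
print(abs(forward - reordered))  # tiny but typically nonzero difference
```

The discrepancy is minuscule per operation, but a transformer performs billions of such operations per token, and greedy decoding can amplify a single flipped logit comparison into a visibly different continuation.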


Section 05

Research Findings: Reproducibility Issues Are More Severe Than Expected

Preliminary results show that the consistency rate for certain tasks (e.g., code generation, mathematical reasoning) is below 50%, meaning that the "typical" results in papers may just be random samples. More worryingly, there are systematic biases in key information variation, and models may give contradictory factual statements without a warning mechanism.


Section 06

Impact and Recommendations: Response Strategies for Academia and Industry

For Academia: Call for mandatory reporting of statistical results from multiple runs, open-sourcing of experimental protocols, establishment of reproducibility benchmarks, and distinction between exploratory and confirmatory research. For Industry: Recommend using output aggregation (voting from multiple calls), deterministic modes, version locking, and internal confidence assessment mechanisms to reduce business risks.
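The output-aggregation recommendation can be sketched as a simple majority vote over repeated calls. `call_fn` and `flaky_api` below are hypothetical stand-ins for a real LLM API client, not part of the project:

```python
import random
from collections import Counter

def aggregate_by_vote(call_fn, prompt, n=5):
    """Call the model n times and return the modal answer and its agreement rate.

    `call_fn` is a hypothetical stand-in for a real LLM API call.
    A low agreement rate is itself a useful risk signal for the business.
    """
    answers = [call_fn(prompt) for _ in range(n)]
    (winner, count), = Counter(answers).most_common(1)
    return winner, count / n

# Simulated flaky "API": usually answers correctly, sometimes not
random.seed(1)
def flaky_api(prompt):
    return "4" if random.random() < 0.7 else "5"

answer, agreement = aggregate_by_vote(flaky_api, "What is 2 + 2?", n=11)
print(answer, agreement)
```

Voting trades cost (n calls instead of one) for stability; pairing it with version locking and any deterministic mode the provider offers narrows the remaining variance further.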


Section 07

Future Directions: Unresolved Issues and Open Discussion

The project has initiated an important dialogue on LLM reliability, but there are still unresolved issues: How to balance creativity and determinism? How much transparency responsibility should API providers bear? Is there a technical solution to fundamentally solve reproducibility? The project team will continue to update the protocol and call on the community to participate in solving this issue.