Zing Forum

Reliability Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Reproducibility Framework

The research team proposed a medical Q&A evaluation framework that treats reproducibility as a primary metric. They found that even with low-temperature parameters, the models' self-consistency only reached 20%, revealing safety risks that single-round benchmark tests fail to detect.

Tags: Medical Q&A · Large Language Models · Reproducibility · Model Evaluation · Medical AI · Consistency Testing
Published 2026-04-12 16:56 · Recent activity 2026-04-14 10:18 · Estimated read 6 min

Section 01

Introduction: A New Framework for Evaluating the Reliability of Medical Q&A LLMs

Core Insights

The research team proposed a medical Q&A evaluation framework that takes reproducibility as a primary metric. Testing small open-source LLMs, they found that even with low-temperature parameters (T=0.2), the models' maximum self-consistency was only 20%, revealing safety risks that single-round benchmark tests fail to detect. This framework provides a more comprehensive model evaluation standard for the medical AI field.

Section 02

Special Challenges for Medical AI: The Need for Consistency

Online health communities are major channels through which users access medical information, but they are vulnerable to misinformation. Traditional evaluations focus only on single-inference accuracy and ignore whether a model answers the same question consistently across runs. In medical scenarios, this instability may leave patients with conflicting advice, delaying treatment or causing anxiety.

Section 03

Evaluation Framework and Experimental Design

Evaluation Framework

  • Quality Dimension: Includes eight metrics such as BERTScore, ROUGE-L, and LLM-as-judge scores
  • Reproducibility Dimension: Calculates internal consistency metrics via repeated inference (10 runs per question)
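The internal-consistency metric in the reproducibility dimension can be sketched as exact-match agreement across the repeated runs. The exact-match criterion and the function name here are illustrative assumptions, not the paper's definitions:

```python
from collections import Counter

def self_consistency(responses: list[str]) -> float:
    """Fraction of runs that produced the modal (most common) answer.

    Exact string matching is an assumption; the study may define
    consistency under a softer similarity criterion.
    """
    if not responses:
        return 0.0
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)
```

A self-consistency of 1.0 means all 10 runs agreed verbatim; the study's reported maximum of 0.2 means at best 2 of 10 runs matched.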

Experimental Setup

  • Dataset: 50 medical questions from MedQuAD, yielding 1,500 responses in total (50 questions × 10 runs × 3 models)
  • Models: Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B
  • Parameters: Low temperature (T=0.2), chosen to encourage near-deterministic outputs
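The repeated-inference setup above can be sketched as a small harness: 10 runs per question at T=0.2. Here `generate` stands in for any local inference call (e.g. a llama.cpp or vLLM wrapper); its name and signature are assumptions, not the paper's API:

```python
# Hypothetical harness for the repeated-inference experiment.
RUNS = 10          # repeated inferences per question (from the paper)
TEMPERATURE = 0.2  # low temperature used in the study

def collect_runs(generate, question: str, runs: int = RUNS) -> list[str]:
    """Query the same model repeatedly with identical settings,
    returning one response string per run."""
    return [generate(question, temperature=TEMPERATURE) for _ in range(runs)]
```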

Section 04

Experimental Findings: Striking Consistency Defects

  • Low Self-Consistency: The best self-consistency among the three models was only 0.2, i.e. at most a 20% chance of fully consistent answers to the same question across repeated runs
  • High Output Uniqueness: 87% to 97% of outputs were unique, appearing only once across runs
  • Challenges the single-round benchmark paradigm: A high score in a single test does not guarantee reliability in real-world deployment
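The uniqueness figure above can be computed as the share of runs whose output string appeared exactly once; the exact-string definition is an assumption on my part, since the paper may count uniqueness under a softer similarity criterion:

```python
from collections import Counter

def uniqueness(responses: list[str]) -> float:
    """Share of runs whose output appeared exactly once.

    A value near 1.0 (as in the 87%-97% finding) means almost every
    repeated run produced a different answer string.
    """
    if not responses:
        return 0.0
    counts = Counter(responses)
    singletons = sum(c for c in counts.values() if c == 1)
    return singletons / len(responses)
```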
Section 05

Model Comparison: Counterintuitive Results

MedGemma 1.5 4B, despite clinical fine-tuning, underperformed the larger general-purpose models (Llama 3.1 8B, Gemma 3 12B) in both quality and reproducibility. Note, however, that MedGemma is also the smallest model by parameter count, so the comparison cannot separate the effect of domain fine-tuning from that of model scale; controlled experiments would be needed to disentangle the two.

Section 06

Industry Implications: Redefining Evaluation Standards

  1. Reproducibility should be a primary metric: Medical LLMs must demonstrate stable outputs across multiple runs
  2. Single-round tests have blind spots: Statistical properties of multiple samples need to be considered
  3. Temperature parameters are not a panacea: Low temperatures do not guarantee output consistency; the source of randomness needs to be deeply understood
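The third point can be illustrated with plain softmax temperature scaling: even at T=0.2, two near-tied token logits still split the probability mass, so repeated sampling can diverge. The logit values below are toy numbers, not from the study:

```python
import math

def softmax_with_temperature(logits: list[float], t: float) -> list[float]:
    """Numerically stable softmax over logits divided by temperature t."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two near-tied tokens: at T=0.2 the runner-up still keeps substantial
# probability (~38% here), so repeated sampling need not agree.
probs = softmax_with_temperature([2.0, 1.9, 0.0], t=0.2)
```

Low temperature sharpens the distribution but only greedy decoding (or a fixed seed) makes sampling deterministic, and even then batching and GPU kernel nondeterminism can introduce variation.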

Section 07

Practical Applications and Open-Source Contributions

  • The research team open-sourced the complete experimental code and data pipeline so practitioners can reproduce or extend the evaluation framework
  • The framework gives institutions a systematic model-selection process: evaluate quality and reproducibility together to avoid being misled by a single metric
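A minimal sketch of such a two-axis selection process, ranking candidates by a weighted combination of quality and reproducibility. The 50/50 weighting is illustrative, not a recommendation from the paper:

```python
def rank_models(models: dict[str, tuple[float, float]], w: float = 0.5):
    """Rank models by w * quality + (1 - w) * reproducibility.

    `models` maps a model name to (quality, reproducibility), both
    assumed normalized to [0, 1]. Returns (name, score) pairs, best first.
    """
    scored = {name: w * q + (1 - w) * r for name, (q, r) in models.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

With this weighting, a model that scores high on quality but low on reproducibility can lose to a more modest but stable one, which is exactly the failure mode single-metric selection misses.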

Section 08

Conclusion: Reliability is the Bottom Line for Medical LLMs

As LLM applications in the medical field expand, reliability requirements are increasing. This study incorporates reproducibility into core metrics, establishing a new evaluation standard. In medical scenarios, consistency is not an extra benefit but a fundamental requirement related to life and health.