Zing Forum

Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric

This study proposes a practical open-source framework for evaluating small, locally deployable open-source LLMs on medical Q&A tasks. Even with low-temperature sampling (T=0.2), the highest self-consistency any model reaches across repeated runs is only 0.20, and 87-97% of outputs are unique, a safety gap that single-run benchmark tests miss entirely.

Tags: Medical AI, LLM Evaluation, Reproducibility, MedQuAD, Medical Q&A, Model Consistency, Open-Source Framework
Published 2026-04-12 16:56 | Recent activity 2026-04-24 17:56 | Estimated read 6 min

Section 01

[Original Post / Introduction] Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric

This study proposes a practical open-source evaluation framework for medical Q&A scenarios, with reproducibility as the core metric. Key finding: even with low-temperature sampling (T=0.2), the highest self-consistency of small open-source LLMs is only 0.20, and 87-97% of outputs are unique, a safety gap that traditional single-run benchmark tests ignore. The framework measures both consistency and accuracy and is designed for real-world deployment scenarios. All code and data are open source.

Section 02

Research Background: Special Needs for Consistency and Safety in Medical AI

Special Requirements for Medical AI

In medical Q&A, consistency, interpretability, safety, and accuracy are equally important; a model whose outputs are unstable cannot serve as a reliable tool.

Challenges in Online Health Communities

Platforms like Reddit are prone to misinformation, so LLMs deployed there must meet a higher bar for consistency and correctness.

Limitations of Existing Evaluations

Traditional evaluations focus only on single-run accuracy and ignore output variability, safety boundaries, and clinical practicality.

Section 03

Evaluation Framework: Multi-Dimensional Metrics and the Core Role of Reproducibility

Core Design Philosophy

Treat reproducibility as a first-class metric, guided by three principles: multi-dimensional evaluation, practical orientation, and open-source availability.

Metric System

  • Semantic Quality: BERTScore, ROUGE-L, LLM-as-Judge
  • Reproducibility: Self-consistency (similarity across multiple outputs), Output Uniqueness (proportion of distinct outputs)
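The two reproducibility metrics can be sketched as follows. This is a minimal illustration, not the study's implementation: the post does not say which similarity function the framework uses, so `difflib.SequenceMatcher` stands in here as an assumption (an embedding-based similarity would be a plausible alternative), and the `normalize` helper is hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def self_consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity across all runs for one question (1.0 = identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    # Assumption: character-level similarity as a stand-in for the study's measure.
    scores = [SequenceMatcher(None, normalize(a), normalize(b)).ratio() for a, b in pairs]
    return sum(scores) / len(scores)


def output_uniqueness(outputs: list[str]) -> float:
    """Proportion of distinct outputs among all runs for one question."""
    return len({normalize(o) for o in outputs}) / len(outputs)
```

Under these definitions, ten identical answers give a self-consistency of 1.0 and a uniqueness of 0.1, while ten entirely different answers give a uniqueness of 1.0.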

Experimental Setup

Three models (Llama3.1 8B, Gemma3 12B, and MedGemma1.5 4B) are evaluated on 50 questions from the MedQuAD dataset, with 10 runs per question per model (3 × 50 × 10 = 1,500 responses in total) and a sampling temperature of T=0.2.
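The setup above amounts to a nested loop over models, questions, and runs. The sketch below shows that structure only; `generate` is a hypothetical callable standing in for whatever local inference call the framework actually makes, and `fake_generate` exists purely so the example runs.

```python
from typing import Callable

MODELS = ["Llama3.1 8B", "Gemma3 12B", "MedGemma1.5 4B"]
N_QUESTIONS = 50
N_RUNS = 10
TEMPERATURE = 0.2


def collect_responses(
    questions: list[str],
    generate: Callable[[str, str, float], str],  # hypothetical: (model, question, T) -> answer
) -> dict[tuple[str, int], list[str]]:
    """Return {(model, question_index): [one answer per run]}."""
    responses: dict[tuple[str, int], list[str]] = {}
    for model in MODELS:
        for q_idx, question in enumerate(questions):
            responses[(model, q_idx)] = [
                generate(model, question, TEMPERATURE) for _ in range(N_RUNS)
            ]
    return responses


def fake_generate(model: str, question: str, temperature: float) -> str:
    """Stand-in for a real inference call; returns a fixed string."""
    return f"{model} answer to {question}"


questions = [f"question {i}" for i in range(N_QUESTIONS)]
responses = collect_responses(questions, fake_generate)
total = sum(len(runs) for runs in responses.values())
print(total)  # 3 models x 50 questions x 10 runs = 1500
```

Keeping the runs grouped per (model, question) pair makes it straightforward to compute per-question reproducibility metrics afterwards.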

Section 04

Key Findings: Severe Reproducibility Crisis Even Under Low-Temperature Sampling

Reproducibility Crisis

Even at T=0.2, the highest self-consistency among the evaluated models is only 0.20, and 87-97% of outputs are unique. Traditional single-run evaluations do not capture this safety gap.

Model Comparison

  • MedGemma1.5 4B (clinically fine-tuned) performs worse than the larger general-purpose models (Llama3.1 8B, Gemma3 12B), but this comparison confounds domain fine-tuning with model scale, so the gap cannot be attributed to either factor alone.

Temperature Impact

T=0.2 still produces highly variable outputs, which indicates that lowering the temperature does not by itself remove sampling randomness; medical applications need stronger mechanisms for determinism.
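A small worked example suggests why low temperature alone may not yield repeatable long answers. It is a toy model, not an analysis from the study: the logits are made up, and it assumes every token position has the same distribution and is sampled independently, which real models do not satisfy.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens; the first is the greedy choice.
logits = [2.0, 1.0, 0.5]
answer_length = 150  # hypothetical answer length in tokens

probs = softmax_with_temperature(logits, temperature=0.2)
p_top = probs[0]

# Probability that a sampled answer picks the greedy token at every position,
# under the (strong) assumption of identical, independent positions.
p_same_sequence = p_top ** answer_length

print(f"top-token probability at T=0.2: {p_top:.4f}")
print(f"chance a {answer_length}-token sample matches greedy decoding: {p_same_sequence:.2f}")
```

With these illustrative numbers the top token takes about 99% of the probability mass at each step, yet a 150-token sample matches the greedy answer only about a third of the time, because small per-token deviations compound over the length of the answer.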

Section 05

Implications for Medical AI: Re-thinking Safety and Deployment Recommendations

Safety Considerations

  • Consistency equals safety: inconsistent outputs may lead to conflicting clinical recommendations
  • Need to quantify uncertainty and maintain human-in-the-loop decision-making

Upgrading Evaluation Standards

  • Multi-run evaluations should become standard; report confidence intervals instead of single-point estimates, and focus on worst-case scenarios
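One way to report an interval and a worst case instead of a single number is a percentile bootstrap over per-question scores. The sketch below is an assumption about how this could be done, not the framework's reporting code, and the scores are invented for illustration.

```python
import random
import statistics


def bootstrap_ci(
    values: list[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the report itself is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-question self-consistency scores for one model.
per_question_scores = [0.12, 0.20, 0.08, 0.15, 0.18, 0.05, 0.11, 0.16, 0.09, 0.14]

mean_score = statistics.fmean(per_question_scores)
ci_low, ci_high = bootstrap_ci(per_question_scores)
worst_case = min(per_question_scores)

print(f"mean={mean_score:.3f}  95% CI=({ci_low:.3f}, {ci_high:.3f})  worst={worst_case:.3f}")
```

Reporting the minimum alongside the interval keeps the least consistent question visible, which a mean on its own would hide.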

Deployment Recommendations

Combine multiple models, add output validation layers and user-facing warnings, and monitor consistency continuously.
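Continuous consistency monitoring could look roughly like the rolling check below. The class name, window size, and threshold are all hypothetical choices for illustration; the post does not describe a monitoring implementation.

```python
from collections import deque


class ConsistencyMonitor:
    """Tracks a rolling mean of consistency scores and flags drops below a threshold."""

    def __init__(self, window: int = 100, threshold: float = 0.5) -> None:
        self.scores: deque[float] = deque(maxlen=window)  # oldest scores fall out automatically
        self.threshold = threshold

    def record(self, score: float) -> bool:
        """Store a new score; return True when the rolling mean is below the threshold."""
        self.scores.append(score)
        rolling_mean = sum(self.scores) / len(self.scores)
        return rolling_mean < self.threshold


monitor = ConsistencyMonitor(window=3, threshold=0.5)
alerts = [monitor.record(score) for score in [0.9, 0.2, 0.1, 0.9]]
print(alerts)
```

An alert from a check like this could trigger a user-facing warning or route the question to human review rather than blocking the answer outright.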

Section 06

Methodology and Open Source: Reusable Pipelines and Community Contributions

Methodological Contributions

Provide a reproducible, scalable evaluation process and a workflow for selecting models against explicit criteria.

Open Source Contributions

All code and data (evaluation scripts, metric implementations, visualization tools, etc.) have been open-sourced on GitHub for community reuse and extension.

Section 07

Limitations and Future Directions: Expansion Opportunities in Model Scale, Datasets, etc.

Current Limitations

The evaluation covers only small models (4B-12B), the MedQuAD dataset, and English-language scenarios; larger models, other datasets, and multilingual scenarios remain to be evaluated.

Future Directions

Explore consistency improvement techniques, uncertainty calibration, domain adaptation, and real-time monitoring systems.

Section 08

Conclusion: Reproducibility is Key to Reliable Deployment of Medical AI

This study sounds an alarm for the responsible deployment of medical AI: high accuracy can mask reproducibility problems, which are critical in a high-risk field like medicine. Treating reproducibility as a core metric can help build more reliable medical AI systems, and evaluation frameworks of this kind are essential tools for patient safety.