# Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric

> This study proposes a practical open-source framework for evaluating small, locally deployable open-source LLMs on medical Q&A tasks. Even with low-temperature sampling (T=0.2), the highest self-consistency any model achieves across repeated runs is only 0.20, and 87-97% of outputs are unique. Single-run benchmark tests miss this safety gap entirely.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-12T08:56:15.000Z
- Last activity: 2026-04-24T09:56:33.822Z
- Popularity: 86.0
- Keywords: medical AI, LLM evaluation, reproducibility, MedQuAD, medical Q&A, model consistency, open-source framework
- Page link: https://www.zingnex.cn/en/forum/thread/llm-21cd2c18
- Canonical: https://www.zingnex.cn/forum/thread/llm-21cd2c18
- Markdown source: floors_fallback

---

## [Main Post / Introduction] Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric

This study proposes a practical open-source evaluation framework for medical Q&A that treats reproducibility as the core metric. Key finding: even with low-temperature sampling (T=0.2), the highest self-consistency among the small open-source LLMs tested is only 0.20, and 87-97% of outputs are unique. Traditional single-run benchmarks do not capture this safety gap. The framework measures both consistency and accuracy, targets real-world deployment scenarios, and all code and data are open-source.

## Research Background: Special Needs for Consistency and Safety in Medical AI

### Special Requirements for Medical AI
In medical Q&A, consistency, interpretability, safety, and accuracy are equally important; a model whose outputs are unstable cannot serve as a reliable tool.
### Challenges in Online Health Communities
Platforms like Reddit are prone to misinformation, so LLMs deployed there must meet a higher bar for consistency and correctness.
### Limitations of Existing Evaluations
Traditional evaluations focus only on single-run accuracy and ignore output variability, safety boundaries, and clinical practicality.

## Evaluation Framework: Multi-Dimensional Metrics and the Core Role of Reproducibility

### Core Design Philosophy
Treat reproducibility as a first-class metric, guided by three principles: multi-dimensional evaluation, a practical orientation, and open-source availability.
### Metric System
- Semantic Quality: BERTScore, ROUGE-L, LLM-as-Judge
- Reproducibility: Self-consistency (similarity across multiple outputs), Output Uniqueness (proportion of distinct outputs)
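The two reproducibility metrics above can be sketched in a few lines. This is a minimal illustration, not the study's implementation: the summary does not say which similarity function the authors use, so `SequenceMatcher.ratio()` from the standard library stands in for it here, and the `normalize` helper is a hypothetical preprocessing choice.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(text.lower().split())


def self_consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity over all pairs of outputs for one question.

    ASSUMPTION: SequenceMatcher.ratio() is a stand-in similarity; the study's
    actual similarity function is not specified in this summary.
    """
    pairs = list(combinations([normalize(o) for o in outputs], 2))
    if not pairs:
        return 1.0  # zero or one output: there is no disagreement to measure
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def output_uniqueness(outputs: list[str]) -> float:
    """Fraction of outputs that are distinct after normalization."""
    if not outputs:
        raise ValueError("need at least one output")
    return len({normalize(o) for o in outputs}) / len(outputs)
```

Both functions take the list of responses generated for a single question; averaging the per-question values gives a model-level score. A semantic similarity (for example, embedding cosine similarity) could replace the character-level ratio without changing the structure.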
### Experimental Setup
Three models (Llama3.1 8B, Gemma3 12B, and MedGemma1.5 4B) are evaluated on 50 questions from the MedQuAD dataset, with 10 runs per question (3 models × 50 questions × 10 runs = 1,500 responses in total) at a sampling temperature of T=0.2.
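The size of this experiment follows directly from the setup. The sketch below enumerates the run grid to make the arithmetic explicit; the record layout and the names `MODELS` and `build_run_grid` are illustrative assumptions, and the model strings are plain labels rather than exact registry tags.

```python
from itertools import product

# Model names as written in the setup; treat them as labels, not registry tags.
MODELS = ["Llama3.1 8B", "Gemma3 12B", "MedGemma1.5 4B"]
NUM_QUESTIONS = 50
RUNS_PER_QUESTION = 10
TEMPERATURE = 0.2


def build_run_grid():
    """Enumerate one record per (model, question, run) generation call."""
    return [
        {"model": m, "question_id": q, "run": r, "temperature": TEMPERATURE}
        for m, q, r in product(MODELS, range(NUM_QUESTIONS), range(RUNS_PER_QUESTION))
    ]


grid = build_run_grid()
print(len(grid))  # 3 * 50 * 10 = 1500
```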

## Key Findings: Severe Reproducibility Crisis Even Under Low-Temperature Sampling

### Reproducibility Crisis
Even at T=0.2, the highest self-consistency among the models is only 0.20, and 87-97% of outputs are unique. Traditional single-run evaluations do not capture this safety gap.
### Model Comparison
- MedGemma1.5 4B (clinically fine-tuned) performs worse than the larger general-purpose models (Llama3.1 8B, Gemma3 12B). This comparison confounds domain fine-tuning with model scale, so the gap cannot be attributed to either factor alone.
### Temperature Impact
T=0.2 still produces highly variable outputs, indicating that sampling randomness persists even at low temperature; medical applications need stronger mechanisms for determinism.
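One way to see why a low temperature does not guarantee identical outputs is to look at temperature-scaled softmax on a toy example. The logits below are hypothetical and not taken from the study, and the per-step independence is a deliberate simplification.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 is usually handled as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens (not from the study).
toy_logits = [2.0, 1.5, 0.5]
probs = softmax_with_temperature(toy_logits, 0.2)
top = max(probs)

# Chance that every one of n sampled tokens is the top choice, under the
# simplifying assumption that each step has this same distribution.
for n in (1, 50, 200):
    print(n, round(top ** n, 4))
```

Whenever two candidate tokens have close logits, the runner-up keeps a non-trivial probability at T=0.2, and the chance of reproducing an entire multi-sentence answer token for token shrinks with its length.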

## Implications for Medical AI: Re-thinking Safety and Deployment Recommendations

### Safety Considerations
- Consistency is a safety property: inconsistent outputs can give the same patient conflicting clinical recommendations
- Uncertainty should be quantified, and decisions should keep a human in the loop
### Upgrading Evaluation Standards
- Multi-run evaluations should become standard; report confidence intervals instead of single-point estimates, and focus on worst-case scenarios
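Reporting an interval and a worst case rather than a single number can be sketched with a percentile bootstrap. This is one common choice, not necessarily the one the study uses; the function names, the resample count, and the example scores are all illustrative assumptions.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-question scores."""
    if not scores:
        raise ValueError("need at least one score")
    rng = random.Random(seed)  # fixed seed so the report itself is reproducible
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def summarize(scores):
    """Report the mean, a 95% interval, and the worst-case score."""
    lo, hi = bootstrap_ci(scores)
    return {
        "mean": statistics.fmean(scores),
        "ci95": (lo, hi),
        "worst": min(scores),
    }


# Hypothetical per-question self-consistency values, for illustration only.
example_scores = [0.12, 0.20, 0.08, 0.15, 0.18, 0.05, 0.11]
print(summarize(example_scores))
```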
### Deployment Recommendations
Combine multiple models, and add output validation layers, user-facing warnings, and continuous consistency monitoring.
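An output validation layer of this kind might look like the following sketch: sample several answers, measure their agreement, and withhold the answer for human review when agreement is low. Everything here is an assumption for illustration, including the `consistency_gate` name, the `generate` callable, the default `k` and `threshold`, and the use of `SequenceMatcher` as the similarity.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable


def consistency_gate(
    generate: Callable[[str], str],
    question: str,
    k: int = 5,
    threshold: float = 0.8,
) -> dict:
    """Sample k answers and release one only if they agree closely enough.

    ASSUMPTIONS: `generate` is any caller-supplied function wrapping a model;
    k, threshold, and the SequenceMatcher similarity are illustrative choices.
    """
    if k < 2:
        raise ValueError("k must be at least 2 to measure agreement")
    answers = [generate(question) for _ in range(k)]
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in combinations(answers, 2)]
    agreement = sum(sims) / len(sims)
    if agreement >= threshold:
        return {"status": "released", "answer": answers[0], "agreement": agreement}
    # Low agreement: withhold the answer and route to a human reviewer.
    return {"status": "needs_review", "answer": None, "agreement": agreement}
```

Logging the `agreement` value for every request also gives a simple signal for continuous consistency monitoring. Note that high agreement only shows the model is stable, not that the answer is correct, so this gate complements accuracy checks rather than replacing them.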

## Methodology and Open Source: Reusable Pipelines and Community Contributions

### Methodological Contributions
Provide a reproducible, scalable evaluation pipeline and a workflow for deriving model selection criteria from its results.
### Open Source Contributions
All code and data (evaluation scripts, metric implementations, visualization tools, etc.) have been open-sourced on GitHub for community reuse and extension.

## Limitations and Future Directions: Expansion Opportunities in Model Scale, Datasets, etc.

### Current Limitations
The evaluation covers only small models (4B-12B), the MedQuAD dataset, and English-language questions; larger models, other datasets, and multilingual scenarios remain to be tested.
### Future Directions
Explore consistency improvement techniques, uncertainty calibration, domain adaptation, and real-time monitoring systems.

## Conclusion: Reproducibility is Key to Reliable Deployment of Medical AI

This study sounds an alarm for the responsible deployment of medical AI: high accuracy can mask reproducibility problems, which are critical in a high-risk field like medicine. Treating reproducibility as a core metric helps build more reliable medical AI systems, and evaluation frameworks of this kind are essential tools for patient safety.
