Section 01
【Main Floor/Introduction】Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric
This study proposes a practical open-source evaluation framework for medical Q&A scenarios, with reproducibility as the core metric. Key finding: even with low-temperature sampling (T=0.2), the highest self-consistency observed among small open-source LLMs is only 0.20, and 87-97% of outputs are unique. Traditional single-run benchmark tests overlook this safety gap. The framework measures both consistency and accuracy, is intended for real-world deployment scenarios, and all code and data are open-source.
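To make the two reproducibility numbers concrete, here is a minimal sketch of how they could be computed from repeated runs of the same prompt. The definitions are assumptions, since the summary above does not spell them out: self-consistency is taken as the share of runs that agree with the most frequent answer, and the unique-output rate as the share of distinct answers after light normalization. The function names (`normalize`, `unique_output_rate`, `self_consistency`) are hypothetical and not taken from the framework's code.

```python
from collections import Counter


def normalize(text):
    # Assumed normalization: lowercase and collapse whitespace, so that
    # trivially different strings count as the same answer.
    return " ".join(text.lower().split())


def unique_output_rate(outputs):
    # Share of distinct answers among all runs of one prompt.
    if not outputs:
        raise ValueError("outputs must not be empty")
    normalized = [normalize(o) for o in outputs]
    return len(set(normalized)) / len(normalized)


def self_consistency(outputs):
    # Assumed definition: share of runs that match the most frequent answer.
    if not outputs:
        raise ValueError("outputs must not be empty")
    normalized = [normalize(o) for o in outputs]
    modal_count = Counter(normalized).most_common(1)[0][1]
    return modal_count / len(normalized)


# Five hypothetical runs of the same medical question at a fixed temperature.
runs = ["Yes", "yes ", "No", "Maybe", "It depends"]
print(unique_output_rate(runs))
print(self_consistency(runs))
```

Under these assumed definitions, a single-run benchmark would report only one of the five answers and hide the disagreement among them, which is the gap the post describes. The actual framework may use a different matching rule (for example, semantic similarity instead of exact string match).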