Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios: A Practical Framework with Reproducibility as the Core Metric

This study proposes a practical open-source framework for evaluating small, locally deployable open-source LLMs on medical Q&A tasks. The findings show that even with low-temperature sampling (T=0.2), the highest self-consistency any model achieves across repeated runs is only 0.20, and 87-97% of outputs are unique. Single-run benchmark tests miss this safety gap entirely.

Tags: Medical AI, LLM evaluation, Reproducibility, MedQuAD, Medical Q&A, Model consistency, Open-source framework

Published 2026-04-12 16:56 | Recent activity 2026-04-24 17:56 | Estimated read 6 min

Section 01

Introduction: Evaluation of Small Open-Source LLMs in Medical Q&A Scenarios

This study proposes a practical open-source evaluation framework for medical Q&A scenarios, with reproducibility as the core metric. Key findings: even with low-temperature sampling (T=0.2), the highest self-consistency among the small open-source LLMs tested is only 0.20, and 87-97% of outputs are unique. Traditional single-run benchmark tests do not capture this safety gap. The framework measures both consistency and accuracy, targets real-world deployment scenarios, and all code and data are open source.

Section 02

Research Background: Special Needs for Consistency and Safety in Medical AI

Special Requirements for Medical AI

In medical Q&A, consistency, interpretability, safety, and accuracy are equally important; a model whose outputs are unstable cannot serve as a reliable tool.

Challenges in Online Health Communities

Platforms like Reddit are prone to misinformation, so LLMs deployed there must meet a higher bar for consistency and correctness.

Limitations of Existing Evaluations

Traditional evaluations focus only on single-run accuracy and ignore output variability, safety boundaries, and clinical practicality.

Section 03

Evaluation Framework: Multi-Dimensional Metrics and the Core Role of Reproducibility

Core Design Philosophy

Treat reproducibility as a first-class metric, following three principles: multi-dimensional evaluation, practical orientation, and open-source availability.

Metric System

Semantic Quality: BERTScore, ROUGE-L, LLM-as-Judge
Reproducibility: Self-consistency (similarity across multiple outputs), Output Uniqueness (proportion of distinct outputs)
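
The two reproducibility metrics can be sketched in a few lines. This is a minimal illustration, not the framework's actual implementation: the source does not say which similarity function self-consistency uses, so the character-level ratio from Python's `difflib` is a stand-in assumption, and the whitespace and case normalization in `output_uniqueness` is likewise assumed. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def output_uniqueness(outputs):
    """Proportion of distinct outputs among repeated runs of one question.

    Assumption: outputs are compared after lowercasing and collapsing
    whitespace; the source does not specify the comparison rule.
    """
    normalized = [" ".join(o.lower().split()) for o in outputs]
    return len(set(normalized)) / len(normalized)


def self_consistency(outputs, similarity=None):
    """Mean pairwise similarity across repeated runs of one question.

    Assumption: the default similarity is difflib's character-level ratio,
    used here only as a placeholder for whatever the framework uses.
    """
    if similarity is None:
        similarity = lambda a, b: SequenceMatcher(None, a, b).ratio()
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # A single output is trivially consistent with itself.
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


runs = ["Rest and fluids.", "Rest and fluids.", "See a doctor.", "Take ibuprofen."]
print(output_uniqueness(runs))   # 3 distinct outputs out of 4 runs
print(self_consistency(runs))    # mean similarity over the 6 pairs
```

Passing a different `similarity` callable (for example, one backed by sentence embeddings) would swap in a semantic comparison without changing the rest of the calculation.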

Experimental Setup

Evaluate Llama3.1 8B, Gemma3 12B, and MedGemma1.5 4B on the MedQuAD dataset (50 questions), with 10 runs per question (total 1500 responses) and a sampling temperature of T=0.2.
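
The run grid described above can be sketched as a loop over models, questions, and repeated runs. This is only an illustration under stated assumptions: `collect_responses` and `fake_generate` are hypothetical names, and `fake_generate` stands in for a real local inference call that the source does not describe.

```python
from itertools import product

MODELS = ["Llama3.1 8B", "Gemma3 12B", "MedGemma1.5 4B"]
N_QUESTIONS = 50
RUNS_PER_QUESTION = 10
TEMPERATURE = 0.2


def collect_responses(generate, questions, models=MODELS,
                      runs=RUNS_PER_QUESTION, temperature=TEMPERATURE):
    """Return {(model, question_index): [response, ...]} for every pair."""
    responses = {}
    for model, (q_idx, question) in product(models, enumerate(questions)):
        responses[(model, q_idx)] = [
            generate(model, question, temperature) for _ in range(runs)
        ]
    return responses


def fake_generate(model, question, temperature):
    # Hypothetical placeholder; a real setup would call the local model here.
    return f"{model} answer to: {question}"


questions = [f"question {i}" for i in range(N_QUESTIONS)]
responses = collect_responses(fake_generate, questions)
total = sum(len(r) for r in responses.values())
print(total)  # 3 models x 50 questions x 10 runs = 1500
```

Keeping all ten responses per (model, question) pair, rather than only an aggregate score, is what makes the reproducibility metrics computable afterwards.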

Section 04

Key Findings: Severe Reproducibility Crisis Even Under Low-Temperature Sampling

Reproducibility Crisis

Even at T=0.2, the best self-consistency score among the evaluated models is only 0.20, and 87-97% of outputs are unique. Traditional evaluations do not capture this safety gap.

Model Comparison

MedGemma1.5 4B (clinically fine-tuned) performs worse than the larger general-purpose models (Llama3.1 8B, Gemma3 12B). This comparison confounds domain fine-tuning with model scale, so the gap cannot be attributed to either factor alone.

Temperature Impact

Outputs remain highly variable at T=0.2, indicating that randomness is inherent to how these models generate text; medical applications require stronger mechanisms for deterministic output.

Section 05

Implications for Medical AI: Re-thinking Safety and Deployment Recommendations

Safety Considerations

Consistency equals safety: inconsistent outputs may lead to conflicting clinical recommendations.
Uncertainty needs to be quantified, and decision-making should keep a human in the loop.

Upgrading Evaluation Standards

Multi-run evaluations should become standard: report confidence intervals instead of single-point estimates, and pay attention to worst-case behavior.
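
One way to report an interval rather than a single number is a percentile bootstrap over per-question scores. The sketch below is an assumption about how such reporting could be done, not the study's method; `bootstrap_ci` and `worst_case` are hypothetical helpers, and the example scores are made up for illustration.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)  # fixed seed so the report itself is repeatable
    n = len(scores)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lower, upper


def worst_case(scores):
    """Lowest per-question score, reported alongside the mean."""
    return min(scores)


# Illustrative per-question consistency scores (not data from the study).
example_scores = [0.0, 0.25, 0.5, 0.75, 1.0, 0.25, 0.5, 0.25]
mean, lower, upper = bootstrap_ci(example_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}], worst={worst_case(example_scores)}")
```

Reporting the minimum next to the interval keeps a single badly inconsistent question from being averaged away.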

Deployment Recommendations

Integrate multiple models, add output validation layers, user warning mechanisms, and continuous consistency monitoring.
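
As a rough sketch of what an output validation layer might look like, the function below samples several responses to the same question and only returns an answer when they agree closely, otherwise flagging the case for human review. Everything here is an assumption for illustration: the function names, the 0.8 threshold, and the use of `difflib` character similarity are not taken from the source.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(outputs):
    """Average character-level similarity over all pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def validate_or_escalate(outputs, threshold=0.8):
    """Release an answer only if repeated samples agree; else escalate.

    Assumption: `threshold` is an arbitrary illustrative cutoff.
    """
    score = mean_pairwise_similarity(outputs)
    if score >= threshold:
        return {"status": "answer", "text": outputs[0], "consistency": score}
    return {"status": "needs_human_review", "text": None, "consistency": score}


stable = ["Drink fluids and rest."] * 3
unstable = ["take aspirin daily", "zzzz qqqq xxxx"]
print(validate_or_escalate(stable)["status"])
print(validate_or_escalate(unstable)["status"])
```

Logging the `consistency` value for every request would also give a simple signal for continuous consistency monitoring over time.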

Section 06

Methodology and Open Source: Reusable Pipelines and Community Contributions

Methodological Contributions

Provide a reproducible and scalable evaluation process, and establish a workflow for selecting models against explicit criteria.

Open Source Contributions

All code and data (evaluation scripts, metric implementations, visualization tools, etc.) have been open-sourced on GitHub for community reuse and extension.

Section 07

Limitations and Future Directions: Expansion Opportunities in Model Scale, Datasets, etc.

Current Limitations

Only small models (4B-12B), the MedQuAD dataset, and English scenarios are evaluated; large models, other datasets, and multilingual scenarios need to be verified.

Future Directions

Explore consistency improvement techniques, uncertainty calibration, domain adaptation, and real-time monitoring systems.

Section 08

Conclusion: Reproducibility is Key to Reliable Deployment of Medical AI

This study sounds an alarm for the responsible deployment of medical AI: high accuracy may mask reproducibility issues, which are critical in the high-risk medical field. Treating reproducibility as a core metric can build more reliable medical AI systems, and such evaluation frameworks are essential tools for patient safety.
