
Safety and Accuracy of Clinical Large Language Models Follow Different Scaling Laws

This study proposes the SaFE-Scale framework and the RadSaFE-200 benchmark, and systematically evaluates the safety of 34 clinical LLMs under six deployment conditions. Key findings: higher accuracy does not automatically bring better safety; high-quality evidence improves safety the most, while standard RAG and agent-based RAG fail to reproduce that safety benefit.

Clinical LLM, AI safety, medical AI, scaling laws, RAG, evidence quality, risk assessment, radiology

Published 2026-05-06 01:57 | Recent activity 2026-05-06 10:34 | Estimated read 8 min


Section 01

Safety and Accuracy of Clinical LLMs Follow Different Scaling Laws (Introduction)

This study proposes the SaFE-Scale framework and the RadSaFE-200 benchmark, and systematically evaluates the safety of 34 clinical LLMs under six deployment conditions. Core findings: higher accuracy does not automatically bring better safety; high-quality evidence improves safety the most, while standard RAG and agent-based RAG fail to reproduce that safety benefit. In medicine, AI safety is about controlling extreme risks and avoiding confident errors, not about average accuracy.

Section 02

Core Paradox of Clinical AI: Accuracy ≠ Safety

The default assumption in medical AI deployment is that scaling up model size, context length, and similar factors improves accuracy and therefore safety. This assumption is flawed for two reasons. First, medical safety is about extreme risks (e.g., errors in high-risk scenarios), not average performance. Second, errors are asymmetric: a confident error is more dangerous than an uncertain one, because a doctor may accept it without question, with serious consequences.

Section 03

SaFE-Scale Framework and RadSaFE-200 Benchmark: New Tools for Safety Evaluation

SaFE-Scale Framework

Multidimensional scaling strategies: Five dimensions are examined: model size (7B→70B+), context length (4K→128K+), retrieval complexity, evidence quality (clean/conflicting), and reasoning computation (standard/extended).
Safety metrics: Four specialized metrics are defined (high-risk error rate, unsafe answer rate, evidence contradiction rate, and dangerous overconfidence rate), which capture how dangerous an answer is and how certain the model is about it.

RadSaFE-200 Benchmark

Clinician-annotated: 200 questions designed and reviewed by practicing radiologists, based on real scenarios.
Multi-level evidence: Provides three types—clean evidence (consistently high-quality), conflicting evidence (contradictory and misleading), and no evidence (closed-book).
Fine-grained labels: Each option is annotated for correctness, whether it is a high-risk error, etc., supporting precise calculation of safety metrics.
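As an illustration of how option-level labels could feed the four safety metrics, here is a minimal Python sketch. The summary does not give the benchmark's exact metric definitions, so the field names, the `GradedAnswer` type, and the 0.8 confidence threshold are all assumptions made for this example, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GradedAnswer:
    """One model answer after lookup against the option-level labels (hypothetical schema)."""
    correct: bool
    high_risk_error: bool        # chosen option is annotated as a high-risk error
    unsafe: bool                 # chosen option is annotated as unsafe
    contradicts_evidence: bool   # answer conflicts with the supplied evidence
    confidence: float            # model-reported confidence in [0, 1]


def safety_metrics(answers: List[GradedAnswer],
                   confidence_threshold: float = 0.8) -> Dict[str, float]:
    """Return accuracy and four safety rates as fractions of all answers."""
    n = len(answers)
    if n == 0:
        raise ValueError("need at least one graded answer")

    def rate(pred: Callable[[GradedAnswer], bool]) -> float:
        return sum(1 for a in answers if pred(a)) / n

    return {
        "accuracy": rate(lambda a: a.correct),
        "high_risk_error_rate": rate(lambda a: a.high_risk_error),
        "unsafe_answer_rate": rate(lambda a: a.unsafe),
        "evidence_contradiction_rate": rate(lambda a: a.contradicts_evidence),
        # Assumed definition: a high-risk error stated with high confidence.
        "dangerous_overconfidence_rate": rate(
            lambda a: a.high_risk_error and a.confidence >= confidence_threshold
        ),
    }


# Made-up example data: two correct answers, one harmless miss, one confident high-risk miss.
answers = [
    GradedAnswer(True, False, False, False, 0.90),
    GradedAnswer(False, True, True, True, 0.95),
    GradedAnswer(False, False, False, False, 0.40),
    GradedAnswer(True, False, False, False, 0.70),
]
metrics = safety_metrics(answers)
print(metrics)
```

In this toy data both wrong answers lower accuracy equally, but only one of them counts toward the safety rates, which is the distinction the labels are meant to support.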

Section 04

Key Experimental Findings: Clean Evidence is the Strongest Factor for Safety Improvement; RAG has Limited Effect

Key Finding 1: Clean evidence significantly improves safety

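With clean evidence, accuracy rises from 73.5% to 94.1% (+20.6 percentage points); the high-risk error rate falls from 12.0% to 2.6% (-78% relative); the evidence contradiction rate falls from 12.7% to 2.3% (-82% relative); and the dangerous overconfidence rate falls from 8.0% to 1.6% (-80% relative).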
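The bracketed figures mix two kinds of change: the accuracy gain is an absolute difference in percentage points, while the three safety figures are relative reductions. A short Python check of the arithmetic, using only the numbers quoted above (the function name is chosen for this example):

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a percentage of `before`."""
    return (after - before) / before * 100.0


# Rates quoted in the text, in percent: (before, after).
reported = {
    "high-risk error rate": (12.0, 2.6),
    "evidence contradiction rate": (12.7, 2.3),
    "dangerous overconfidence rate": (8.0, 1.6),
}

for name, (before, after) in reported.items():
    print(f"{name}: {relative_change(before, after):+.0f}%")

# Accuracy is reported as an absolute difference in percentage points.
accuracy_gain_points = 94.1 - 73.5
print(f"accuracy: {accuracy_gain_points:+.1f} points")
```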

Key Finding 2: RAG does not automatically improve safety

Standard RAG improves accuracy but yields only limited safety gains. Agent-based RAG achieves higher accuracy and fewer evidence contradictions, yet its high-risk error and overconfidence rates remain elevated (attributed to "reasoning drift").

Key Finding 3: Limitations of maximum context and reasoning computation

Maximum context does not improve safety but increases latency; extended reasoning (chain-of-thought, etc.) brings only limited benefits.

Key Finding 4: Error concentration effect

A small number of "difficult" questions account for most of the safety risk, which makes average metrics misleading.
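One way to make the concentration effect visible is to ask what share of all high-risk errors comes from the k worst questions. The Python sketch below uses invented per-question counts purely for illustration; they are not data from the study, and `top_k_share` is a helper named for this example.

```python
from typing import Dict


def top_k_share(errors_per_question: Dict[str, int], k: int) -> float:
    """Fraction of all high-risk errors that come from the k worst questions."""
    total = sum(errors_per_question.values())
    if total == 0:
        return 0.0
    worst = sorted(errors_per_question.values(), reverse=True)[:k]
    return sum(worst) / total


# Hypothetical counts: number of models that made a high-risk error on each question.
counts = {"q1": 0, "q2": 1, "q3": 0, "q4": 12, "q5": 0,
          "q6": 9, "q7": 1, "q8": 0, "q9": 0, "q10": 1}

share = top_k_share(counts, k=2)
mean_errors = sum(counts.values()) / len(counts)

print(f"mean high-risk errors per question: {mean_errors}")
print(f"share from the 2 worst questions: {share:.1%}")
```

In this made-up case the mean looks modest, while two of the ten questions carry almost all of the risk, which is why a per-question breakdown is more informative than an average.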

Section 05

In-depth Analysis: Three Mechanisms for the Decoupling of Safety and Accuracy

Mechanism 1: Confident errors

Confident errors arise from training data bias, side effects of prompt engineering (prompts that encourage confidence), and confirmation bias (seeking evidence that supports the initial judgment).

Mechanism 2: Evidence misinterpretation

The model ignores subtle constraints in medical evidence (e.g., "drug Y is preferred under condition X" is generalized to all cases).

Mechanism 3: Traps of complex reasoning

Multi-step reasoning in agent-based RAG introduces more failure points (small errors are amplified and evidence is overinterpreted).

Section 06

Practical Implications: Four Recommendations for Building Safe Clinical AI

Prioritize investment in evidence quality: Establish high-quality clinical knowledge bases, conduct strict audits, update regularly, and annotate confidence levels and applicable scopes.
Design safety-aware prompts: Require models to express uncertainty, check consistency with the evidence, and perform additional verification for high-risk decisions.
Implement layered safety mechanisms: Input layer identifies high-risk queries; processing layer uses conservative strategies; output layer performs safety checks; human-machine collaboration layer involves expert review.
Focus on worst-case scenarios: Analyze which types of questions produce high-risk errors and make targeted improvements.
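The layered mechanism in the third recommendation could be wired together roughly as follows. This is a minimal Python sketch under stated assumptions: the keyword list, the `Draft` and `Decision` types, the `model` callable signature, and the 0.8 confidence threshold are all hypothetical and stand in for whatever a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative trigger terms only; a real input layer would need a clinically validated classifier.
HIGH_RISK_TERMS = ("pneumothorax", "hemorrhage", "dissection")


@dataclass
class Draft:
    text: str
    confidence: float               # model-reported confidence in [0, 1]
    consistent_with_evidence: bool  # result of a separate consistency check


@dataclass
class Decision:
    text: str
    needs_expert_review: bool
    reasons: List[str]


def is_high_risk(query: str) -> bool:
    """Input layer: flag queries that mention a high-risk finding."""
    q = query.lower()
    return any(term in q for term in HIGH_RISK_TERMS)


def answer_with_layers(query: str, evidence: str,
                       model: Callable[[str, str, bool], Draft],
                       min_confidence: float = 0.8) -> Decision:
    reasons: List[str] = []

    # Input layer: identify high-risk queries.
    high_risk = is_high_risk(query)
    if high_risk:
        reasons.append("high-risk query")

    # Processing layer: ask the model to use a conservative strategy when flagged.
    draft = model(query, evidence, high_risk)

    # Output layer: safety checks on the draft answer.
    if not draft.consistent_with_evidence:
        reasons.append("answer conflicts with evidence")
    if draft.confidence < min_confidence:
        reasons.append("low confidence")

    # Human-machine collaboration layer: any flag routes the case to an expert.
    return Decision(draft.text, needs_expert_review=bool(reasons), reasons=reasons)


def stub_model(query: str, evidence: str, conservative: bool) -> Draft:
    """Stand-in for a real LLM call, used only to exercise the pipeline."""
    return Draft("No acute finding.", confidence=0.9, consistent_with_evidence=True)


routine = answer_with_layers("Follow-up of stable lung nodule", "report text", stub_model)
urgent = answer_with_layers("Possible pneumothorax after biopsy", "report text", stub_model)
print(routine.needs_expert_review, urgent.needs_expert_review, urgent.reasons)
```

The design choice here is that each layer can only add a reason for review and never remove one, so a confident, evidence-consistent answer to a high-risk query still reaches an expert.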

Section 07

Conclusion and Outlook: Safety Requires Active Design; Future Directions are Clear

Core Conclusion

The safety of clinical LLMs is an actively designed deployment attribute, not a passive result of scaling. The traditional AI development paradigm (bigger model = safer) is dangerous in the medical field; safety needs to be clearly defined, measured, and optimized.

Limitations

Domain limitation: the study focuses on radiology. Model limitation: the evaluation is based on open-source models. Time limitation: the findings need regular re-verification.

Future Directions

Expand to specialized fields like emergency medicine and oncology; study dynamic safety (maintaining safety when models are updated); develop specialized safety training methods.

Key Message: Do not assume larger models are safer; instead, invest in evidence infrastructure, design safe workflows, and focus on worst-case scenarios.