Section 01
Safety and Accuracy of Clinical LLMs Follow Different Scaling Laws (Introduction)
This study proposes the SaFE-Scale framework and the RadSaFE-200 benchmark, and uses them to systematically evaluate the safety of 34 clinical LLMs under six deployment conditions. Core findings: higher accuracy does not automatically translate into greater safety; supplying high-quality evidence yields the largest safety improvement, while standard RAG and agent-based RAG fail to reproduce that gain. In medicine, AI safety hinges on controlling extreme risks and avoiding confident errors, not on average accuracy.
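The gap between average accuracy and confident errors can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Answer` type, the `confident_error_rate` helper, the 0.9 confidence threshold, and the toy answers are hypothetical and are not taken from SaFE-Scale or RadSaFE-200, whose own metrics may be defined differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    """One model response (hypothetical structure, for illustration only)."""
    correct: bool
    confidence: float  # self-reported confidence in [0, 1]


def accuracy(answers):
    """Fraction of answers that are correct."""
    return sum(a.correct for a in answers) / len(answers)


def confident_error_rate(answers, threshold=0.9):
    """Fraction of all answers that are wrong AND given with confidence >= threshold.

    The 0.9 threshold is an arbitrary assumption, not a value from the study.
    """
    wrong_and_confident = sum(
        (not a.correct) and a.confidence >= threshold for a in answers
    )
    return wrong_and_confident / len(answers)


# Toy data (invented, not from the study): both models get 8 of 10 answers right.
# Model A is highly confident in its two errors; model B is not.
model_a = [Answer(True, 0.95)] * 8 + [Answer(False, 0.95)] * 2
model_b = [Answer(True, 0.95)] * 8 + [Answer(False, 0.40)] * 2

for name, answers in (("A", model_a), ("B", model_b)):
    print(name, accuracy(answers), confident_error_rate(answers))
```

Under these assumptions the two toy models have identical accuracy but different confident-error rates, which is the kind of distinction an accuracy-only evaluation would miss.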