Zing Forum

Clinical AI Safety Evaluation Framework: When Large Models 'Answer Correctly' But 'Act Wrongly'

A study by Wrexham Glyndwr University reveals a striking gap in medical LLMs: diagnostic accuracy reaches 93.3%, but clinical safety pass rate is only 6.7%, with a hallucination rate as high as 76.7%. An 11-indicator comprehensive evaluation framework is open-sourced.

Tags: Medical AI, Clinical Safety, LLM Evaluation, Hallucination Detection, Acute Chest Pain, NICE Guidelines, Diagnostic Accuracy, AI Safety Framework
Published 2026-04-13 22:38 · Recent activity 2026-04-13 22:49 · Estimated read: 5 min

Section 01

[Introduction] Medical LLM Diagnostic Accuracy ≠ Safety! Only 6.7% Safety Pass Rate Behind 93.3% Accuracy

A study by Wrexham Glyndwr University reveals a striking gap in medical LLMs: For acute chest pain cases, Gemini 3.1 Pro achieves a diagnostic accuracy of 93.3%, but its clinical safety pass rate is only 6.7%, with a hallucination rate of 76.7%. The research team open-sourced a comprehensive evaluation framework containing 11 indicators, emphasizing that clinical safety requires balancing results and reasoning processes.


Section 02

Background: Blind Spot in Traditional Evaluation—Diagnostic Accuracy ≠ Clinical Safety

Traditional medical AI evaluations often focus on a single indicator (e.g., diagnostic accuracy), but the study found that even when an LLM reaches the correct diagnosis, its reasoning process may be riddled with errors and hallucinations (a 76.7% hallucination rate in this study), creating serious clinical safety risks. For example, of the 30 acute chest pain cases, only 2 passed the clinical safety audit, revealing that traditional evaluations severely overestimate the actual safety of models.


Section 03

Methodology: 11-Indicator Comprehensive Evaluation Framework + Dual-Model Experimental Design

The study designed 11 indicators (grouped into three categories: outcome, process, and comprehensive audit), covering diagnostic accuracy, under-triage rate, red flag recognition rate, response stability, hallucination rate, the clinical audit gate, and more. The experiment used 30 synthetic acute chest pain cases (including trap cases) and adopted a dual-model design: Gemini 3.1 Pro as the tested model and GPT-5.2 as the judge model, scoring in deterministic mode to reduce evaluation bias.
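The dual-model loop described above can be sketched as follows. This is a minimal illustration, not the released pipeline: the `Vignette` fields and the judge's score keys are assumptions, and the two callables stand in for real API calls (the study used Gemini 3.1 Pro as the tested model and GPT-5.2 as the judge).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Vignette:
    case_id: str
    presentation: str        # synthetic acute chest pain scenario
    gold_diagnosis: str
    red_flags: list[str]

def evaluate_vignette(vignette: Vignette,
                      tested_model: Callable[[str], str],
                      judge_model: Callable[[str], dict]) -> dict:
    """One pass of the dual-model design: the tested model answers,
    then the judge model scores the answer against the gold standard."""
    response = tested_model(vignette.presentation)
    # The judge sees the case, the gold standard, and the response,
    # and returns per-indicator scores (run in deterministic mode).
    judge_prompt = (
        f"Case: {vignette.presentation}\n"
        f"Gold diagnosis: {vignette.gold_diagnosis}\n"
        f"Red flags: {', '.join(vignette.red_flags)}\n"
        f"Model response: {response}\n"
        "Score: diagnosis_correct, red_flags_recognised, "
        "hallucination, audit_pass."
    )
    scores = judge_model(judge_prompt)
    return {"case_id": vignette.case_id, "response": response, **scores}
```

Keeping the judge deterministic (temperature 0) makes repeated scoring runs comparable, which is what lets the framework separate outcome metrics from process metrics per case.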


Section 04

Evidence: Three Key Findings Expose Safety Risks of Medical LLMs

  1. Outcome-process gap: 93.3% diagnostic accuracy vs. 6.7% safety pass rate, a gap of 86.6 percentage points.
  2. Prevalence of hallucinations: 76.7% of cases contain fabricated clinical facts.
  3. Dangerous success (FLAG): many cases have a correct diagnosis but flawed reasoning, which is deceptive and can easily mislead clinical decisions.
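The headline numbers reduce to a small aggregation over per-case judge scores. The sketch below is an illustrative reconstruction, not the study's actual code; field names such as `audit_pass` are assumptions.

```python
def summarise(scored_cases: list[dict]) -> dict:
    """Aggregate per-case judge scores into the headline metrics.
    Field names are illustrative, not the study's actual schema."""
    n = len(scored_cases)
    correct = sum(c["diagnosis_correct"] for c in scored_cases)
    safe = sum(c["audit_pass"] for c in scored_cases)
    halluc = sum(c["hallucination"] for c in scored_cases)
    # "Dangerous success" (FLAG): right answer, reasoning fails audit.
    flagged = sum(c["diagnosis_correct"] and not c["audit_pass"]
                  for c in scored_cases)
    return {
        "diagnostic_accuracy": correct / n,
        "safety_pass_rate": safe / n,
        "hallucination_rate": halluc / n,
        "dangerous_success_rate": flagged / n,
    }
```

On a set of 30 cases with 28 correct diagnoses, 2 audit passes, and 23 hallucinations, this reproduces the reported 93.3% / 6.7% / 76.7% figures.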

Section 05

Open-Source Resources: Complete Experimental Pipeline Open to Support Reproduction and Expansion

The research team open-sourced all experimental resources, including 30 clinical case JSON files, scoring result data (CSV/Excel), original model responses, evaluation scripts (evaluate_vignettes.py and score_results.py), case generation tools, and the pre-registration plan, enabling other researchers to reproduce the results, test other models, or extend the framework to other clinical fields.
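Assuming one JSON file per case and a flat CSV of judge scores (the released artifacts' actual schema may differ), loading them might look like:

```python
import csv
import json
from pathlib import Path

def load_cases(case_dir: str) -> list[dict]:
    """Load clinical vignette JSON files, one case per file."""
    return [json.loads(p.read_text(encoding="utf-8"))
            for p in sorted(Path(case_dir).glob("*.json"))]

def load_scores(csv_path: str) -> list[dict]:
    """Load the judge's scoring results from a CSV export."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

From here, reproduction is a matter of re-running the scoring over the released model responses and comparing aggregates against the published CSV.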


Section 06

Implications: Medical AI Development Requires Multi-Dimensional Evaluation and Strict Auditing

Implications for developers and regulators:
  1. Introduce multi-dimensional indicators (process quality, consistency, hallucination detection, etc.).
  2. Require a comprehensive audit before deployment (e.g., the M11 clinical audit gate).
  3. Treat LLMs as second-opinion tools that require human review.
  4. Make hallucination mitigation a top R&D priority.
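A strict pre-deployment gate in the spirit of the comprehensive audit can be sketched as a conjunction of outcome and process indicators. The indicator names below are illustrative, not the framework's actual M11 definition.

```python
def clinical_audit_gate(case_scores: dict) -> bool:
    """A case passes only if the outcome AND the process are sound:
    correct diagnosis, red flags recognised, no fabricated facts,
    and no under-triage. Indicator names are illustrative."""
    return bool(case_scores["diagnosis_correct"]
                and case_scores["red_flags_recognised"]
                and not case_scores["hallucination"]
                and not case_scores["under_triage"])
```

Because the gate is a conjunction, a single process failure (e.g., one hallucinated fact) fails the case even when the diagnosis is right, which is exactly how 93.3% accuracy can coexist with a 6.7% pass rate.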


Section 07

Limitations and Future Directions: Case Scale to Be Expanded, Framework to Be Applied to More Fields

Study limitations: the number of cases was reduced from 50 to 30 due to time constraints; Gemini API limitations prevented logprob access; and the tested model changed because of access restrictions. Future directions: expand the case library, test more models, apply the framework to fields such as dermatology and radiology, develop medical hallucination mitigation techniques, and explore optimal human-AI collaboration models.