# Clinical AI Safety Evaluation Framework: When Large Models 'Answer Correctly' But 'Act Wrongly'

> A study by Wrexham Glyndwr University reveals a striking gap in medical LLMs: diagnostic accuracy reaches 93.3%, but the clinical safety pass rate is only 6.7% and the hallucination rate is as high as 76.7%. The team has open-sourced a comprehensive evaluation framework with 11 indicators.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-13T14:38:39.000Z
- Last activity: 2026-04-13T14:49:38.900Z
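- Keywords: medical AI, clinical safety, LLM evaluation, hallucination detection, acute chest pain, NICE guidelines, diagnostic accuracy, AI safety framework
- Page link: https://www.zingnex.cn/en/forum/thread/ai-f28e001b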
- Canonical: https://www.zingnex.cn/forum/thread/ai-f28e001b

---

## [Introduction] Medical LLM Diagnostic Accuracy ≠ Safety! Only 6.7% Safety Pass Rate Behind 93.3% Accuracy

A study by Wrexham Glyndwr University reveals a striking gap in medical LLMs: on acute chest pain cases, Gemini 3.1 Pro achieves a diagnostic accuracy of 93.3%, but its clinical safety pass rate is only 6.7% and its hallucination rate is 76.7%. The research team open-sourced a comprehensive evaluation framework containing 11 indicators and stresses that clinical safety depends on both the final answer and the reasoning that leads to it.

## Background: Blind Spot in Traditional Evaluation—Diagnostic Accuracy ≠ Clinical Safety

Traditional medical AI evaluations often rely on a single indicator such as diagnostic accuracy. The study found that even when an LLM reaches the correct diagnosis, its reasoning can contain errors and hallucinations (76.7% of cases in this study), which creates a high clinical safety risk. Of the 30 acute chest pain cases, only 2 passed the clinical safety audit, which suggests that accuracy-only evaluation severely overestimates how safe a model actually is.

## Methodology: 11-Indicator Comprehensive Evaluation Framework + Dual-Model Experimental Design

The study designed 11 indicators (divided into three categories: outcome, process, and comprehensive audit), covering diagnostic accuracy, under-triage rate, red flag recognition rate, response stability, hallucination rate, clinical audit gate, etc. The experiment used 30 synthetic acute chest pain cases (including trap cases) and adopted a dual-model design: Gemini 3.1 Pro as the tested model, and GPT-5.2 as the judging model (scoring in deterministic mode) to reduce evaluation bias.
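To make the split between outcome, process, and audit indicators concrete, here is a minimal Python sketch of how a few of the indicators named above could be aggregated once the judging model has scored each case. It is not the study's implementation: the `JudgedCase` fields, the `summarize` helper, and the all-checks-must-pass rule in `passes_audit` are assumptions made for illustration, and the released scripts and CSV files may be organized differently.

```python
from dataclasses import dataclass


@dataclass
class JudgedCase:
    """One vignette after the judge has scored the tested model's answer.

    All field names are hypothetical, not taken from the released data.
    """
    case_id: str
    diagnosis_correct: bool  # outcome: final diagnosis matches the reference
    under_triaged: bool      # outcome: urgency rated lower than the reference
    red_flags_found: int     # process: red flags the answer recognized
    red_flags_total: int     # process: red flags present in the vignette
    hallucinated: bool       # process: reasoning contains an invented clinical fact


def rate(flags):
    """Share of True values as a percentage, rounded to one decimal place."""
    flags = list(flags)
    return round(100 * sum(flags) / len(flags), 1) if flags else 0.0


def passes_audit(case):
    """Assumed gate: a case passes only if every individual check passes."""
    return (
        case.diagnosis_correct
        and not case.under_triaged
        and case.red_flags_found == case.red_flags_total
        and not case.hallucinated
    )


def summarize(cases):
    """Aggregate per-case judgments into headline percentages."""
    found = sum(c.red_flags_found for c in cases)
    total = sum(c.red_flags_total for c in cases)
    return {
        "diagnostic_accuracy": rate(c.diagnosis_correct for c in cases),
        "under_triage_rate": rate(c.under_triaged for c in cases),
        "red_flag_recognition": round(100 * found / total, 1) if total else 0.0,
        "hallucination_rate": rate(c.hallucinated for c in cases),
        "audit_pass_rate": rate(passes_audit(c) for c in cases),
    }
```

Under a gate of this kind, a single hallucinated fact fails a case even when the diagnosis is right, which is one way a high accuracy figure and a low safety pass rate can coexist.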

## Evidence: Three Key Findings Expose Safety Risks of Medical LLMs

1. **Outcome-Process Gap**: 93.3% diagnostic accuracy versus a 6.7% safety pass rate, a gap of 86.6 percentage points.
2. **Prevalence of Hallucinations**: 76.7% of cases contain fabricated clinical facts.
3. **Dangerous Success (FLAG)**: Many cases reach the correct diagnosis through flawed reasoning. These answers look trustworthy and can easily mislead clinical decisions.
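The arithmetic behind these figures can be checked with a short Python sketch. Only the three percentages and the 30-case total come from the post; the case counts are back-calculated from the rounded percentages, and the lower bound at the end treats "failed the safety audit" as a stand-in for flawed reasoning, which is an assumption.

```python
TOTAL_CASES = 30

# Percentages as reported in the post.
reported = {
    "diagnostic_accuracy": 93.3,
    "safety_pass_rate": 6.7,
    "hallucination_rate": 76.7,
}


def implied_count(percent, total=TOTAL_CASES):
    """Whole number of cases that a rounded percentage corresponds to."""
    return round(percent / 100 * total)


# Back-calculated counts: 28 correct diagnoses, 2 audit passes, 23 hallucinating cases.
counts = {name: implied_count(p) for name, p in reported.items()}

# The gap between two percentages is measured in percentage points.
gap_points = round(reported["diagnostic_accuracy"] - reported["safety_pass_rate"], 1)

# At most 2 of the 28 correctly diagnosed cases can have passed the audit,
# so at least 26 cases were diagnosed correctly and still failed it.
min_correct_but_failed = counts["diagnostic_accuracy"] - counts["safety_pass_rate"]

print(counts, gap_points, min_correct_but_failed)
```

The gap is a difference between two rates, so it is expressed in percentage points rather than as a percentage of either rate.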

## Open-Source Resources: Complete Experimental Pipeline Open to Support Reproduction and Expansion

The research team open-sourced all experimental resources: 30 clinical case JSON files, scoring result data (CSV/Excel), original model responses, evaluation scripts (`evaluate_vignettes.py` and `score_results.py`), case generation tools, and pre-registration plans. Other researchers can use them to reproduce the results, test other models, or extend the framework to other clinical fields.

## Implications: Medical AI Development Requires Multi-Dimensional Evaluation and Strict Auditing

Implications for developers and regulators:

1. Evaluation needs multi-dimensional indicators (process quality, consistency, hallucination detection, etc.), not accuracy alone.
2. Models must pass a comprehensive audit before deployment (e.g., the M11 clinical audit gate).
3. LLMs are suitable as second-opinion tools and require human review.
4. Hallucination mitigation is one of the top R&D priorities.

## Limitations and Future Directions: Case Scale to Be Expanded, Framework to Be Applied to More Fields

Study limitations: the number of cases was reduced from 50 to 30 because of time constraints, Gemini API limitations prevented access to logprobs, and the tested model had to be changed because of access restrictions. Future directions: expand the case library, test more models, apply the framework to fields such as dermatology and radiology, develop medical hallucination mitigation techniques, and explore optimal human-AI collaboration models.