Zing Forum

Racial Bias in Medical AI: When Large Language Models Meet Clinical Diagnosis, How Do We Practice 'Do No Harm'?

A recent study uses the EU AI Act as a governance framework to evaluate racial bias in five mainstream LLMs in clinical scenarios. All models deviate from real racial distributions in synthetic case generation, while DeepSeek V3 shows significant bias mitigation when embedded in an agent workflow.

Tags: Medical AI · Large Language Models · Racial Bias · Clinical Diagnosis · Agent Workflow · EU AI Act · Fairness Evaluation · DeepSeek · GPT-4
Published 2026-04-20 18:02 · Recent activity 2026-04-21 10:47 · Estimated read 5 min

Section 01

Introduction to Racial Bias Research in Medical AI: Fairness Challenges of LLMs and Mitigation Potential of Agents

This study uses the EU AI Act as a governance framework to evaluate racial bias in five mainstream LLMs in clinical scenarios. Key findings: all models show racial distribution deviations in synthetic case generation; DeepSeek V3 performs best on the differential diagnosis task; and embedding it in a retrieval-augmented agent workflow significantly improves its bias indicators. The study explores how medical AI can adhere to the ethical principle of 'do no harm' and avoid exacerbating health inequalities.


Section 02

Research Background: Sources of Bias in Medical LLMs and Limitations of Existing Studies

Bias in large language models stems from structural inequalities and stereotypes in training data, which can surface in medicine as skewed disease risk assessments. Previous studies have limitations: few compare multiple models, most focus on identifying problems rather than solving them, and none are guided by a systematic governance framework. This study fills these gaps by using the EU AI Act's fairness requirements for high-risk AI systems as the evaluation benchmark.


Section 03

Research Methods: Design of a Dual-Task Evaluation System

The study evaluates the models' implicit and explicit biases with two complementary tasks:

1. Synthetic case generation: compare the racial distribution of model-generated cases against real U.S. epidemiological distributions;
2. Differential diagnosis ranking: test whether the diagnosis rankings produced for patients of different races match expert gold standards, and whether they show systematic deviations.
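The first task's deviation measurement can be sketched with a simple distance between the generated and reference racial distributions. This is an illustrative sketch only: the reference proportions, case format, and choice of total variation distance are assumptions, not the study's actual data or statistic.

```python
from collections import Counter

def racial_distribution(cases):
    """Normalize the race labels of a list of generated cases to proportions."""
    counts = Counter(case["race"] for case in cases)
    total = sum(counts.values())
    return {race: n / total for race, n in counts.items()}

def total_variation_distance(generated, reference):
    """Half the L1 distance between two distributions; 0 = identical, 1 = disjoint."""
    races = set(generated) | set(reference)
    return 0.5 * sum(abs(generated.get(r, 0.0) - reference.get(r, 0.0)) for r in races)

# Hypothetical reference proportions (illustrative, not real epidemiological data)
reference = {"White": 0.60, "Black": 0.13, "Hispanic": 0.19, "Asian": 0.06, "Other": 0.02}

# Hypothetical model-generated cases, over-representing one group
cases = [{"race": "White"}] * 80 + [{"race": "Black"}] * 10 + [{"race": "Asian"}] * 10

gen = racial_distribution(cases)
print(f"TVD from reference: {total_variation_distance(gen, reference):.3f}")
```

A larger distance means the model's synthetic cohort strays further from the real-world distribution; the study itself reports deviations via statistical tests rather than this particular metric.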


Section 04

Key Findings: Prevalent Model Bias, Significant Effects of Agent Workflow

1. All tested models deviate from real racial distributions in synthetic case generation; GPT-4.1 deviates least but still shows bias;
2. DeepSeek V3 performs best overall on the differential diagnosis task;
3. Embedding DeepSeek V3 in an agent workflow significantly improves its bias indicators: the average p-value rises by 0.0348, the median p-value by 0.1166, and the average difference falls by 0.0949.
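The three summary indicators above can be reproduced mechanically once per-scenario test results exist. The sketch below shows one plausible aggregation, with entirely made-up p-values and differences standing in for the study's data; only the shape of the computation is meant to be informative.

```python
from statistics import mean, median

# Hypothetical per-scenario fairness results (illustrative values, NOT the study's data).
# Each entry holds the p-value of a distribution test and the absolute difference
# between model output and reference, before and after the agent workflow.
baseline = [
    {"p": 0.01, "diff": 0.30},
    {"p": 0.04, "diff": 0.25},
    {"p": 0.20, "diff": 0.10},
]
with_agent = [
    {"p": 0.06, "diff": 0.18},
    {"p": 0.15, "diff": 0.12},
    {"p": 0.35, "diff": 0.05},
]

def summarize(results):
    """Aggregate per-scenario results into the three summary indicators."""
    ps = [r["p"] for r in results]
    diffs = [r["diff"] for r in results]
    return {"mean_p": mean(ps), "median_p": median(ps), "mean_diff": mean(diffs)}

before, after = summarize(baseline), summarize(with_agent)
print(f"mean p:    {before['mean_p']:.4f} -> {after['mean_p']:.4f}")
print(f"median p:  {before['median_p']:.4f} -> {after['median_p']:.4f}")
print(f"mean diff: {before['mean_diff']:.4f} -> {after['mean_diff']:.4f}")
```

Rising p-values mean the race-distribution tests less often reject the null of "no difference", and a falling mean difference means outputs track the reference more closely, which is the direction of improvement the study reports.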

Section 05

Mechanisms of Bias Mitigation by Agent Workflow

Compared with traditional single-turn reasoning, the agent workflow makes three improvements:

1. External knowledge retrieval: querying authoritative medical databases and guidelines reduces reliance on biased internal memory;
2. Structured reasoning chain: decomposing diagnosis into subtasks makes biases easier to identify and correct;
3. Verifiable intermediate steps: each step can be audited, providing a concrete basis for bias detection.
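A skeleton of such a workflow might look like the following. Everything here is a hypothetical stand-in: the guideline table, the `llm` stub, and the subtask names are invented for illustration, and a real system would plug in an actual model and knowledge base.

```python
# Minimal sketch of the three-part agent workflow: retrieval, structured
# reasoning, and an auditable log of intermediate steps.

GUIDELINES = {  # stub for an external medical knowledge base (hypothetical content)
    "chest pain": "Evaluate ACS risk using symptom onset, ECG, and troponin; "
                  "risk scores apply identically across demographic groups.",
}

def retrieve_guideline(chief_complaint):
    """Step 1: external knowledge retrieval instead of model memory alone."""
    return GUIDELINES.get(chief_complaint, "no guideline found")

def llm(prompt):
    """Stand-in for a model call; returns a canned answer for this sketch."""
    return "1. acute coronary syndrome  2. pulmonary embolism  3. GERD"

def diagnose(case):
    audit_log = []  # Step 3: verifiable intermediate steps for bias auditing

    evidence = retrieve_guideline(case["chief_complaint"])
    audit_log.append(("retrieval", evidence))

    # Step 2: structured reasoning chain, decomposed into explicit subtasks
    for subtask in ("list findings", "rank differentials", "check guideline fit"):
        prompt = f"{subtask} | case: {case} | evidence: {evidence}"
        audit_log.append((subtask, llm(prompt)))

    ranking = audit_log[-1][1]  # final ranked differential
    return ranking, audit_log

ranking, log = diagnose({"chief_complaint": "chest pain", "age": 54})
for step, output in log:
    print(f"[{step}] {output}")
```

Because every retrieval result and subtask output lands in `audit_log`, an auditor can check exactly where a demographically skewed judgment entered the chain, which is the auditability property the section describes.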


Section 06

Practical Implications: Key Strategies for Building Fair Medical AI

1. Multi-dimensional evaluation: combine indicators such as p-values and average differences to capture bias comprehensively;
2. Architecture design: embedding LLMs in agent workflows is key to improving fairness;
3. Regulation-driven evaluation: using the EU AI Act as the benchmark clarifies compliance goals and the dimensions that matter.

Section 07

Research Limitations and Future Directions

Limitations: the evaluation is grounded in U.S. epidemiological data, so its applicability elsewhere needs verification, and the agent workflow's improvements are uneven across indicators. Future directions: compare different agent architectures, examine bias in multi-modal medical AI, and track how bias evolves during long-term clinical deployment.