# Data Leakage Pitfalls in Medical AI: A Comparative Study of Two Breast Cancer Recurrence Prediction Models

> This article examines an open-source breast cancer recurrence prediction project that exposes data leakage in medical machine learning by comparing two neural network models, and shows how to build a prediction system that is usable in clinical practice.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-05T21:15:35.000Z
- Last activity: 2026-06-05T21:17:53.848Z
- Popularity: 153.0
- Keywords: medical AI, data leakage, breast cancer, neural networks, machine learning, clinical prediction, SHAP interpretability, TensorFlow, Keras
- Page link: https://www.zingnex.cn/en/forum/thread/ai-52e01f5f
- Canonical: https://www.zingnex.cn/forum/thread/ai-52e01f5f

---

## Introduction

This article looks at the open-source GitHub project `breast-cancer-recurrence-ann`, which exposes data leakage in medical AI by comparing two neural network models and shows how to build a prediction system that is usable in clinical practice. Its core message for the medical AI community: strong-looking model metrics can hide fatal flaws, so systems must be designed around clinical reality.

## Project Background and Dataset Introduction

This project uses the German Breast Cancer Study Group (GBSG) dataset, which covers 686 patients with 11 clinical features (age, menopausal status, tumor size, etc.). The prediction target is whether a patient experiences recurrence or death. The `rfstime` field (recurrence-free survival time) is only known after follow-up, so it carries information from the future relative to the moment a prediction would be made. That makes it a potential source of data leakage.

## Dual Model Design: Key Comparison to Reveal Data Leakage

### Model A (Including Leaked Information)
Uses all features, including `rfstime`, in a deep neural network built with TensorFlow/Keras (3 hidden layers with regularization and dropout). It pursues high predictive performance but suffers from target leakage.

### Model B (Clinical Reality Simulation)
Excludes the `rfstime` field to simulate a real clinical setting, where survival time is unknown at prediction time. The classification threshold is tuned to reach Recall ≥ 0.75, prioritizing fewer missed cases (false negatives), which better matches clinical decision-making needs.
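
The threshold step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the helper names `recall_at` and `pick_threshold`, the use of the predicted probabilities as the candidate grid, and the toy data are all hypothetical. The idea is to take the highest threshold whose recall still meets the target, since raising the threshold generally trades recall for precision.

```python
def recall_at(y_true, y_prob, threshold):
    """Recall = TP / (TP + FN) when predicting positive for prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0


def pick_threshold(y_true, y_prob, min_recall=0.75):
    """Return the highest candidate threshold whose recall is >= min_recall.

    Candidates are the distinct predicted probabilities themselves.
    Falls back to 0.0 (flag every patient) if no candidate qualifies.
    """
    for t in sorted(set(y_prob), reverse=True):
        if recall_at(y_true, y_prob, t) >= min_recall:
            return t
    return 0.0


# Toy validation data (hypothetical, not taken from the GBSG dataset).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_prob = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.05]

threshold = pick_threshold(y_true, y_prob, min_recall=0.75)
print(threshold, recall_at(y_true, y_prob, threshold))
```

A threshold chosen this way should be selected on a validation split and then reported on a separate test split; tuning it on the test set would itself leak information into the evaluation.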

## Technical Implementation and Model Evaluation Details

- Evaluation metrics: Accuracy, Precision, Recall, F1-Score, ROC-AUC, with comparison to a logistic regression baseline model.
- Interpretability: Uses SHAP analysis to identify the most influential features, improving model transparency.
- Code structure: Includes modules for data preprocessing, model training and testing, performance visualization, SHAP analysis, etc.
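
For readers who want to see what the listed metrics measure, here is a dependency-free sketch. It is not the project's evaluation code: the function names and toy data are assumptions, and in practice a library routine such as scikit-learn's `roc_auc_score` would typically be used. ROC-AUC is computed here through its pairwise interpretation, the fraction of (positive, negative) pairs in which the positive case receives the higher score.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels and predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, y_score):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Runs in O(n_pos * n_neg), fine for small data.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy scores (hypothetical); predictions use a 0.5 threshold.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.3, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Accuracy, precision, recall and F1 depend on the chosen threshold, while ROC-AUC does not, which is one reason to report both kinds of metric side by side.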

## Core Findings: How Data Leakage Distorts Model Performance

Model A reaches artificially high accuracy because it uses `rfstime`, which is future information, and is therefore unusable in clinical practice. Model B, with sound feature selection and threshold optimization, is better suited to real use. Leakage of this kind is common in medical data: fields such as survival time and follow-up results are only known after the outcome, and they let a model 'cheat'.
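
The effect is easy to reproduce on synthetic data. The sketch below does not use the GBSG dataset or the project's models; the distributions, the variable names, and the choice to score each feature by its single-feature ROC-AUC are all assumptions made for illustration. It contrasts a marker known at baseline, which is only weakly related to the outcome, with a follow-up time that is short precisely because the event happened.

```python
import random


def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)  # fixed seed so the simulation is repeatable
status, baseline_marker, time_to_event = [], [], []
for _ in range(400):
    event = 1 if rng.random() < 0.45 else 0
    status.append(event)
    # Known at baseline: shifted slightly upward for patients who relapse.
    baseline_marker.append(rng.gauss(0.5 if event else 0.0, 1.0))
    # Known only after follow-up: an event stops the clock early.
    if event:
        time_to_event.append(rng.uniform(100, 1500))
    else:
        time_to_event.append(rng.uniform(1200, 2600))

auc_baseline = roc_auc(status, baseline_marker)
# A shorter recorded time means higher risk, so negate it to use it as a score.
auc_leaky = roc_auc(status, [-t for t in time_to_event])

print(f"baseline marker AUC: {auc_baseline:.3f}")
print(f"follow-up time AUC:  {auc_leaky:.3f}")
```

In this simulation the follow-up time separates the two groups almost perfectly, while the baseline marker does only modestly better than chance. A model given both would lean on the leaked column and post impressive metrics that cannot be reproduced at the point of care.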

## Practical Significance and Insights for Medical AI Developers

1. Data leakage checks should be a standard step: review when each feature becomes available relative to the time of prediction.
2. Evaluation metrics need to align with clinical goals (e.g., prioritizing Recall to reduce missed diagnoses).
3. Model interpretability (e.g., via SHAP) is necessary to build trust with clinicians.
4. Bridging the gap between academic research and clinical deployment requires accounting for real-world constraints.
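
The first point can be turned into a small, explicit audit. The following is a sketch under stated assumptions, not part of the project: the column names follow the public GBSG dataset, but the availability labels and the helper `split_by_availability` are hypothetical. Each column is tagged with the moment it becomes known, and only columns known at baseline are allowed into training.

```python
# Hypothetical availability map: when each column becomes known.
# The timing labels are assumptions made for this sketch.
FEATURE_AVAILABILITY = {
    "age": "at_baseline",
    "meno": "at_baseline",
    "size": "at_baseline",
    "grade": "at_baseline",
    "nodes": "at_baseline",
    "pgr": "at_baseline",
    "er": "at_baseline",
    "hormon": "at_baseline",
    "rfstime": "after_follow_up",
}


def split_by_availability(columns, availability, allowed=("at_baseline",)):
    """Split columns into those usable at prediction time and those to exclude.

    Columns missing from the map are excluded as well, so an unreviewed
    feature cannot slip into training by default.
    """
    usable, excluded = [], []
    for col in columns:
        if availability.get(col) in allowed:
            usable.append(col)
        else:
            excluded.append(col)
    return usable, excluded


columns = ["age", "meno", "size", "grade", "nodes", "pgr", "er", "hormon", "rfstime"]
usable, excluded = split_by_availability(columns, FEATURE_AVAILABILITY)
print("train on:", usable)
print("excluded:", excluded)
```

Keeping the map in version control next to the training code makes the temporal review visible in code review, rather than leaving it as an unwritten convention.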

## Conclusion: Towards Responsible Medical AI

Although this project is small in scale, it touches on core issues in medical AI: technical performance has to be weighed together with medical ethics and clinical reality. Data leakage is not just a technical bug; it also reflects an insufficient understanding of the application scenario. The project offers a learning resource for medical AI developers and underlines the need for a responsible approach in healthcare.