Zing Forum

PRIME-CVD: A Privacy-Preserving Simulated Cardiovascular Risk Dataset for Medical Informatics Education

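An open-source educational dataset from the UNSW Centre for Big Data Research in Health. It uses a causal DAG to generate 50,000 simulated patient records and ships in two versions, a clean cohort and realistic EMR-style "dirty" data, to support teaching causal inference, survival analysis, data cleaning, and other medical informatics topics.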

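Tags: medical informatics, cardiovascular risk, simulated data, privacy protection, causal inference, DAG, electronic medical records, EMR, survival analysis, data cleaning
Published 2026/04/10 13:35 | Last activity 2026/04/10 13:47 | Estimated reading time: 4 minutes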

Section 01

PRIME-CVD: Open-Source Privacy-Protected Dataset for Medical Informatics Education

PRIME-CVD is an open-source educational dataset developed by the UNSW Centre for Big Data Research in Health (CBDRH). It generates 50,000 simulated patient records from a causal directed acyclic graph (DAG) and comes in two versions: a clean, analysis-ready cohort and realistic EMR-style "dirty" data. It supports teaching causal inference, survival analysis, data cleaning, and related topics, and because every record is simulated, no real patient information is exposed.


Section 02

Background: Data Access vs Privacy Dilemma in Medical Informatics

Medical informatics education faces a long-standing tension between data access and privacy. Real EMR data is sensitive and tightly regulated, so it is hard to share with students, while fully synthetic data often lacks real-world complexity. PRIME-CVD aims to bridge this gap with simulated data that is both privacy-safe and realistic.


Section 03

Dataset Composition: Dual Versions for Diverse Scenarios

  • Clean Cohort: 50,000 longitudinal records covering demographics (age, IRSD, the Index of Relative Socio-economic Disadvantage), lifestyle factors (smoking, BMI), clinical indicators (diabetes, HbA1c), and cardiovascular status. Suitable for survival analysis and causal effect estimation.
  • EMR-style Data: Relational tables with heterogeneous formats, missing values, and unit inconsistencies (e.g., blood pressure recorded in either mmHg or kPa). Intended for training in data cleaning and record linkage.
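
As an illustration of the kind of cleaning task the EMR-style version is meant to exercise, the following minimal sketch harmonizes blood pressure readings recorded in mixed units. The field names (`patient_id`, `sbp`, `sbp_unit`) and the sample rows are hypothetical and are not taken from the actual PRIME-CVD schema; only the kPa-to-mmHg conversion factor is a fixed physical constant.

```python
KPA_TO_MMHG = 7.50062  # 1 kPa is approximately 7.50062 mmHg


def to_mmhg(value, unit):
    """Return a blood pressure reading in mmHg, or None if it cannot be converted."""
    if value is None:
        return None  # keep missing values missing rather than guessing
    unit = (unit or "").strip().lower()
    if unit == "mmhg":
        return float(value)
    if unit == "kpa":
        return float(value) * KPA_TO_MMHG
    return None  # unknown unit: flag for manual review


# Hypothetical rows; column names are illustrative, not the dataset's schema.
rows = [
    {"patient_id": 1, "sbp": 120, "sbp_unit": "mmHg"},
    {"patient_id": 2, "sbp": 16.0, "sbp_unit": "kPa"},
    {"patient_id": 3, "sbp": None, "sbp_unit": "mmHg"},
]

# Add a harmonized column while leaving the raw value and unit untouched.
cleaned = [{**r, "sbp_mmhg": to_mmhg(r["sbp"], r["sbp_unit"])} for r in rows]
```

Keeping the raw value and unit next to the harmonized column makes it easy to audit conversions later, which is usually part of what a data cleaning exercise is meant to teach.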

Section 04

Technical Method: Causal DAG-Driven Generation

  • Causal DAG: Hand-built to model cardiovascular risk factor relationships (e.g., smoking→CVD, age→diabetes→CVD).
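  • Parameters: Drawn from authoritative sources, including the Australian Bureau of Statistics (ABS), the Australian Institute of Health and Welfare (AIHW), and published studies.
  • Reproducibility: Generation is deterministic, so the same seed produces identical data. This allows reference answers for assignments and fair comparisons between methods.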
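
The idea behind seeded, DAG-driven generation can be sketched in a few lines. This is a toy illustration only, not the PRIME-CVD generator: it uses just the two example paths mentioned above (smoking→CVD and age→diabetes→CVD), and the function name `simulate_cohort`, the distributions, and every coefficient are assumptions made up for the example.

```python
import math
import random


def sigmoid(x):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_cohort(n, seed):
    """Sample n toy patients by visiting DAG nodes in topological order."""
    rng = random.Random(seed)  # local generator: output depends only on the seed
    patients = []
    for _ in range(n):
        # Root nodes (no parents). Distributions are illustrative assumptions.
        age = rng.uniform(30, 80)
        smoker = rng.random() < 0.15
        # age -> diabetes (coefficients are made up)
        diabetes = rng.random() < sigmoid(-4.0 + 0.04 * age)
        # smoking -> CVD and diabetes -> CVD (coefficients are made up)
        cvd = rng.random() < sigmoid(-3.0 + 0.7 * smoker + 0.9 * diabetes)
        patients.append(
            {"age": age, "smoker": smoker, "diabetes": diabetes, "cvd": cvd}
        )
    return patients


# Same seed, same data: this is what makes reference answers possible.
first = simulate_cohort(1000, seed=2026)
second = simulate_cohort(1000, seed=2026)
```

Each variable is drawn only after its parents, so the sampled data respects the causal structure by construction, and using a dedicated `random.Random(seed)` instance avoids interference from other code that touches the global random state.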

Section 05

Educational Resources & Application Scenarios

  • Resources: A series of blog posts and notebooks (dataset introduction, assessment design, and core concepts such as discrimination and calibration), plus Python and R quickstart notebooks.
  • Use Cases: Educators can design assignments and exams, students can practice skills on realistic data, and researchers can test algorithms and validate methods.
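
For readers unfamiliar with the two evaluation concepts named above, here is a minimal, dependency-free sketch. Discrimination is measured as the AUC (the probability that a randomly chosen patient with the event receives a higher predicted risk than one without), and calibration is summarized by calibration-in-the-large (mean predicted risk minus observed event rate). The function names and the four-patient example are hypothetical and do not come from the PRIME-CVD notebooks.

```python
def auc(y_true, y_score):
    """Pairwise concordance between event and non-event cases (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(y_true, y_score):
    """Mean predicted risk minus observed event rate (0 means no overall bias)."""
    return sum(y_score) / len(y_score) - sum(y_true) / len(y_true)


# Hypothetical outcomes and predicted risks for four patients.
outcomes = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]

discrimination = auc(outcomes, risks)                  # 3 of 4 pairs concordant
overall_bias = calibration_in_the_large(outcomes, risks)  # negative: risk underestimated
```

A model can rank patients well (high AUC) while still systematically over- or underestimating risk, which is why the two properties are usually assessed separately.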

Section 06

Core Advantages of PRIME-CVD

Feature             Description
Privacy Safety      Fully synthetic data containing no real patient records
Education-Oriented  Explicit causal DAG and deliberately designed EMR artifacts
Reproducible        Deterministic generation, so the dataset can be rebuilt exactly
Dual Assets         Clean cohort plus dirty EMR data, covering the full analysis workflow
Open Access         Code, data, and tutorials are all open-source