Zing Forum

PRIME-CVD: A Privacy-Protected Cardiovascular Risk Simulation Dataset for Medical Informatics Education

An open-source educational dataset from the Centre for Big Data Research in Health (CBDRH) at UNSW, containing 50,000 simulated patient records generated from a causal Directed Acyclic Graph (DAG). It comes in two versions, a clean cohort and realistic EMR-style "dirty" data, and supports medical informatics teaching in areas such as causal inference, survival analysis, and data cleaning.

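Medical informatics, Cardiovascular risk, Simulated data, Privacy protection, Causal inference, DAG, Electronic medical records, EMR, Survival analysis, Data cleaning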
Published 2026-04-10 13:35 | Recent activity 2026-04-10 13:47 | Estimated read 4 min

Section 01

PRIME-CVD: Open-Source Privacy-Protected Dataset for Medical Informatics Education

PRIME-CVD is an open-source educational dataset developed by the Centre for Big Data Research in Health (CBDRH) at UNSW. Its 50,000 simulated patient records are generated from a causal Directed Acyclic Graph (DAG) and released in two versions: a clean, analysis-ready cohort and realistic "dirty" data in the style of an electronic medical record (EMR). It supports teaching of causal inference, survival analysis, and data cleaning, and because every record is simulated, no real patient information is exposed.

Section 02

Background: Data Access vs Privacy Dilemma in Medical Informatics

Medical informatics education faces a long-standing conflict between data access and privacy. Real EMR data is sensitive and tightly regulated, which makes it hard to share, while simple synthetic data often lacks real-world complexity. PRIME-CVD is designed to address this by providing simulated data that is privacy-safe yet realistic.

Section 03

Dataset Composition: Dual Versions for Diverse Scenarios

  • Clean Cohort: 50,000 longitudinal records with variables covering demographics (age, IRSD, the Index of Relative Socio-economic Disadvantage), lifestyle (smoking, BMI), clinical indicators (diabetes, HbA1c), and cardiovascular status. Suitable for survival analysis and causal effect estimation.
  • EMR-style Data: Relational tables with heterogeneous formats, missing values, and unit inconsistencies (e.g., blood pressure recorded in either mmHg or kPa). Used for training in data cleaning and record linkage.
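
As an illustration of the kind of cleaning exercise the EMR-style data invites, here is a minimal sketch of harmonising blood pressure units. The column names (`patient_id`, `sbp`, `sbp_unit`) and the sample rows are hypothetical and are not taken from the dataset's actual schema; only the kPa to mmHg conversion factor is a fixed physical constant.

```python
KPA_TO_MMHG = 7.50062  # 1 kPa is approximately 7.50062 mmHg


def systolic_to_mmhg(value, unit=None):
    """Return systolic blood pressure in mmHg, or None if missing or unrecognised."""
    if value is None:
        return None
    unit = (unit or "").strip().lower()
    if unit == "kpa":
        return round(value * KPA_TO_MMHG, 1)
    if unit == "mmhg":
        return float(value)
    return None  # unknown unit: flag for review rather than guess


# Hypothetical rows; field names are illustrative only.
rows = [
    {"patient_id": 1, "sbp": 128, "sbp_unit": "mmHg"},
    {"patient_id": 2, "sbp": 17.0, "sbp_unit": "kPa"},
    {"patient_id": 3, "sbp": None, "sbp_unit": "mmHg"},
]

# Add a harmonised column while keeping the raw value and unit for auditing.
cleaned = [dict(r, sbp_mmhg=systolic_to_mmhg(r["sbp"], r["sbp_unit"])) for r in rows]
```

Keeping the raw columns alongside the harmonised one lets students check each conversion, and returning None for unknown units avoids silently mixing scales.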

Section 04

Technical Method: Causal DAG-Driven Generation

  • Causal DAG: Hand-built to model the relationships among cardiovascular risk factors (e.g., smoking→CVD, age→diabetes→CVD).
  • Parameters: Drawn from authoritative sources, including the Australian Bureau of Statistics (ABS), the Australian Institute of Health and Welfare (AIHW), and published studies.
  • Reproducibility: Generation is deterministic: the same seed produces identical data, which enables reference answers and fair comparisons.
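
The idea of seeded, DAG-driven generation can be sketched in a few lines. This is not the PRIME-CVD generator: it uses only the example edges named above (smoking→CVD, age→diabetes→CVD), and every coefficient, prevalence, and function name is an invented placeholder rather than a value from ABS, AIHW, or the project.

```python
import math
import random


def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate_patient(rng):
    """Sample each node after its parents, i.e. in topological order of the DAG."""
    age = rng.uniform(30, 80)          # root node (placeholder range)
    smoker = rng.random() < 0.15       # root node (placeholder prevalence)
    # Edge age -> diabetes (placeholder coefficients)
    diabetes = rng.random() < sigmoid(-4.0 + 0.04 * age)
    # Edges smoking -> CVD and diabetes -> CVD (placeholder coefficients)
    cvd = rng.random() < sigmoid(-3.0 + 0.7 * smoker + 0.8 * diabetes)
    return {"age": round(age, 1), "smoker": smoker, "diabetes": diabetes, "cvd": cvd}


def simulate_cohort(n, seed):
    """Generate n records from a private, seeded generator."""
    rng = random.Random(seed)  # avoids dependence on global random state
    return [simulate_patient(rng) for _ in range(n)]


cohort_a = simulate_cohort(1000, seed=42)
cohort_b = simulate_cohort(1000, seed=42)
```

Because each call builds its own `random.Random(seed)`, two runs with the same seed yield identical cohorts, which is the property that makes reference answers possible.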

Section 05

Educational Resources & Application Scenarios

  • Resources: A series of blog posts and notebooks (dataset introduction, assessment design, and core concepts such as discrimination and calibration), plus Python and R quickstart notebooks.
  • Use Cases: Educators (designing assignments and exams), students (practicing real-world data skills), and researchers (testing algorithms and validating methods).
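
For readers new to the two evaluation concepts mentioned above, the following sketch computes a simple discrimination measure (the C-statistic, equivalent to AUC for a binary outcome) and a simple calibration measure (calibration-in-the-large). The function names and toy numbers are illustrative assumptions and are not drawn from the PRIME-CVD notebooks.

```python
def c_statistic(y_true, y_prob):
    """Share of case/non-case pairs where the case has the higher predicted risk.

    Ties count as half. This pairwise loop is O(n^2) and intended for teaching only.
    """
    cases = [p for y, p in zip(y_true, y_prob) if y == 1]
    controls = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def calibration_in_the_large(y_true, y_prob):
    """Observed event rate minus mean predicted risk; 0 means calibrated on average."""
    return sum(y_true) / len(y_true) - sum(y_prob) / len(y_prob)


# Toy outcomes and predicted risks (made up for illustration).
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc = c_statistic(y_true, y_prob)               # 8 of 9 pairs are concordant
gap = calibration_in_the_large(y_true, y_prob)  # positive: risk is under-predicted on average
```

Discrimination asks whether higher-risk patients are ranked above lower-risk ones, while calibration asks whether the predicted probabilities match observed frequencies; a model can do well on one and poorly on the other.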

Section 06

Core Advantages of PRIME-CVD

Feature             Description
------------------  ------------------------------------------------------------
Privacy Safety      Fully synthetic data; contains no real patient records
Education-Oriented  Clear DAG and deliberately designed EMR artifacts
Reproducible        Deterministic process allows the dataset to be reconstructed
Dual Assets         Clean cohort plus dirty EMR data, covering the full analysis workflow
Open Access         Code, data, and tutorials are all open-source