Cardiovascular Disease Prediction: Multi-Model Comparison and Ensemble Optimization Based on the Cleveland Dataset

A complete machine learning workflow was built using the Cleveland Heart Disease Dataset, comparing logistic regression, neural networks, and ensemble learning models, achieving an accuracy of 91.67% through Optuna hyperparameter optimization.

Tags: cardiovascular disease prediction · machine learning · logistic regression · neural networks · ensemble learning · Optuna optimization · Cleveland dataset · medical AI
Published 2026-05-12 11:25 · Recent activity 2026-05-12 11:30 · Estimated read 6 min

Section 01

Introduction to Cardiovascular Disease Prediction Research

This study builds a complete machine learning prediction workflow on the Cleveland Heart Disease Dataset, comparing logistic regression, neural network, and ensemble learning models. Through techniques such as Optuna hyperparameter optimization, it ultimately reaches 91.67% accuracy and a 0.9632 ROC-AUC. The research covers data preprocessing optimization and model tuning, offering a reference solution for early identification of cardiovascular disease risk.


Section 02

Research Background and Dataset Introduction

Cardiovascular disease is a major global health threat. Traditional risk assessment relies on experience and simple indicators, making it difficult to fully utilize multi-dimensional data. Machine learning technology provides new possibilities for early prediction.

The project uses the Cleveland Heart Disease Dataset (303 records, 14 clinical features) from UCI and Kaggle. Features include age, gender, chest pain type, etc., with a binary classification label as the target. The dataset has undergone preprocessing such as missing value handling, outlier detection, and standardization.
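The preprocessing steps above can be sketched as follows. This is a minimal illustration with synthetic data standing in for `heart_cleveland_upload.csv` (the real dataset has 303 records and 13 predictor features plus the label); the column handling is assumed, not taken from the project's code.

```python
# Sketch of the preprocessing stage: stratified split + Z-Score standardization.
# Synthetic data stands in for the Cleveland dataset (303 records, 13 features).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))      # 13 clinical features, 303 records
y = rng.integers(0, 2, size=303)    # binary heart-disease label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Z-Score standardization: fit on the training split only to avoid leakage
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# SMOTE oversampling (imblearn.over_sampling.SMOTE) would then be applied
# to X_train_std / y_train to balance the classes before model fitting.
print(X_train_std.shape)
```

Fitting the scaler on the training split and only transforming the test split keeps test-set statistics out of the model, which matters for honest accuracy estimates on a dataset this small.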


Section 03

Model Design and Methodology

The project compares multiple models:

  1. Logistic Regression: the baseline model; after Z-Score standardization and SMOTE oversampling, test-set accuracy is 91.67% and ROC-AUC is 0.9520;
  2. Neural Network: built with Keras, with Dropout, batch normalization, and early stopping; accuracy is 88.33% and ROC-AUC is 0.9632;
  3. Ensemble Learning: a soft-voting strategy fuses the base learners, balancing accuracy and ROC-AUC.
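A minimal sketch of the soft-voting ensemble is shown below. The project's neural network is Keras-based; here scikit-learn's `MLPClassifier` stands in so the whole example runs with one library, and synthetic data replaces the Cleveland set.

```python
# Soft-voting ensemble sketch: averages predicted probabilities across
# base learners instead of hard-voting on class labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                              random_state=42)),
    ],
    voting="soft",  # average class probabilities across base learners
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Soft voting requires every base learner to expose `predict_proba`, which is why probability-producing models were chosen as the base learners.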
4

Section 04

Key Technical Optimization Points

  1. Data Preprocessing: Z-Score standardization removes scale differences between features; SMOTE addresses class imbalance;
  2. Hyperparameter Optimization: Optuna Bayesian optimization (100 trials) improves tuning efficiency;
  3. Threshold Adjustment: the decision threshold is chosen to maximize the F1 score, increasing ROC-AUC to 0.9632;
  4. Cross-Validation: 10-fold stratified cross-validation ensures stable evaluation.
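Points 3 and 4 above can be sketched together. The Optuna search (point 2) would wrap model construction in an `objective(trial)` function passed to `study.optimize`; it is omitted here to keep the example to scikit-learn, and synthetic data again stands in for the real dataset.

```python
# Sketch of F1-based threshold tuning and 10-fold stratified cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Scan candidate thresholds and keep the one that maximizes F1
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

# 10-fold stratified CV: each fold preserves the class ratio of the full set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"best F1 threshold: {best_t:.2f}, CV accuracy: {scores.mean():.3f}")
```

Stratification matters here because, with roughly 30 samples per fold, an unstratified split could leave a fold with a badly skewed class ratio and an unstable score.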

Section 05

Comparative Analysis of Experimental Results

Performance of each model: Logistic regression leads in accuracy (91.67%), neural network has the best ROC-AUC (0.9632), and the ensemble model balances both. After tuning, the final model achieves both 91.67% accuracy and 0.9632 ROC-AUC.

The results show that for small to medium-sized tabular data, traditional models (such as logistic regression) can achieve high prediction levels when combined with feature engineering and optimization.


Section 06

Visualization and Project Usage Guide

The project generates visualization charts such as ROC curve comparison, confusion matrix, and feature importance ranking to help understand model decisions.
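The ROC-curve and confusion-matrix plots can be reproduced along these lines; synthetic scores stand in for the project's model outputs, and the output filename is illustrative.

```python
# Hedged sketch of two of the project's diagnostic plots: ROC curve and
# confusion matrix, rendered headlessly to a PNG file.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 200)
# Synthetic scores correlated with the labels, in place of real predictions
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
ax1.plot([0, 1], [0, 1], "--", color="gray")  # chance diagonal
ax1.set(xlabel="False positive rate", ylabel="True positive rate",
        title="ROC curve")
ax1.legend()
ax2.imshow(cm, cmap="Blues")
ax2.set(title="Confusion matrix", xlabel="Predicted", ylabel="Actual")
fig.savefig("model_diagnostics.png")
```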

The project structure is clear: main.py implements the end-to-end pipeline, the dataset is heart_cleveland_upload.csv, and models are saved as pickle files. Users can reproduce the experiment by running main.py after installing dependencies, and the README document provides detailed instructions.
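Since the trained models are saved as pickle files, reusing one for inference looks roughly like this. The filename `heart_model.pkl` is illustrative, not necessarily the name `main.py` writes; a model is trained inline here so the snippet is self-contained.

```python
# Sketch of the save/reload cycle for a pickled model.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("heart_model.pkl", "wb") as f:  # main.py persists models like this
    pickle.dump(model, f)

with open("heart_model.pkl", "rb") as f:  # later: reload for inference
    loaded = pickle.load(f)

print(loaded.predict(X[:5]))
```

Pickle files should only be loaded from trusted sources (such as your own training run), since unpickling executes arbitrary code.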


Section 07

Limitations and Future Improvement Directions

Limitations: the dataset is small, the cohort is geographically limited (a single Cleveland site), and interpretability and fairness receive little discussion.

Future directions: Introduce diverse large-scale datasets, explore advanced deep learning architectures, develop model interpretation tools, and deploy as clinical auxiliary tools.