Cardiovascular Disease Prediction: Multi-Model Comparison and Ensemble Optimization Based on the Cleveland Dataset

A complete machine learning workflow was built using the Cleveland Heart Disease Dataset, comparing logistic regression, neural networks, and ensemble learning models, achieving an accuracy of 91.67% through Optuna hyperparameter optimization.

Tags: cardiovascular disease prediction · machine learning · logistic regression · neural networks · ensemble learning · Optuna optimization · Cleveland dataset · medical AI
Published 2026-05-12 11:25 · Recent activity 2026-05-12 11:30 · Estimated read 6 min

Section 01

Introduction to Cardiovascular Disease Prediction Research

This study builds a complete machine learning prediction workflow on the Cleveland Heart Disease Dataset, comparing logistic regression, neural network, and ensemble learning models. Through techniques such as Optuna hyperparameter optimization, it ultimately reaches 91.67% accuracy and a 0.9632 ROC-AUC. The research covers data preprocessing optimization and model tuning, offering a reference solution for early identification of cardiovascular disease risk.


Section 02

Research Background and Dataset Introduction

Cardiovascular disease is a major global health threat. Traditional risk assessment relies on experience and simple indicators, making it difficult to fully utilize multi-dimensional data. Machine learning technology provides new possibilities for early prediction.

The project uses the Cleveland Heart Disease Dataset (303 records, 14 clinical features) from UCI and Kaggle. Features include age, gender, chest pain type, etc., with a binary classification label as the target. The dataset has undergone preprocessing such as missing value handling, outlier detection, and standardization.
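The preprocessing steps above can be sketched as follows. This is a minimal illustration with synthetic data standing in for `heart_cleveland_upload.csv` (the real dataset has 303 records and 13 predictor features plus the label); the column handling is assumed, not taken from the project's code.

```python
# Sketch of the preprocessing stage: stratified split + Z-Score standardization.
# Synthetic data stands in for the Cleveland dataset (303 records, 13 features).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(303, 13))      # 13 clinical features, 303 records
y = rng.integers(0, 2, size=303)    # binary heart-disease label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Z-Score standardization: fit on the training split only to avoid leakage
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# SMOTE oversampling (imblearn.over_sampling.SMOTE) would then be applied
# to X_train_std / y_train to balance the classes before model fitting.
print(X_train_std.shape)
```

Fitting the scaler on the training split and only transforming the test split keeps test-set statistics out of the model, which matters for honest accuracy estimates on a dataset this small.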


Section 03

Model Design and Methodology

The project compares multiple models:

  1. Logistic Regression: the baseline model; after Z-Score standardization and SMOTE oversampling, test-set accuracy is 91.67% and ROC-AUC is 0.9520;
  2. Neural Network: built with Keras, with Dropout, batch normalization, and early stopping; accuracy is 88.33% and ROC-AUC is 0.9632;
  3. Ensemble Learning: a soft-voting strategy fuses the base learners, balancing accuracy and ROC-AUC.
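A minimal sketch of the soft-voting ensemble is shown below. The project's neural network is Keras-based; here scikit-learn's `MLPClassifier` stands in so the whole example runs with one library, and synthetic data replaces the Cleveland set.

```python
# Soft-voting ensemble sketch: averages predicted probabilities across
# base learners instead of hard-voting on class labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500,
                              random_state=42)),
    ],
    voting="soft",  # average class probabilities across base learners
)
ensemble.fit(X_tr, y_tr)
acc = ensemble.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Soft voting requires every base learner to expose `predict_proba`, which is why probability-producing models were chosen as the base learners.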
4

Section 04

Key Technical Optimization Points

  1. Data Preprocessing: Z-Score standardization removes scale differences between features; SMOTE addresses class imbalance;
  2. Hyperparameter Optimization: Optuna Bayesian optimization (100 trials) improves tuning efficiency;
  3. Threshold Adjustment: the decision threshold is chosen to maximize the F1 score, increasing ROC-AUC to 0.9632;
  4. Cross-Validation: 10-fold stratified cross-validation ensures stable evaluation.
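Points 3 and 4 above can be sketched together. The Optuna search (point 2) would wrap model construction in an `objective(trial)` function passed to `study.optimize`; it is omitted here to keep the example to scikit-learn, and synthetic data again stands in for the real dataset.

```python
# Sketch of F1-based threshold tuning and 10-fold stratified cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# Scan candidate thresholds and keep the one that maximizes F1
thresholds = np.linspace(0.1, 0.9, 81)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(f1s))]

# 10-fold stratified CV: each fold preserves the class ratio of the full set
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"best F1 threshold: {best_t:.2f}, CV accuracy: {scores.mean():.3f}")
```

Stratification matters here because, with roughly 30 samples per fold, an unstratified split could leave a fold with a badly skewed class ratio and an unstable score.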

Section 05

Comparative Analysis of Experimental Results

Performance of each model: Logistic regression leads in accuracy (91.67%), neural network has the best ROC-AUC (0.9632), and the ensemble model balances both. After tuning, the final model achieves both 91.67% accuracy and 0.9632 ROC-AUC.

The results show that for small to medium-sized tabular data, traditional models (such as logistic regression) can achieve high prediction levels when combined with feature engineering and optimization.


Section 06

Visualization and Project Usage Guide

The project generates visualization charts such as ROC curve comparison, confusion matrix, and feature importance ranking to help understand model decisions.
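The ROC-curve and confusion-matrix plots can be reproduced along these lines; synthetic scores stand in for the project's model outputs, and the output filename is illustrative.

```python
# Hedged sketch of two of the project's diagnostic plots: ROC curve and
# confusion matrix, rendered headlessly to a PNG file.
import matplotlib
matplotlib.use("Agg")  # no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, confusion_matrix, roc_curve

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 200)
# Synthetic scores correlated with the labels, in place of real predictions
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 200), 0, 1)

fpr, tpr, _ = roc_curve(y_true, y_score)
cm = confusion_matrix(y_true, (y_score >= 0.5).astype(int))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.3f}")
ax1.plot([0, 1], [0, 1], "--", color="gray")  # chance diagonal
ax1.set(xlabel="False positive rate", ylabel="True positive rate",
        title="ROC curve")
ax1.legend()
ax2.imshow(cm, cmap="Blues")
ax2.set(title="Confusion matrix", xlabel="Predicted", ylabel="Actual")
fig.savefig("model_diagnostics.png")
```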

The project structure is clear: main.py implements the end-to-end pipeline, the dataset is heart_cleveland_upload.csv, and models are saved as pickle files. Users can reproduce the experiment by running main.py after installing dependencies, and the README document provides detailed instructions.
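Since the trained models are saved as pickle files, reusing one for inference looks roughly like this. The filename `heart_model.pkl` is illustrative, not necessarily the name `main.py` writes; a model is trained inline here so the snippet is self-contained.

```python
# Sketch of the save/reload cycle for a pickled model.
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=13, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with open("heart_model.pkl", "wb") as f:  # main.py persists models like this
    pickle.dump(model, f)

with open("heart_model.pkl", "rb") as f:  # later: reload for inference
    loaded = pickle.load(f)

print(loaded.predict(X[:5]))
```

Pickle files should only be loaded from trusted sources (such as your own training run), since unpickling executes arbitrary code.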


Section 07

Limitations and Future Improvement Directions

Limitations: the dataset is small, the cohort is geographically limited (a single Cleveland site), and interpretability and fairness receive little discussion.

Future directions: Introduce diverse large-scale datasets, explore advanced deep learning architectures, develop model interpretation tools, and deploy as clinical auxiliary tools.