Zing Forum

Diabetes Prediction Based on Lifestyle Indicators: A Complete End-to-End Machine Learning Practice

A complete data science project using BRFSS 2015 health survey data, combining unsupervised clustering and gradient boosting models to achieve high-recall diabetes risk prediction under class imbalance.

Machine Learning, Diabetes Prediction, Unsupervised Learning, K-Means, Gaussian Mixture Model, XGBoost, LightGBM, SMOTE, Class Imbalance, Medical AI
Published 2026-06-08 14:15 | Recent activity 2026-06-08 14:19 | Estimated read 4 min

Section 01

[Introduction] End-to-End Machine Learning Practice for Diabetes Prediction Based on Lifestyle Indicators

This project is a complete data science exercise built on the U.S. CDC's BRFSS 2015 health survey data. It combines unsupervised clustering (K-Means and Gaussian mixture models) with gradient boosting models (XGBoost and LightGBM) and uses SMOTE oversampling to address class imbalance, with the goal of high-recall diabetes risk prediction. It covers the full workflow: data cleaning, feature engineering, model training, and evaluation.


Section 02

Project Background and Motivation

Diabetes is a major global public health challenge. Traditional screening, which relies on clinical judgment and a single blood glucose indicator, can easily miss early cases. This project takes a data-driven approach, using the BRFSS 2015 dataset (over 250,000 records and 21 features covering lifestyle and health status) to build an end-to-end prediction pipeline that supports early identification of high-risk groups.


Section 03

Data Processing and Model Construction Methods

  1. Data preprocessing: remove 24,206 duplicate records and separate continuous variables (BMI, health days) from categorical variables (hypertension, smoking status, etc.);
  2. Unsupervised learning: cluster with K-Means (k=4) and GMM (k=3) to discover population health patterns, visualized with radar charts;
  3. Feature engineering: select features with SelectKBest (ANOVA F-value) and use ColumnTransformer to handle numerical and categorical features;
  4. Class imbalance: synthesize minority-class samples with SMOTE, with cross-validation set up to avoid data leakage.
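
Item 4 is the step that is easiest to get wrong, so here is a minimal sketch of the idea in plain Python. It is not the project's code and not the imbalanced-learn implementation: `smote_like` and `oversample_training_fold` are hypothetical helpers, the toy rows are invented, and the sketch assumes the training fold holds at least two minority samples. The point it illustrates is that synthetic minority samples are generated from the training fold only, so the validation fold contains nothing but real records.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Up to k nearest neighbours of `base` among the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def oversample_training_fold(X_train, y_train, seed=0):
    """Balance the training fold only; the validation fold is never touched."""
    minority = [x for x, label in zip(X_train, y_train) if label == 1]
    n_majority = sum(1 for label in y_train if label == 0)
    extra = smote_like(minority, n_majority - len(minority), seed=seed)
    return X_train + extra, y_train + [1] * len(extra)


# Invented toy rows: (BMI, days of poor health); label 1 = diabetes.
X = [(22.0, 0.0), (24.5, 1.0), (23.1, 2.0), (26.0, 0.0), (25.2, 3.0),
     (21.8, 1.0), (33.0, 10.0), (35.5, 14.0), (31.2, 8.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

# One fold of a cross-validation split: rows 0 and 6 are held out.
val_idx = {0, 6}
X_train = [x for i, x in enumerate(X) if i not in val_idx]
y_train = [label for i, label in enumerate(y) if i not in val_idx]

# Oversampling happens after the split, on the training rows only.
X_bal, y_bal = oversample_training_fold(X_train, y_train)
print(sum(y_bal), len(y_bal) - sum(y_bal))  # 5 5
```

In a scikit-learn workflow the same guarantee is usually obtained by placing the resampler inside a pipeline that is refit on each fold, rather than resampling the full dataset before splitting.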

Section 04

Model Comparison and Performance Results

The project compares models including logistic regression, random forest, KNN, XGBoost, and LightGBM, with recall as the priority metric. A model is considered acceptable only if both accuracy and recall reach at least 70%. XGBoost (recall 78.0%, accuracy 70.6%) and LightGBM (recall 77.8%, accuracy 71.0%) performed best.
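
To see why recall is prioritized and why both criteria are applied together, the following sketch computes accuracy and recall from predictions and checks them against the 70% bars. The helper names and the 100-person example are illustrative assumptions, not taken from the project; only the two reported metric pairs at the end come from the results above.

```python
def accuracy_and_recall(y_true, y_pred):
    """Return (accuracy, recall) for binary labels where 1 = diabetes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, recall


def meets_criteria(accuracy, recall, min_accuracy=0.70, min_recall=0.70):
    """Both bars must be cleared; a high accuracy alone is not enough."""
    return accuracy >= min_accuracy and recall >= min_recall


# Invented example: 100 people, 14 of whom have diabetes.
y_true = [1] * 14 + [0] * 86

# A model that always predicts "no diabetes" looks accurate but finds no one.
always_negative = [0] * 100
print(accuracy_and_recall(y_true, always_negative))  # (0.86, 0.0)

# A screening-style model: finds 11 of 14 cases at the cost of 24 false alarms.
screening = [1] * 11 + [0] * 3 + [1] * 24 + [0] * 62
acc, rec = accuracy_and_recall(y_true, screening)
print(meets_criteria(*accuracy_and_recall(y_true, always_negative)))  # False
print(meets_criteria(acc, rec))  # True

# The reported results clear both bars.
print(meets_criteria(0.706, 0.780), meets_criteria(0.710, 0.778))  # True True
```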


Section 05

Practical Significance and Summary Insights

Application prospects: the model can serve as an initial screening tool that identifies high-risk groups and reduces screening costs. Summary: the project combines classic and modern techniques to address a medical problem and shows why the choice of evaluation metric matters under class imbalance. Insights: it offers medical AI learners a full-process reference, and its rigorous experimental design is worth learning from.


Section 06

Technical Highlights and Reproducibility

The project follows sound engineering practices: complete dependency management, a clear code structure, interactive visualization (ipywidgets), and detailed documentation. The Jupyter Notebook makes the process transparent and traceable, and a dynamic threshold adjustment interface adds educational and practical value.
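
As a rough illustration of what a threshold adjustment control computes, the sketch below recalculates accuracy and recall for different decision thresholds over predicted probabilities. The function name `metrics_at_threshold` and the probabilities are invented for this example; the project's actual interface is not reproduced here.

```python
def metrics_at_threshold(y_true, proba, threshold):
    """Classify as positive when proba >= threshold; return (accuracy, recall)."""
    y_pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return correct / len(y_true), recall


# Invented labels and predicted probabilities for ten people.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.55, 0.80]

# Lowering the threshold catches more true cases but adds false alarms.
for threshold in (0.5, 0.4, 0.3):
    accuracy, recall = metrics_at_threshold(y_true, proba, threshold)
    print(f"threshold={threshold:.1f} accuracy={accuracy:.2f} recall={recall:.2f}")
```

In a notebook, a function like this could be bound to a slider (for example with `ipywidgets.interact`) so that the trade-off updates as the threshold moves.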