# Diabetes Prediction Based on Lifestyle Indicators: A Complete End-to-End Machine Learning Practice

> A complete data science project using BRFSS 2015 health survey data. It combines unsupervised clustering with gradient boosting models to achieve high-recall diabetes risk prediction under class imbalance.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-08T06:15:15.000Z
- Last activity: 2026-06-08T06:19:12.706Z
- Popularity: 147.9
- Keywords: machine learning, diabetes prediction, unsupervised learning, K-Means, Gaussian mixture model, XGBoost, LightGBM, SMOTE, class imbalance, medical AI, health data analysis
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-kittycatkim-lifestyle-diabetes-prediction-using-unsupervised-machine-learning-mo
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-kittycatkim-lifestyle-diabetes-prediction-using-unsupervised-machine-learning-mo

---

## [Introduction] End-to-End Machine Learning Practice for Diabetes Prediction Based on Lifestyle Indicators

This project is an end-to-end data science exercise built on the U.S. CDC's BRFSS 2015 health survey data. It combines unsupervised clustering (K-Means, Gaussian Mixture Model) with gradient boosting models (XGBoost, LightGBM) and uses SMOTE to address class imbalance, achieving high-recall diabetes risk prediction. It covers the entire workflow: data cleaning, feature engineering, model training, and evaluation.

## Project Background and Motivation

Diabetes is a major global public health challenge. Traditional screening, which relies on clinical judgment and a single blood glucose measurement, can easily miss early cases. This project takes a data-driven approach instead: it uses the BRFSS 2015 dataset (over 250,000 records, 21 features covering lifestyle and health status) to build an end-to-end prediction pipeline that supports early identification of high-risk groups.

## Data Processing and Model Construction Methods

1. Data preprocessing: remove 24,206 duplicate records, then separate continuous variables (BMI, health days) from categorical variables (hypertension, smoking status, etc.);
2. Unsupervised learning: cluster the population with K-Means (k=4) and GMM (3 components) to discover health patterns, and visualize the cluster profiles with radar charts;
3. Feature engineering: select features with SelectKBest (ANOVA F-value) and use ColumnTransformer to process numerical and categorical features separately;
4. Class imbalance: synthesize minority-class samples with SMOTE, applied within cross-validation so that no data leaks into the validation folds.
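Step 2 of the list above can be sketched as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the data is synthetic, the four feature columns are hypothetical stand-ins for survey variables, and the scaling step is an assumption. Only the cluster counts (4 for K-Means, 3 for the GMM) come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the survey features (hypothetical columns).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(28, 6, 600),     # BMI-like continuous value
    rng.integers(0, 31, 600),   # count of unhealthy days in the past 30
    rng.integers(0, 2, 600),    # binary flag, e.g. hypertension
    rng.integers(0, 2, 600),    # binary flag, e.g. smoking status
]).astype(float)

# Standardize so that no single feature dominates the distance computation
# (an assumed preprocessing choice).
X_scaled = StandardScaler().fit_transform(X)

# Cluster counts as stated in the text: k=4 for K-Means, 3 components for GMM.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X_scaled)

kmeans_labels = kmeans.labels_
gmm_labels = gmm.predict(X_scaled)

# Per-cluster feature means in original units: one row per cluster,
# which is the kind of profile a radar chart would plot.
kmeans_profile = np.vstack(
    [X[kmeans_labels == c].mean(axis=0) for c in range(4)]
)
print(kmeans_profile.round(2))
```

Each row of `kmeans_profile` summarizes one cluster, so comparing rows shows how the groups differ on each feature. On purely random data such as this the clusters carry no real meaning; the sketch only shows the mechanics.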

## Model Comparison and Performance Results

The project compares logistic regression, random forest, KNN, XGBoost, and LightGBM, with recall as the priority metric. A model is considered acceptable only if it reaches both accuracy ≥70% and recall ≥70%. XGBoost (recall 78.0%, accuracy 70.6%) and LightGBM (recall 77.8%, accuracy 71.0%) performed best.

## Practical Significance and Summary Insights

- Application prospects: the model can serve as an initial screening tool that identifies high-risk groups and reduces screening costs.
- Summary: the project combines classic and modern techniques to address a medical problem, and it highlights how much the choice of evaluation metric matters under class imbalance.
- Insights: it offers medical AI learners a reference for the full workflow, and its rigorous experimental design is worth studying.

## Technical Highlights and Reproducibility

The project follows solid engineering practices: complete dependency management, a clear code structure, interactive visualization (ipywidgets), and detailed documentation. The Jupyter Notebook format keeps the process transparent and traceable. The project also provides a dynamic threshold adjustment interface, which adds both educational and practical value.
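The idea behind threshold adjustment can be sketched with plain NumPy. The function name `metrics_at_threshold`, the toy labels, and the toy probabilities below are all hypothetical; the project's actual interface is not shown in this post. The sketch recomputes recall and accuracy at several decision thresholds and keeps those that meet the 70% criteria on both metrics.

```python
import numpy as np

def metrics_at_threshold(y_true, proba, threshold):
    """Return (recall, accuracy) when predicting positive for proba >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(proba) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = np.mean(y_pred == y_true)
    return float(recall), float(accuracy)

# Toy labels and predicted probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.45, 0.4, 0.55, 0.7, 0.9])

# Lower thresholds flag more people as high-risk: recall rises,
# and accuracy eventually falls as false positives accumulate.
thresholds = [0.3, 0.4, 0.5, 0.65]
acceptable = []
for t in thresholds:
    recall, accuracy = metrics_at_threshold(y_true, proba, t)
    print(f"threshold={t:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
    if recall >= 0.70 and accuracy >= 0.70:
        acceptable.append(t)
```

In a notebook, a function like this could be connected to a slider, for example through `ipywidgets.interact`, so that the metrics update as the threshold moves.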
