Zing Forum

Machine Learning Project for Diabetes Prediction Based on the CRISP-DM Framework

A data science course project from Goethe University Frankfurt that uses the CDC BRFSS 2015 dataset and follows the CRISP-DM methodology to build a diabetes prediction model, covering the complete machine learning process from business understanding to model deployment.

Tags: CRISP-DM, Machine Learning, Diabetes Prediction, Healthcare AI, CDC BRFSS, Data Mining, Supervised Learning, Python, Scikit-Learn, Goethe University
Published 2026-05-11 17:56 | Recent activity 2026-05-11 17:59 | Estimated read 6 min

Section 01

[Introduction] Core Overview of the Diabetes Prediction Machine Learning Project Based on the CRISP-DM Framework

This open-source diabetes prediction project was built by a student team at Goethe University Frankfurt and strictly follows the CRISP-DM methodology. It uses the U.S. CDC BRFSS 2015 dataset (approximately 253,000 samples) to build a three-class prediction model (no diabetes / pre-diabetes / diabetes) and covers the complete machine learning process from business understanding to deployment, which makes it a useful teaching example for medical AI applications.

Section 02

Project Background and Data Source

Diabetes is a chronic disease with global reach, and identifying high-risk groups early is crucial. The data comes from the CDC BRFSS 2015 survey and contains approximately 253,000 samples with 21 health indicators. The target variable has three classes (0: no diabetes, 1: pre-diabetes, 2: diabetes); this multi-class setting is closer to clinical practice than a binary label. The project is carried out under the guidance of Professor Kevin Bauer, which supports its academic rigor.
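
With a three-class target like this, the pre-diabetes class is typically much rarer than the others, so it helps to look at class shares before modeling. The following is a minimal sketch under stated assumptions, not the project's own code: the label list is a made-up toy sample (not the real BRFSS distribution), and the helper names `class_shares` and `balanced_class_weights` are hypothetical. The weighting formula is the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what Scikit-Learn's `class_weight="balanced"` option uses.

```python
from collections import Counter

# Label encoding as stated in the text.
LABELS = {0: "no diabetes", 1: "pre-diabetes", 2: "diabetes"}


def class_shares(y):
    """Return the fraction of samples that fall into each class."""
    counts = Counter(y)
    return {label: counts[label] / len(y) for label in LABELS}


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# Made-up toy labels for illustration only, NOT the real BRFSS distribution.
toy_y = [0] * 80 + [1] * 5 + [2] * 15
shares = class_shares(toy_y)
weights = balanced_class_weights(toy_y)

for label, name in LABELS.items():
    print(f"{name:>12}: share={shares[label]:.2f}, weight={weights[label]:.2f}")
```

Rare classes receive larger weights, so a model trained with them is penalized more for missing a pre-diabetes case than for missing a majority-class case.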

Section 03

Systematic Practice of the CRISP-DM Methodology

CRISP-DM has six phases, and the project documents each one in its own Jupyter Notebook:

  1. Business Understanding: Define the goal (a supervised learning model that predicts diabetes risk) and the success criteria (accuracy, recall, etc.);
  2. Data Understanding: Run exploratory data analysis (EDA) to identify issues such as class imbalance and missing values;
  3. Data Preparation: Cleaning, encoding, standardization, and dimensionality reduction;
  4. Modeling: Try logistic regression, random forest, gradient-boosted trees, neural networks, etc., with hyperparameter tuning and cross-validation, while emphasizing interpretability;
  5. Evaluation: Assess the models with the confusion matrix, ROC curves, etc., and consider deployment feasibility;
  6. Deployment: Provide complete documentation and reproducible code as a foundation for practical use.
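
The success criteria and the evaluation phase both name accuracy, recall, and the confusion matrix. A small sketch of how these relate for a three-class target is shown here; it is an illustration written in plain Python, not the project's implementation (which would more likely call Scikit-Learn's `confusion_matrix` and `recall_score`). The labels and predictions are invented toy values, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true, pred in zip(y_true, y_pred):
        matrix[true][pred] += 1
    return matrix


def accuracy(matrix):
    """Share of all samples that sit on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix):
    """For each true class, the share of its samples predicted correctly."""
    recalls = []
    for i, row in enumerate(matrix):
        support = sum(row)
        recalls.append(row[i] / support if support else 0.0)
    return recalls


# Invented toy data: 0 = no diabetes, 1 = pre-diabetes, 2 = diabetes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 2, 0]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
recalls = per_class_recall(cm)
macro_recall = sum(recalls) / len(recalls)

print("confusion matrix:", cm)
print("accuracy:", acc)
print("per-class recall:", recalls)
print("macro recall:", macro_recall)
```

In this toy case the accuracy is 0.7 even though every pre-diabetes sample is missed, which is why recall is worth tracking per class on imbalanced data.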

Section 04

Technical Architecture and Implementation Details

The project follows a conventional layout and toolchain:

  • Code organization: notebooks/ holds the analysis notebooks, src/ the Python modules, data/ the datasets, and results/ the trained models and outputs;
  • Technology stack: the Python ecosystem, with Scikit-Learn at the core and possibly Keras/TensorFlow for deep learning;
  • Version control: Git, with branch management and pull-request review;
  • Dependencies: listed in requirements.txt so the environment is easy to reproduce.
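
The directory layout described above could be scaffolded as follows. This is only a sketch: the folder names come from the text, while the function name `create_skeleton`, the project folder name, and the choice to exclude `data/` through a `.gitignore` entry are assumptions made for illustration. The demo writes into a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

# Folder names as listed in the text.
LAYOUT = ("notebooks", "src", "data", "results")


def create_skeleton(root):
    """Create the project folders plus empty dependency and ignore files."""
    root = Path(root)
    for name in LAYOUT:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Dependencies are listed here so the environment can be rebuilt.
    (root / "requirements.txt").touch()
    # Assumption: raw data is kept out of Git via an ignore rule.
    (root / ".gitignore").write_text("data/\n")
    return root


# Demo in a throwaway directory; the project name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    project_root = create_skeleton(Path(tmp) / "diabetes-project")
    found = sorted(p.name for p in project_root.iterdir())
    ignore_rule = (project_root / ".gitignore").read_text()

print(found)
```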

Section 05

Special Considerations for Healthcare AI

Healthcare AI needs to meet high requirements:

  1. Privacy Protection: BRFSS data is de-identified, and the raw data is kept out of version control;
  2. Model Fairness: Check that predictions are comparably accurate across age, gender, and race groups to avoid bias;
  3. Interpretability: Use techniques such as SHAP values and feature importance to help doctors understand the basis of each prediction;
  4. Clinical Practicality: All features are routine physical examination indicators, which increases the model's practical value.
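
One simple way to make the fairness requirement concrete is to compare recall for the diabetes class across subgroups. The sketch below is an assumption about how such a check might look, not the project's actual method: the group labels "A" and "B", the toy predictions, and the helper names `recall_by_group` and `recall_gap` are all hypothetical.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups, positive=2):
    """Recall for the `positive` class, computed separately per group."""
    hits = defaultdict(int)
    support = defaultdict(int)
    for true, pred, group in zip(y_true, y_pred, groups):
        if true == positive:
            support[group] += 1
            if pred == positive:
                hits[group] += 1
    return {group: hits[group] / support[group] for group in support}


def recall_gap(recalls):
    """Difference between the best and worst served group."""
    return max(recalls.values()) - min(recalls.values())


# Invented toy data: class 2 = diabetes; "A" and "B" stand in for
# demographic groups such as age bands.
y_true = [2, 2, 2, 2, 0, 2, 2, 2, 2, 0]
y_pred = [2, 2, 2, 0, 0, 2, 0, 0, 0, 0]
groups = ["A"] * 5 + ["B"] * 5

group_recalls = recall_by_group(y_true, y_pred, groups)
gap = recall_gap(group_recalls)

print(group_recalls)
print("recall gap:", gap)
```

A large gap suggests the model misses far more true cases in one group than in another, which would warrant a closer look before any clinical use.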

Section 06

Educational Value and Learning Resources

As a course project, it provides a structured learning path:

  • Recommends a classic textbook: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow;
  • Links to CRISP-DM guidelines, emphasizing the importance of standard processes;
  • Demonstrates industry practices such as Git collaboration, branch development, and code review to help students adapt to the work environment.

Section 07

Extended Applications and Summary

Extension directions: The approach can be transferred to other chronic diseases, such as cardiovascular disease. Technically, the project could explore deep learning, AutoML, or federated learning; on the application side, the model could be integrated into EHR systems or combined with wearable devices for real-time early warning.

Summary: The project shows how CRISP-DM applies to a medical scenario and serves as a complete, end-to-end data science example. It is a useful reference for learners and practitioners, and it addresses a problem with clear social relevance.