Machine Learning-Based Analysis of Behavioral Risk Factors for Tobacco Use: A Comparative Study of Multiple Algorithms

A comprehensive data science project that uses multiple machine learning algorithms to analyze behavioral risk factors related to tobacco use, and explores the optimal prediction scheme by comparing the performance of different models.

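Machine Learning, Public Health, Tobacco Use, Risk Factors, Data Analysis, Random Forest, Support Vector Machine, Data Science, Health Monitoring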

Published 2026-06-16 10:14 | Recent activity 2026-06-16 10:21 | Estimated read 5 min

Section 01

Introduction: Machine Learning-Based Comparative Study of Behavioral Risk Factors for Tobacco Use

This project is an end-to-end data science workflow that applies multiple machine learning algorithms to behavioral risk factors for tobacco use and predicts the "high confidence limit" indicator (the upper bound of a reported confidence interval). The data comes from the Behavioral Risk Factor Surveillance System (BRFSS) and covers 2011 to the present. In the comparison, Random Forest and Support Vector Machine (SVM) performed best. The results are intended to support public health decision-making, medical research, and education.

Section 02

Research Background: Intersection of Public Health and Data Science

Tobacco use is one of the leading causes of preventable disease and premature death worldwide, with more than 8 million people dying from tobacco-related diseases each year. Traditional epidemiology relies mainly on classical statistical methods. Machine learning adds the ability to model complex nonlinear relationships in high-dimensional data, and cross-validation gives a more robust estimate of how well a model generalizes.

Section 03

Technical Methods: Complete Machine Learning Workflow

Data Preprocessing

Handle missing values, outliers, and duplicate records; encode categorical variables (label or one-hot encoding); normalize the data.
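The preprocessing steps listed above can be sketched as follows. This is a minimal illustration using pandas, not the project's actual code: the column names (`age_group`, `sample_size`, `data_value`) and the toy records are hypothetical, and the specific choices (IQR clipping for outliers, median and mode imputation, z-score normalization) are assumptions, since the source does not say which techniques were used.

```python
import pandas as pd

# Toy records; column names are hypothetical, not the real BRFSS schema.
raw = pd.DataFrame({
    "age_group": ["18-24", "25-44", "25-44", None, "65+", "25-44"],
    "sample_size": [120.0, 340.0, 340.0, 95.0, None, 5000.0],
    "data_value": [18.2, 21.5, 21.5, 15.1, 9.8, 20.3],
})
numeric_cols = ["sample_size", "data_value"]

# 1. Remove exact duplicate rows (the second and third rows are identical).
clean = raw.drop_duplicates().reset_index(drop=True)


def clip_iqr(series):
    """Clip values to the 1.5 * IQR fences, a common outlier rule."""
    q1, q3 = series.quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    return series.clip(lower=q1 - fence, upper=q3 + fence)


# 2. Limit the influence of outliers (assumed rule: IQR clipping).
clean[numeric_cols] = clean[numeric_cols].apply(clip_iqr)

# 3. Fill missing values: median for numeric, most frequent for categorical.
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())
clean["age_group"] = clean["age_group"].fillna(clean["age_group"].mode()[0])

# 4. One-hot encode the categorical column.
encoded = pd.get_dummies(clean, columns=["age_group"], dtype=float)

# 5. Normalize numeric columns to zero mean and unit variance.
means = encoded[numeric_cols].mean()
stds = encoded[numeric_cols].std(ddof=0)
encoded[numeric_cols] = (encoded[numeric_cols] - means) / stds

print(encoded.round(2))
```

In a real pipeline the imputation and scaling statistics would be computed on the training folds only, so that no information leaks from the held-out data.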

Exploratory Data Analysis (EDA)

Visualize using Matplotlib and Seaborn to understand distributions, correlations, patterns, and anomalies.

Dimensionality Reduction

Use PCA to project the features onto a smaller set of uncorrelated components, which removes multicollinearity among the inputs and reduces computational cost.
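A small sketch of this step, assuming scikit-learn's `PCA` (the source does not state which implementation was used). The data is synthetic: four features, two of which are near copies of the other two, so the inputs are strongly collinear. The 95% variance threshold is an illustrative choice, not a value taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=(n, 2))

# Four features; columns 1 and 3 are noisy copies of columns 0 and 2.
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=n),
    base[:, 1],
    base[:, 1] + 0.05 * rng.normal(size=n),
])

# PCA is scale-sensitive, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that reach
# that share of the total variance.
pca = PCA(n_components=0.95)
Z = pca.fit_transform(X_scaled)

print("components kept:", pca.n_components_)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# The projected components are uncorrelated with one another.
component_corr = np.corrcoef(Z, rowvar=False)
print("correlation between components:", round(float(component_corr[0, 1]), 6))
```

The trade-off is interpretability: each component is a linear mix of the original features, so it no longer maps to a single risk factor.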

Model Selection

Implement multiple types of algorithms including regression (Linear/Lasso/Ridge), classification (Logistic Regression/Naive Bayes/KNN/Decision Tree/Random Forest/SVM), neural networks (Perceptron/MLP), and clustering (K-Means/K-Medoids).
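The classification part of such a comparison could look roughly like this. It is a sketch under stated assumptions: scikit-learn as the modeling library, a synthetic binary target from `make_classification` standing in for the real data, and default hyperparameters throughout. The scores it prints say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a scaler so the
# scaler is refit inside each cross-validation fold.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Naive Bayes": GaussianNB(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean 5-fold cross-validated accuracy per model.
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```

Using the same folds and the same metric for every model is what makes the ranking comparable.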

Evaluation

Use k-fold cross-validation, evaluate with multiple metrics (accuracy/precision/recall/F1), and visualize confusion matrices.
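To make the metrics concrete, the sketch below derives accuracy, precision, recall, and F1 from the four cells of a binary confusion matrix, and shows how k-fold splitting partitions the sample indices. It uses only the Python standard library. The labels are invented for illustration, and the helper names (`confusion_counts`, `classification_metrics`, `kfold_indices`) are hypothetical. In practice a library such as scikit-learn provides equivalents.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from the counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Invented labels: 4 positives, 6 negatives, 2 mistakes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
folds = list(kfold_indices(n_samples=10, k=3))

print("tp, fp, fn, tn:", counts)
print(metrics)
print("test fold sizes:", [len(test) for _, test in folds])
```

Every sample appears in exactly one test fold, so averaging the per-fold metrics uses each record once for evaluation.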

Section 04

Core Evidence: Model Performance Comparison and Findings

Random Forest and Support Vector Machine achieved the highest accuracy. Random Forest reduces overfitting through ensemble learning and captures feature interactions; SVM handles nonlinear relationships via the kernel trick. The performance gaps between algorithms reflect their characteristics: tree-based models capture nonlinear interactions well, linear models are easier to interpret, and neural networks need more data and more hyperparameter tuning.

Section 05

Project Value: Applications in Public Health and Education

Technical Highlights

Complete ML workflow, multiple algorithm comparisons, equal emphasis on code and documentation (Python scripts + Notebooks), reproducibility (requirements.txt + LICENSE).

Application Value

Public health decision-making: identify high-risk groups, predict trends, evaluate intervention effects; medical research: generate hypotheses, identify variables; education: real data processing workflow, algorithm comparison examples.

Section 06

Limitations and Future Directions

Current Limitations

Class imbalance hurts model performance; feature engineering could be improved; hyperparameter tuning has been limited.
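One common way to address the class imbalance limitation is to weight classes inversely to their frequency and to stratify the cross-validation folds. The sketch below illustrates that idea with scikit-learn on synthetic data with an assumed 90/10 class split. It is not something the project is stated to have done, and the choice of Logistic Regression is only for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=4,
    weights=[0.9, 0.1],
    random_state=0,
)

# "balanced" weights are n_samples / (n_classes * class_count),
# so the minority class receives the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print("class weights:", dict(zip(classes.tolist(), weights.round(2).tolist())))

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Compare minority-class recall with and without class weighting.
recalls = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    scores = cross_val_score(model, X, y, cv=cv, scoring="recall")
    recalls[label] = scores.mean()

for label, value in recalls.items():
    print(f"{label:10s} minority recall = {value:.3f}")
```

Weighting usually trades some precision for recall on the minority class, so both metrics should be reported alongside accuracy.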

Improvement Directions

Integrate XGBoost/LightGBM; explore deep learning; add time series analysis; apply causal inference to understand the mechanisms through which risk factors act.