Zing Forum

Road Accident Risk Prediction: A Comparative Study of Nine Machine Learning Models

A machine learning study on road accident risk prediction compared nine models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

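Tags: Machine Learning, Traffic Accident Prediction, Linear Regression, XGBoost, Explainable AI, Risk Assessment, Feature Engineering, SHAP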
Published 2026-06-11 02:15 | Recent activity 2026-06-11 02:23 | Estimated read 13 min

Section 01

Introduction to Road Accident Risk Prediction Research

Original Authors: Kaumindi Herath, Amasha Fernando, Saviru Mendis, Dilmith Yahathugoda
Source: GitHub (Link)
Publication Date: 2026-06-10
Course: DS-3003 Machine Learning | Group 11

Core Insight: This study compares the road accident risk prediction performance of nine machine learning models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

Section 02

Research Background and Motivation

Road traffic accidents are one of the leading causes of casualties worldwide. According to the World Health Organization, approximately 1.3 million people die from road traffic accidents each year, and tens of millions are injured. Accurate prediction of road accident risk not only has academic research value but also provides practical guidance for public policy formulation, road design, and driver education.

This study was conducted by four data science students to identify key environmental and structural factors affecting road accident risk and evaluate the prediction performance of various machine learning models. The core question of the study is: Among numerous advanced machine learning algorithms, which model can achieve the best balance between accuracy and interpretability?

Section 03

Dataset and Feature Overview

Data Source and Scale

The study uses the Simulated Roads Accident Data dataset from Kaggle, which is under the CC0 public domain license. The dataset contains approximately 112,000 records, merged from three CSV files (2k, 10k, 100k).
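
The merge step can be sketched as follows, assuming the three files share the same header. The file names are hypothetical placeholders rather than the dataset's actual names, and the standard-library csv module stands in for whatever tooling the authors used.

```python
import csv
from pathlib import Path


def merge_csv_files(paths):
    """Concatenate rows from CSV files that share one header."""
    merged, header = [], None
    for path in paths:
        with Path(path).open(newline="") as handle:
            reader = csv.DictReader(handle)
            if header is None:
                header = reader.fieldnames
            elif reader.fieldnames != header:
                raise ValueError(f"{path} has a different header")
            merged.extend(reader)  # each row becomes a dict keyed by column name
    return merged


# Hypothetical file names; substitute the real paths from the download.
# records = merge_csv_files(["data_2k.csv", "data_10k.csv", "data_100k.csv"])
```

Checking the header before appending guards against silently stacking files whose columns differ.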

Target Variable

The model's prediction target is accident_risk—a continuous risk score ranging from 0 (low risk) to 1 (high risk).

Feature List

Feature                 Type         Description
road_type               Categorical  Road type: Highway, Urban, Rural
num_lanes               Numerical    Number of lanes
speed_limit             Numerical    Speed limit (mph)
curvature               Numerical    Degree of road curvature (0-1)
road_signs_present      Binary       Presence of traffic signs
weather                 Categorical  Weather: Clear, Rainy, Foggy
lighting                Categorical  Lighting conditions: Daytime, Nighttime, Dim
time_of_day             Categorical  Time of day: Morning, Afternoon, Evening
holiday                 Binary       Whether it is a holiday
school_season           Binary       Whether it is during the school term
public_road             Binary       Whether it is a public road
num_reported_accidents  Numerical    Number of historical accidents on the road segment

Section 04

Research Methods

Exploratory Data Analysis (EDA)

  • Visualization of feature distributions (histograms, box plots)
  • Correlation analysis between features
  • Scatter plots to explore relationships between features and the target variable

Feature Engineering

  • Binary Feature Construction: Create a high_speed flag to identify road segments with high speed limits
  • One-Hot Encoding: Perform one-hot encoding on categorical variables and remove reference categories to avoid multicollinearity
  • Clustering Analysis: Use K-Means for road segment clustering, but ultimately choose a global model instead of cluster-specific models
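
The first two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the 60 mph threshold for high_speed is assumed (the source does not give the cut-off), and the function name and the choice of reference levels are hypothetical.

```python
def engineer_features(record, categories, speed_threshold=60):
    """Return a numeric feature dict for one road-segment record.

    categories maps each categorical column to its list of levels;
    the first level is the reference category and gets no column.
    """
    features = {
        "curvature": float(record["curvature"]),
        "speed_limit": float(record["speed_limit"]),
        # Assumed threshold: the source does not state the cut-off it used.
        "high_speed": int(float(record["speed_limit"]) >= speed_threshold),
    }
    for column, levels in categories.items():
        for level in levels[1:]:  # skip the reference category
            features[f"{column}_{level}"] = int(record[column] == level)
    return features


CATEGORIES = {
    "road_type": ["Highway", "Urban", "Rural"],
    "weather": ["Clear", "Rainy", "Foggy"],
}

row = {"curvature": "0.8", "speed_limit": "70", "road_type": "Rural", "weather": "Clear"}
print(engineer_features(row, CATEGORIES))
```

Because Clear is the reference level here, a clear-weather segment has both weather columns set to 0. Dropping one level per variable keeps the dummy columns from summing to a constant, which is the multicollinearity the encoding step is meant to avoid.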

Model Comparison

The study compares nine machine learning models: Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Regression Tree, Random Forest, XGBoost, CatBoost, LightGBM.

Evaluation Metrics

  • MAE (Mean Absolute Error): Average absolute difference between predicted and actual values
  • RMSE (Root Mean Squared Error): Metric more sensitive to large errors
  • R² (Coefficient of Determination): Proportion of variance in the target variable explained by the model

In addition, training and test set performance are compared to detect overfitting.
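
The three metrics can be written out directly from their definitions. The sketch below uses only the standard library, and the risk scores are made up for illustration; they are not values from the study.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; squaring penalizes large errors more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up risk scores for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.70]
print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Computing the same metrics on the training set and the test set, then comparing the two, gives the overfitting check: a model that scores much better on training data than on held-out data is fitting noise.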

Section 05

Research Results and Model Performance

Key Risk Factors

Through feature importance analysis and SHAP value interpretation, the following key risk factors were identified:

  1. Road Curvature: The strongest predictor—higher curvature leads to higher risk
  2. Speed Limit: Strong positive correlation with risk
  3. Nighttime Lighting: Reduced visibility significantly increases risk
  4. Adverse Weather: Foggy and rainy conditions increase risk

Model Performance Comparison

Model                 MAE     RMSE    R²
Linear Regression ✅  0.0502  0.0632  0.8740
Ridge Regression      0.0502  0.0632  0.8740
Lasso                 0.0502  0.0632  0.8740
CatBoost              0.0503  0.0632  0.8739
Elastic Net           0.0503  0.0633  0.8737
XGBoost               0.0040  0.0633  0.8735
LightGBM              0.0509  0.0641  0.8704
Random Forest         0.0542  0.0681  0.8539

Core Findings

Standard linear regression emerged as the preferred model: it ties Ridge and Lasso for the highest R² (0.8740) and the lowest errors, and it shows no signs of overfitting. This challenges the assumption that more complex models are always better. The advantages of linear regression include strong interpretability, fast training, good generalization ability, and high stability.

Overfitting Analysis

  • Linear models (Linear Regression, Ridge Regression, etc.) show no overfitting
  • Tree models (Random Forest, XGBoost, etc.) show slight signs of overfitting
  • Regression Tree performance lags behind ensemble methods

Section 06

Interpretability Analysis

Coefficient Magnitude Analysis

Linear regression coefficients directly reflect the marginal contribution of each feature to risk. The most influential features are identified through coefficient magnitude plots.
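
A small sketch of what "marginal contribution" means in practice. The coefficients below are hypothetical values chosen for illustration, not the ones fitted in the study, and the encoded feature names are assumed.

```python
# Hypothetical coefficients for illustration; not the values fitted in the study.
INTERCEPT = 0.10
COEFFICIENTS = {
    "curvature": 0.30,
    "high_speed": 0.12,
    "lighting_Nighttime": 0.08,
    "weather_Foggy": 0.05,
    "num_lanes": -0.01,
}


def predict(features):
    """Linear model: intercept plus the coefficient-weighted sum of features."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())


def rank_by_magnitude(coefficients):
    """Order features by absolute coefficient, largest first."""
    return sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)


base = {"curvature": 0.5, "high_speed": 0, "lighting_Nighttime": 0,
        "weather_Foggy": 0, "num_lanes": 2}
night = dict(base, lighting_Nighttime=1)

# Changing one feature by one unit shifts the prediction by its coefficient.
print(predict(night) - predict(base))
print(rank_by_magnitude(COEFFICIENTS))
```

Ranking by magnitude is only meaningful when features are on comparable scales: a coefficient on a 0-1 feature such as curvature and one on a count such as num_lanes are not directly comparable unless the inputs are standardized first.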

SHAP Value Analysis

SHAP values provide fine-grained interpretation:

  • The contribution of each feature to each individual prediction
  • The relationship between a feature's value and the direction of its contribution (positive or negative)
  • Global feature importance ranking
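
For a linear model with features treated as independent, SHAP values have a closed form: each feature's contribution is its coefficient times the feature's deviation from the background mean. The sketch below illustrates that special case with hypothetical coefficients and data; it is not the shap library and not necessarily how the study computed its values.

```python
def linear_shap_values(coefficients, background, instance):
    """SHAP values for a linear model, assuming independent features.

    Each value is coefficient * (feature value - background mean).
    """
    n = len(background)
    means = {name: sum(row[name] for row in background) / n for name in coefficients}
    return {name: coefficients[name] * (instance[name] - means[name])
            for name in coefficients}


def predict(intercept, coefficients, row):
    """Linear model prediction for one row."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


# Hypothetical model and data for illustration only.
INTERCEPT = 0.10
COEFFICIENTS = {"curvature": 0.30, "lighting_Nighttime": 0.08}
background = [
    {"curvature": 0.2, "lighting_Nighttime": 0},
    {"curvature": 0.4, "lighting_Nighttime": 1},
    {"curvature": 0.6, "lighting_Nighttime": 0},
    {"curvature": 0.8, "lighting_Nighttime": 1},
]
instance = {"curvature": 0.9, "lighting_Nighttime": 1}

shap_values = linear_shap_values(COEFFICIENTS, background, instance)
baseline = sum(predict(INTERCEPT, COEFFICIENTS, row) for row in background) / len(background)

# Local accuracy: baseline plus the sum of SHAP values equals the prediction.
print(shap_values)
print(baseline + sum(shap_values.values()), predict(INTERCEPT, COEFFICIENTS, instance))
```

The sign of each value gives the direction of the contribution for that prediction, and averaging absolute values over many predictions gives a global importance ranking.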

Permutation Importance

Permutation importance randomly shuffles one feature's values at a time and measures the resulting drop in performance, giving a model-agnostic measure of feature importance. The results are consistent with the SHAP and coefficient analyses.
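
The procedure can be sketched in a few lines. The toy model, the data, and the use of MAE as the score are all assumptions made for illustration; the study's own implementation is not shown in the source.

```python
import random


def mean_abs_error(model, rows, targets):
    """MAE of a callable model over a list of feature dicts."""
    return sum(abs(model(row) - t) for row, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, feature, n_repeats=10, seed=0):
    """Average increase in MAE after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [dict(row, **{feature: value}) for row, value in zip(rows, shuffled)]
        increases.append(mean_abs_error(model, permuted, targets) - baseline)
    return sum(increases) / n_repeats


# Hypothetical stand-in model: risk depends on curvature and ignores num_lanes.
def toy_model(row):
    return 0.1 + 0.6 * row["curvature"]


data_rng = random.Random(42)
rows = [{"curvature": data_rng.random(), "num_lanes": data_rng.randint(1, 4)}
        for _ in range(200)]
targets = [toy_model(row) for row in rows]

for name in ("curvature", "num_lanes"):
    print(name, permutation_importance(toy_model, rows, targets, name))
```

Shuffling a feature the model ignores leaves the error unchanged, while shuffling one it relies on raises the error. Because only predictions are needed, the same function works for any model exposed as a callable.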

Section 07

Research Limitations and Future Directions

Data Limitations

  1. Synthetic Data: Cannot fully reflect real-world complexity
  2. Geographic Limitation: No geographic location annotations, so regional differences cannot be analyzed
  3. Time Dimension: Lack of time series information, so trend analysis cannot be performed

Model Limitations

  1. Static Prediction: Does not consider dynamic factors such as real-time traffic flow
  2. Causal Relationship: The model captures correlations, which do not by themselves establish causation
  3. Extreme Events: Samples of high-risk events may be insufficient

Future Improvement Directions

  1. Real Data Validation: Validate the model on real datasets
  2. Spatio-Temporal Modeling: Introduce time and spatial features
  3. Deep Learning: Try neural networks that capture feature interactions
  4. Real-Time Deployment: Build API services to support real-time risk scoring
  5. Intervention Strategies: Design safety intervention measures based on model insights

Section 08

Implications for Practitioners and Conclusion

Implications

  1. Simplicity First: Use linear regression to establish a baseline first; if it meets requirements, there is no need for complex models
  2. Value of Interpretability: In safety-critical fields, interpretability matters more than marginal gains in accuracy
  3. Comprehensive Evaluation: Combine multiple metrics with a train/test comparison to avoid selecting models that generalize poorly
  4. Domain Knowledge: Model results need to be checked against domain expertise

Conclusion

This study demonstrates a complete data science workflow and emphasizes the value of simple tools. For beginners, it is an excellent learning example: clear documentation, complete code, honest analysis, and emphasis on interpretability.