Road Accident Risk Prediction: A Comparative Study of Nine Machine Learning Models

A machine learning study on road accident risk prediction compared nine models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

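Tags: Machine Learning, Traffic Accident Prediction, Linear Regression, XGBoost, Explainable AI, Risk Assessment, Feature Engineering, SHAP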

Published 2026-06-11 02:15 | Recent activity 2026-06-11 02:23 | Estimated read 13 min

Section 01

Introduction to Road Accident Risk Prediction Research

Original Authors: Kaumindi Herath, Amasha Fernando, Saviru Mendis, Dilmith Yahathugoda
Source: GitHub (Link)
Publication Date: 2026-06-10
Course: DS-3003 Machine Learning | Group 11

Core Insight: This study compares the road accident risk prediction performance of nine machine learning models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

Section 02

Research Background and Motivation

Road traffic accidents are one of the leading causes of casualties worldwide. According to the World Health Organization, approximately 1.3 million people die from road traffic accidents each year, and tens of millions are injured. Accurate prediction of road accident risk not only has academic research value but also provides practical guidance for public policy formulation, road design, and driver education.

This study was conducted by four data science students to identify key environmental and structural factors affecting road accident risk and evaluate the prediction performance of various machine learning models. The core question of the study is: Among numerous advanced machine learning algorithms, which model can achieve the best balance between accuracy and interpretability?

Section 03

Dataset and Feature Overview

Data Source and Scale

The study uses the Simulated Roads Accident Data dataset from Kaggle, which is under the CC0 public domain license. The dataset contains approximately 112,000 records, merged from three CSV files (2k, 10k, 100k).

Target Variable

The model's prediction target is accident_risk—a continuous risk score ranging from 0 (low risk) to 1 (high risk).

Feature List

Feature	Type	Description
road_type	Categorical	Road type: Highway, Urban, Rural
num_lanes	Numerical	Number of lanes
speed_limit	Numerical	Speed limit (mph)
curvature	Numerical	Degree of road curvature (0-1)
road_signs_present	Binary	Presence of traffic signs
weather	Categorical	Weather: Clear, Rainy, Foggy
lighting	Categorical	Lighting conditions: Daytime, Nighttime, Dim
time_of_day	Categorical	Time of day: Morning, Afternoon, Evening
holiday	Binary	Whether it is a holiday
school_season	Binary	Whether it is during the school term
public_road	Binary	Whether it is a public road
num_reported_accidents	Numerical	Number of historical accidents on the road segment

Section 04

Research Methods

Exploratory Data Analysis (EDA)

Visualization of feature distributions (histograms, box plots)
Correlation analysis between features
Scatter plots to explore relationships between features and the target variable

Feature Engineering

Binary Feature Construction: Create a high_speed flag to identify road segments with high speed limits
One-Hot Encoding: Perform one-hot encoding on categorical variables and remove reference categories to avoid multicollinearity
Clustering Analysis: Use K-Means for road segment clustering, but ultimately choose a global model instead of cluster-specific models
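
The first two steps above can be sketched without any libraries. This is a minimal illustration, not the authors' code: the 60 mph cutoff for high_speed is an assumption (the study does not state its threshold), the example records are made up, and the first level of each categorical variable is arbitrarily taken as the reference category.

```python
# Category levels come from the feature table; the first level in each list
# is treated as the reference category and gets no column of its own.
CATEGORIES = {
    "road_type": ["Highway", "Urban", "Rural"],
    "weather": ["Clear", "Rainy", "Foggy"],
}

# Assumption: the study does not state its cutoff, so 60 mph is a placeholder.
HIGH_SPEED_THRESHOLD = 60


def engineer(row):
    """Return numeric features for one record (a dict keyed by column name)."""
    features = {
        "speed_limit": row["speed_limit"],
        "high_speed": int(row["speed_limit"] >= HIGH_SPEED_THRESHOLD),
    }
    for column, levels in CATEGORIES.items():
        for level in levels[1:]:  # skip the reference level
            features[f"{column}_{level}"] = int(row[column] == level)
    return features


# Made-up example records, not rows from the dataset.
records = [
    {"road_type": "Highway", "speed_limit": 70, "weather": "Clear"},
    {"road_type": "Urban", "speed_limit": 30, "weather": "Rainy"},
]
encoded = [engineer(r) for r in records]
print(encoded[1])
```

Dropping one level per variable matters for linear models: with an intercept, a full set of indicator columns sums to one and is perfectly collinear. With pandas, `pandas.get_dummies(df, drop_first=True)` gives a comparable encoding.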

Model Comparison

The study compares nine machine learning models: Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Regression Tree, Random Forest, XGBoost, CatBoost, LightGBM.

Evaluation Metrics

MAE (Mean Absolute Error): Average absolute difference between predicted and actual values
RMSE (Root Mean Squared Error): Metric more sensitive to large errors
R² (Coefficient of Determination): Proportion of variance in the target variable explained by the model

In addition, training and test set performance are compared to detect overfitting.
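
The three metrics can be written out directly from their definitions. The sketch below uses toy values chosen for illustration (not figures from the study) to show why RMSE is described as more sensitive to large errors: both prediction sets have the same MAE, but one concentrates all of its error in a single point.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy risk scores for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
pred_even = [0.25, 0.35, 0.65, 0.75]   # four small errors of 0.05
pred_outlier = [0.2, 0.4, 0.6, 0.6]    # one large error of 0.2

for name, pred in (("even", pred_even), ("outlier", pred_outlier)):
    print(name, round(mae(y_true, pred), 4), round(rmse(y_true, pred), 4),
          round(r2(y_true, pred), 4))
```

Both prediction sets have an MAE of 0.05, but the RMSE doubles from 0.05 to 0.1 when the error is concentrated in one record, and R² drops from 0.95 to 0.8.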

Section 05

Research Results and Model Performance

Key Risk Factors

Through feature importance analysis and SHAP value interpretation, the following key risk factors were identified:

Road Curvature: The strongest predictor—higher curvature leads to higher risk
Speed Limit: Strong positive correlation with risk
Nighttime Lighting: Reduced visibility significantly increases risk
Adverse Weather: Foggy and rainy conditions increase risk

Model Performance Comparison

Model	MAE	RMSE	R²
Linear Regression ✅	0.0502	0.0632	0.8740
Ridge Regression	0.0502	0.0632	0.8740
Lasso Regression	0.0502	0.0632	0.8740
CatBoost	0.0503	0.0632	0.8739
Elastic Net	0.0503	0.0633	0.8737
XGBoost	0.0040	0.0633	0.8735
LightGBM	0.0509	0.0641	0.8704
Random Forest	0.0542	0.0681	0.8539

Core Findings

Standard linear regression emerged as the preferred model: it had the highest R² and the joint-lowest RMSE (tied with Ridge and Lasso to four decimal places) and showed no signs of overfitting. This challenges the assumption that more complex models are always better. The advantages of linear regression include strong interpretability, fast training, good generalization ability, and high stability.

Overfitting Analysis

Linear models (Linear Regression, Ridge Regression, etc.) show no overfitting
Tree models (Random Forest, XGBoost, etc.) show slight signs of overfitting
Regression Tree performance lags behind ensemble methods
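
One common way to check for overfitting is to compare the same metric on the training and test sets. The sketch below is an assumption-laden illustration rather than the study's procedure: it uses synthetic data and a deliberately memorizing 1-nearest-neighbor predictor (swapped in for the tree models, which are not reproduced here) to show what a train/test gap looks like.

```python
import random


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def nearest_neighbor_predict(train_x, train_y, x):
    """Predict with the label of the closest training point (memorizes the data)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]


rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [0.5 * x + rng.gauss(0, 0.1) for x in xs]  # synthetic, noisy linear target

# Simple hold-out split: first 80 points for training, last 20 for testing.
train_x, test_x = xs[:80], xs[80:]
train_y, test_y = ys[:80], ys[80:]

train_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in train_x]
test_pred = [nearest_neighbor_predict(train_x, train_y, x) for x in test_x]

train_r2 = r2(train_y, train_pred)
test_r2 = r2(test_y, test_pred)
gap = train_r2 - test_r2
print(f"train R2={train_r2:.3f}  test R2={test_r2:.3f}  gap={gap:.3f}")
```

The memorizing model scores a perfect R² on the data it has seen and a lower one on held-out data. A gap close to zero, as reported for the linear models, indicates that training performance carries over to unseen records.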

Section 06

Interpretability Analysis

Coefficient Magnitude Analysis

Linear regression coefficients directly reflect the marginal contribution of each feature to risk. The most influential features are identified through coefficient magnitude plots.
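
As a rough illustration of reading coefficients, the sketch below fits ordinary least squares by solving the normal equations and then ranks features by absolute coefficient. The data and the "true" coefficients (0.5 for curvature, 0.2 for high_speed) are hypothetical values chosen for the demo, not results from the study.

```python
def fit_ols(X, y):
    """Fit y ~ b0 + b1*x1 + ... by solving the normal equations (Gauss-Jordan)."""
    rows = [[1.0] + list(x) for x in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical records: (curvature, high_speed) with a noise-free linear target.
names = ["curvature", "high_speed"]
X = [(0.0, 0), (0.2, 1), (0.5, 0), (0.8, 1), (1.0, 0), (0.3, 1)]
y = [0.1 + 0.5 * c + 0.2 * h for c, h in X]

intercept, *coefs = fit_ols(X, y)
ranking = sorted(zip(names, coefs), key=lambda pair: abs(pair[1]), reverse=True)
print(round(intercept, 6), [(n, round(c, 6)) for n, c in ranking])
```

Coefficient magnitudes are only directly comparable when the features share a scale. Here both range over 0 to 1; for a feature such as speed_limit in mph, standardizing first would be needed before ranking.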

SHAP Value Analysis

SHAP values provide fine-grained interpretation:

Contribution degree of each feature in each prediction
Relationship between feature values and contribution direction (positive/negative)
Global feature importance ranking
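
For a linear model, and under the assumption that features are independent, the SHAP value of feature i for a record x reduces to coef_i * (x_i - mean(x_i)). The sketch below computes this by hand rather than with the shap library; the coefficients, intercept, and background records are hypothetical.

```python
# Hypothetical fitted linear model.
INTERCEPT = 0.1
COEFS = {"curvature": 0.5, "high_speed": 0.2}

# Hypothetical background data used to estimate feature means.
background = [
    {"curvature": 0.2, "high_speed": 0},
    {"curvature": 0.4, "high_speed": 1},
    {"curvature": 0.9, "high_speed": 0},
    {"curvature": 0.5, "high_speed": 1},
]


def predict(row):
    return INTERCEPT + sum(COEFS[name] * row[name] for name in COEFS)


means = {name: sum(r[name] for r in background) / len(background) for name in COEFS}
base_value = sum(predict(r) for r in background) / len(background)


def linear_shap(row):
    """Per-feature contribution relative to the average prediction."""
    return {name: COEFS[name] * (row[name] - means[name]) for name in COEFS}


# Local explanation for one record.
record = {"curvature": 0.9, "high_speed": 0}
contributions = linear_shap(record)

# Global importance: mean absolute contribution over the background data.
global_importance = {
    name: sum(abs(linear_shap(r)[name]) for r in background) / len(background)
    for name in COEFS
}
print(contributions, global_importance)
```

The contributions add up to the difference between the record's prediction and the average prediction, which covers all three items in the list above: the size of each contribution, its sign, and a global ranking.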

Permutation Importance

By randomly shuffling feature values and observing performance degradation, it provides a model-agnostic measure of feature importance. The results are consistent with SHAP and coefficient analysis.
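
The shuffling procedure can be sketched as follows. This is a minimal standard-library illustration under stated assumptions: the model is a hypothetical fixed linear function that ignores holiday, the data is synthetic, and importance is measured as the average increase in mean squared error over a few shuffles.

```python
import random


def predict(row):
    # Hypothetical fitted linear model; it deliberately ignores `holiday`.
    return 0.1 + 0.5 * row["curvature"] + 0.003 * row["speed_limit"]


def mse(rows, y):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)


def permutation_importance(rows, y, feature, repeats=5, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mse(rows, y)
    increases = []
    for _ in range(repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        increases.append(mse(permuted, y) - baseline)
    return sum(increases) / repeats


# Synthetic records and a noisy target generated from the model above.
rng = random.Random(42)
rows = [
    {
        "curvature": rng.random(),
        "speed_limit": rng.choice([30, 50, 70]),
        "holiday": rng.randint(0, 1),
    }
    for _ in range(200)
]
y = [predict(r) + rng.gauss(0, 0.02) for r in rows]

importance = {
    f: permutation_importance(rows, y, f)
    for f in ("curvature", "speed_limit", "holiday")
}
for feature, score in sorted(importance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {score:.5f}")
```

Because the procedure only needs predictions, it applies unchanged to tree ensembles or any other model. A feature the model never uses, like holiday here, scores exactly zero.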

Section 07

Research Limitations and Future Directions

Data Limitations

Synthetic Data: Cannot fully reflect real-world complexity
Geographic Limitation: No geographic location annotations, so regional differences cannot be analyzed
Time Dimension: Lack of time series information, so trend analysis cannot be performed

Model Limitations

Static Prediction: Does not consider dynamic factors such as real-time traffic flow
Causal Relationship: The model captures correlations, which do not by themselves establish causation
Extreme Events: Samples of high-risk events may be insufficient

Future Improvement Directions

Real Data Validation: Validate the model on real datasets
Spatio-Temporal Modeling: Introduce time and spatial features
Deep Learning: Try neural networks that capture feature interactions
Real-Time Deployment: Build API services to support real-time risk scoring
Intervention Strategies: Design safety intervention measures based on model insights

Section 08

Implications for Practitioners and Conclusion

Implications

Simplicity First: Use linear regression to establish a baseline first; if it meets requirements, there is no need for complex models
Value of Interpretability: In safety-critical fields, interpretability can matter more than marginal gains in accuracy
Comprehensive Evaluation: Evaluate on several metrics, and compare training and test performance, to avoid selecting a model that generalizes poorly
Domain Knowledge: Model results need to be cross-checked against domain expertise

Conclusion

This study demonstrates a complete data science workflow and emphasizes the value of simple tools. For beginners, it is an excellent learning example: clear documentation, complete code, honest analysis, and emphasis on interpretability.
