# Road Accident Risk Prediction: A Comparative Study of Nine Machine Learning Models

> A machine learning study on road accident risk prediction compared nine models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-10T18:15:24.000Z
- Last activity: 2026-06-10T18:23:47.264Z
- Popularity: 159.9
- Keywords: machine learning, traffic accident prediction, linear regression, XGBoost, explainable AI, risk assessment, feature engineering, SHAP
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-kaumindiherath-road-accident-risk-prediction
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-kaumindiherath-road-accident-risk-prediction
- Markdown source: floors_fallback

---

## Introduction to Road Accident Risk Prediction Research

**Original Authors**: Kaumindi Herath, Amasha Fernando, Saviru Mendis, Dilmith Yahathugoda
**Source**: GitHub ([Link](https://github.com/KaumindiHerath/Road-accident-risk-prediction))
**Publication Date**: 2026-06-10
**Course**: DS-3003 Machine Learning | Group 11

Core Insight: This study compares the road accident risk prediction performance of nine machine learning models using 112,000 synthetic data records, finding that standard linear regression achieves the best balance between interpretability and accuracy.

## Research Background and Motivation

Road traffic accidents are a leading cause of death and injury worldwide. According to the World Health Organization, approximately 1.3 million people die in road traffic accidents each year, and tens of millions are injured. Accurate prediction of road accident risk is therefore of academic interest and also offers practical guidance for public policy, road design, and driver education.

This study was conducted by four data science students to identify key environmental and structural factors affecting road accident risk and evaluate the prediction performance of various machine learning models. The core question of the study is: Among numerous advanced machine learning algorithms, which model can achieve the best balance between accuracy and interpretability?

## Dataset and Feature Overview

### Data Source and Scale
The study uses the *Simulated Roads Accident Data* dataset from Kaggle, which is under the CC0 public domain license. The dataset contains approximately 112,000 records, merged from three CSV files (2k, 10k, 100k).
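The merge step can be illustrated with a minimal sketch that uses only the Python standard library. The helper name `merge_csv_sources` and the tiny in-memory files are assumptions made for illustration; the study's actual file names and loading code are not given here, and the authors may well have used a dataframe library instead.

```python
import csv
import io


def merge_csv_sources(sources):
    """Concatenate rows from several CSV sources that share one header.

    `sources` is an iterable of open text files (or file-like objects).
    Raises ValueError if a later header differs from the first one.
    """
    merged_rows = []
    header = None
    for source in sources:
        reader = csv.DictReader(source)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("CSV headers do not match")
        merged_rows.extend(reader)
    return header, merged_rows


# Tiny in-memory stand-ins for the CSV files (values are made up).
part_a = io.StringIO("curvature,accident_risk\n0.10,0.20\n0.50,0.45\n")
part_b = io.StringIO("curvature,accident_risk\n0.90,0.80\n")
header, rows = merge_csv_sources([part_a, part_b])
```

Checking that the headers match before concatenating guards against silently misaligned columns when the files come from different exports.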

### Target Variable
The model's prediction target is `accident_risk`—a continuous risk score ranging from 0 (low risk) to 1 (high risk).

### Feature List
| Feature | Type | Description |
|------|------|------|
| road_type | Categorical | Road type: Highway, Urban, Rural |
| num_lanes | Numerical | Number of lanes |
| speed_limit | Numerical | Speed limit (mph) |
| curvature | Numerical | Degree of road curvature (0-1) |
| road_signs_present | Binary | Presence of traffic signs |
| weather | Categorical | Weather: Clear, Rainy, Foggy |
| lighting | Categorical | Lighting conditions: Daytime, Nighttime, Dim |
| time_of_day | Categorical | Time of day: Morning, Afternoon, Evening |
| holiday | Binary | Whether it is a holiday |
| school_season | Binary | Whether it is during the school term |
| public_road | Binary | Whether it is a public road |
| num_reported_accidents | Numerical | Number of historical accidents on the road segment |

## Research Methods

### Exploratory Data Analysis (EDA)
- Visualization of feature distributions (histograms, box plots)
- Correlation analysis between features
- Scatter plots to explore relationships between features and the target variable

### Feature Engineering
- **Binary Feature Construction**: Create a `high_speed` flag to identify road segments with high speed limits
- **One-Hot Encoding**: Perform one-hot encoding on categorical variables and remove reference categories to avoid multicollinearity
- **Clustering Analysis**: Cluster road segments with K-Means; the study ultimately chose a single global model over cluster-specific models
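The first two steps above can be sketched in plain Python. This is an illustration only: the 60 mph cut-off for `high_speed` is an assumed value (the threshold used in the study is not stated), and the helper names are hypothetical. The clustering step is not shown.

```python
HIGH_SPEED_THRESHOLD = 60  # mph; assumed cut-off, not taken from the study


def add_high_speed_flag(row, threshold=HIGH_SPEED_THRESHOLD):
    """Return a copy of `row` with a binary `high_speed` feature added."""
    flagged = dict(row)
    flagged["high_speed"] = int(row["speed_limit"] >= threshold)
    return flagged


def one_hot_drop_first(rows, column):
    """One-hot encode `column`, dropping the first category (in sorted
    order) as the reference level to avoid perfectly collinear columns."""
    categories = sorted({row[column] for row in rows})
    reference, kept = categories[0], categories[1:]
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in kept:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded, reference


segments = [
    {"speed_limit": 70, "weather": "Clear"},
    {"speed_limit": 35, "weather": "Foggy"},
    {"speed_limit": 60, "weather": "Rainy"},
]
flagged = [add_high_speed_flag(s) for s in segments]
encoded, reference = one_hot_drop_first(flagged, "weather")
```

With three weather categories, only two indicator columns remain. A row with both indicators at zero corresponds to the reference category, which is why dropping it loses no information.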

### Model Comparison
The study compares nine machine learning models: Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Regression Tree, Random Forest, XGBoost, CatBoost, LightGBM.

### Evaluation Metrics
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and actual values
- **RMSE (Root Mean Squared Error)**: Metric more sensitive to large errors
- **R² (Coefficient of Determination)**: Proportion of variance in the target variable explained by the model

In addition, training and test set performance are compared to detect overfitting.
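For reference, the three metrics can be written out directly from their textbook definitions. The sketch below is a plain-Python illustration with made-up values, not the study's evaluation code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; squaring penalizes large errors more."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up risk scores for illustration only.
y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.3, 0.4, 0.5, 0.8]
```

On this toy input the MAE is 0.05, the RMSE is about 0.0707, and R² is 0.9. RMSE is never smaller than MAE on the same predictions, which is a quick sanity check when reading a results table.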

## Research Results and Model Performance

### Key Risk Factors
Through feature importance analysis and SHAP value interpretation, the following key risk factors were identified:
1. **Road Curvature**: The strongest predictor—higher curvature leads to higher risk
2. **Speed Limit**: Strong positive correlation with risk
3. **Nighttime Lighting**: Reduced visibility significantly increases risk
4. **Adverse Weather**: Foggy and rainy conditions increase risk

### Model Performance Comparison
| Model | MAE | RMSE | R² |
|------|-----|------|-----|
| Linear Regression ✅ | 0.0502 | 0.0632 | **0.8740** |
| Ridge Regression | 0.0502 | 0.0632 | 0.8740 |
| Lasso | 0.0502 | 0.0632 | 0.8740 |
| CatBoost | 0.0503 | 0.0632 | 0.8739 |
| Elastic Net | 0.0503 | 0.0633 | 0.8737 |
| XGBoost | 0.0040 | 0.0633 | 0.8735 |
| LightGBM | 0.0509 | 0.0641 | 0.8704 |
| Random Forest | 0.0542 | 0.0681 | 0.8539 |

### Core Findings
Standard linear regression emerged as the preferred model: it ties with Ridge and Lasso for the highest R² (0.8740) and the lowest RMSE (0.0632), and shows no signs of overfitting. This challenges the assumption that more complex models always perform better. The advantages of linear regression include strong interpretability, fast training, good generalization, and high stability.

### Overfitting Analysis
- Linear models (Linear Regression, Ridge Regression, etc.) show no overfitting
- Tree models (Random Forest, XGBoost, etc.) show slight signs of overfitting
- Regression Tree performance lags behind ensemble methods
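A train/test comparison of this kind can be sketched with NumPy. The data below is randomly generated for illustration, the 80/20 split is an assumed choice, and the helper names are hypothetical. The point is only the pattern: fit on the training split, then compare R² on both splits.

```python
import numpy as np


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_ols(X, y):
    """Least-squares fit with an intercept; returns [intercept, w1, w2, ...]."""
    X1 = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return weights


def predict(weights, X):
    return weights[0] + X @ weights[1:]


# Synthetic stand-in data: a linear signal plus noise (not the study's data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2000, 3))
y = 0.1 + 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.05, size=2000)

# Assumed 80/20 split; the rows are already in random order.
X_train, X_test = X[:1600], X[1600:]
y_train, y_test = y[:1600], y[1600:]

weights = fit_ols(X_train, y_train)
train_r2 = r2_score(y_train, predict(weights, X_train))
test_r2 = r2_score(y_test, predict(weights, X_test))
gap = train_r2 - test_r2  # a large positive gap would suggest overfitting
```

A gap near zero, as expected for a linear model on data like this, indicates that the model generalizes about as well as it fits. A flexible tree ensemble would typically show a larger positive gap.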

## Interpretability Analysis

### Coefficient Magnitude Analysis
Linear regression coefficients directly reflect the marginal contribution of each feature to predicted risk, holding the other features fixed. The most influential features are identified through coefficient magnitude plots.

### SHAP Value Analysis
SHAP values provide fine-grained interpretation:
- Contribution degree of each feature in each prediction
- Relationship between feature values and contribution direction (positive/negative)
- Global feature importance ranking
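For a linear model with features treated as independent, SHAP values have a closed form: the contribution of feature j to one prediction is the coefficient times the feature's deviation from its background mean. The sketch below illustrates that formula with NumPy on made-up coefficients and data. It is not the study's SHAP code, which likely relied on the `shap` library.

```python
import numpy as np


def linear_shap_values(coefficients, X, background):
    """SHAP values for a linear model under feature independence:
    phi[i, j] = coefficients[j] * (X[i, j] - mean of feature j in background).
    """
    return coefficients * (X - background.mean(axis=0))


# Made-up background data and coefficients, for illustration only.
rng = np.random.default_rng(1)
background = rng.uniform(0, 1, size=(500, 3))
coefficients = np.array([0.5, 0.2, -0.1])
intercept = 0.1

X = background[:5]
phi = linear_shap_values(coefficients, X, background)  # per-prediction contributions

predictions = intercept + X @ coefficients
base_value = intercept + background.mean(axis=0) @ coefficients

# Global ranking: mean absolute contribution of each feature.
all_phi = linear_shap_values(coefficients, background, background)
global_importance = np.abs(all_phi).mean(axis=0)
```

The three bullets map onto the arrays above: each row of `phi` gives per-feature contributions for one prediction, the sign of each entry gives the direction, and `global_importance` gives the ranking. The rows of `phi` sum to `predictions - base_value`, which is the additivity property that defines SHAP.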

### Permutation Importance
Permutation importance randomly shuffles the values of one feature at a time and measures the resulting drop in performance, which gives a model-agnostic measure of feature importance. The results are consistent with the SHAP and coefficient analyses.
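A minimal NumPy sketch of the procedure follows. It uses made-up data and a hand-written `predict_fn` standing in for a fitted model. The function name, the choice of R² as the score, and the number of repeats are assumptions for illustration.

```python
import numpy as np


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict_fn, X, y, n_repeats=5, seed=0):
    """Mean drop in R² when each column is shuffled, one column at a time.

    Only `predict_fn` is called, so any fitted model can be plugged in.
    """
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, predict_fn(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            X_shuffled[:, j] = rng.permutation(X[:, j])  # break the feature-target link
            drops.append(baseline - r2_score(y, predict_fn(X_shuffled)))
        importances[j] = np.mean(drops)
    return importances


# Made-up data: the target depends strongly on column 0, weakly on column 1,
# and not at all on column 2.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(1000, 3))
y = 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.02, size=1000)


def predict_fn(features):
    """Stand-in for a fitted model's predict method."""
    return 0.6 * features[:, 0] + 0.2 * features[:, 1]


importances = permutation_importance(predict_fn, X, y)
```

Shuffling a column the model ignores leaves its predictions unchanged, so that feature's importance is exactly zero. Features the model relies on more heavily produce larger drops.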

## Research Limitations and Future Directions

### Data Limitations
1. **Synthetic Data**: Cannot fully reflect real-world complexity
2. **Geographic Limitation**: No geographic location annotations, so regional differences cannot be analyzed
3. **Time Dimension**: Lack of time series information, so trend analysis cannot be performed

### Model Limitations
1. **Static Prediction**: Does not consider dynamic factors such as real-time traffic flow
2. **Causal Relationship**: The models capture correlations, which do not by themselves establish causation
3. **Extreme Events**: Samples of high-risk events may be insufficient

### Future Improvement Directions
1. **Real Data Validation**: Validate the model on real datasets
2. **Spatio-Temporal Modeling**: Introduce time and spatial features
3. **Deep Learning**: Try neural networks that capture feature interactions
4. **Real-Time Deployment**: Build API services to support real-time risk scoring
5. **Intervention Strategies**: Design safety intervention measures based on model insights

## Implications for Practitioners and Conclusion

### Implications
1. **Simplicity First**: Use linear regression to establish a baseline first; if it meets requirements, there is no need for complex models
2. **Value of Interpretability**: In safety-critical fields, interpretability can matter more than marginal gains in accuracy
3. **Comprehensive Evaluation**: Use multiple metrics comprehensively to avoid choosing models with poor generalization ability
4. **Domain Knowledge**: Model results need to be checked against domain expertise

### Conclusion
This study demonstrates a complete data science workflow and emphasizes the value of simple tools. For beginners, it is an excellent learning example: clear documentation, complete code, honest analysis, and emphasis on interpretability.