# Machine Learning for Predicting Marathon Finish Time: A Data Analysis Practice on the 2023 Boston Marathon

> This article introduces a machine learning project based on the 2023 Boston Marathon dataset (26,598 runners), which predicts the full marathon finish time using age group, gender, and half-marathon performance. It compares linear regression, neural networks, and ensemble tree models, ultimately achieving a high-precision prediction with an RMSE of only 9.42 minutes.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-02T22:12:59.000Z
- Last activity: 2026-05-03T01:41:34.160Z
- Popularity: 147.5
- Keywords: machine learning, marathon prediction, regression analysis, Boston Marathon, data cleaning, model comparison, ensemble learning, sports data analytics
- Page link: https://www.zingnex.cn/en/forum/thread/2023
- Canonical: https://www.zingnex.cn/forum/thread/2023

---

## [Introduction] Machine Learning for Predicting Boston Marathon Finish Time: A High-Precision Model Practice

This article is based on real data from 26,598 runners in the 2023 Boston Marathon. It predicts the full marathon finish time from age group, gender, and half-marathon performance, and compares four approaches: a baseline ratio method, linear regression, a neural network, and a Bagged Trees ensemble. The Bagged Trees ensemble achieves the best result, with an RMSE of 9.42 minutes (R²=0.953), but linear regression comes within 0.3 minutes of it, which shows how well simple models perform on this task.

## Project Background and Data Source

The Boston Marathon is one of the World Marathon Majors, and its results data are well suited to this kind of study. This project uses data from 26,598 finishers of the 2023 event. Core features include:
- Age group: 11 age brackets (18-24 to 70+), encoded as a categorical variable
- Gender: binary category (male/female)
- Half-marathon time: half-marathon performance in seconds (the strongest predictive signal)

The target variable is the net finish time (the time elapsed between crossing the start line and crossing the finish line).

## Data Cleaning and Preprocessing Process

1. **Missing Value Handling**: Use MATLAB `rmmissing` to remove records with missing values
2. **Outlier Filtering**:
   - Positive time validation: Keep only records with positive times; zero values come from Did Not Finish (DNF) cases and are excluded
   - Pace ratio constraint: Remove records where the ratio of full marathon time to half-marathon time falls outside the range 1.80-3.20
3. **Feature Engineering**:
   - One-hot encoding for categorical variables (over a fixed set of categories, so the test set cannot introduce an unseen category)
   - Standardization of half-marathon time (using only the training set mean and standard deviation, to prevent data leakage)
4. **Train/Test Split**: The dataset is split into training and test sets in an 80/20 ratio, with a random seed of 42 to ensure reproducibility.
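
The steps above can be sketched in a few lines. The project itself uses a MATLAB script, so the following Python version is only an illustration under stated assumptions: the records are made up, the intermediate age-bracket labels are guessed (the source gives only "18-24 to 70+"), and the use of the sample standard deviation is a choice of this sketch, not something the source specifies.

```python
import random
import statistics

# Assumed bracket labels: the source states only "11 brackets, 18-24 to 70+".
AGE_GROUPS = ["18-24", "25-29", "30-34", "35-39", "40-44", "45-49",
              "50-54", "55-59", "60-64", "65-69", "70+"]
GENDERS = ["F", "M"]

# Hypothetical records: (age_group, gender, half_seconds, full_seconds).
records = [
    ("18-24", "M", 5400.0, 11200.0),
    ("40-44", "F", 6300.0, 13500.0),
    ("70+", "M", 9000.0, 0.0),        # zero finish time, treated as DNF
    ("30-34", "F", 6000.0, 21000.0),  # ratio 3.5, outside 1.80-3.20
    ("50-54", "M", 7200.0, 15000.0),
    ("25-29", "F", 5700.0, 11900.0),
    ("60-64", "M", 8100.0, 17400.0),
]

def is_valid(rec, lo=1.80, hi=3.20):
    """Keep rows with positive times and a plausible full/half ratio."""
    _, _, half, full = rec
    if half <= 0 or full <= 0:
        return False
    return lo <= full / half <= hi

clean = [r for r in records if is_valid(r)]

# 80/20 split with a fixed seed so the split is repeatable.
rng = random.Random(42)
shuffled = clean[:]
rng.shuffle(shuffled)
n_train = len(shuffled) * 8 // 10
train, test = shuffled[:n_train], shuffled[n_train:]

# Standardization statistics come from the training rows only.
train_half = [r[2] for r in train]
mu = statistics.mean(train_half)
sigma = statistics.stdev(train_half)

def one_hot(value, categories):
    """Encode against a fixed category list, so the width never changes."""
    return [1.0 if value == c else 0.0 for c in categories]

def featurize(rec):
    age, gender, half, _ = rec
    return one_hot(age, AGE_GROUPS) + one_hot(gender, GENDERS) + [(half - mu) / sigma]

X_train = [featurize(r) for r in train]
X_test = [featurize(r) for r in test]  # reuses the training mu and sigma
```

Because `mu` and `sigma` are computed before any test row is touched, the test set has no influence on the scaling, which is the leakage concern the list mentions.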

## Model Selection and Training Strategy

Four models are compared:
1. **Baseline Ratio Method**: Assumes full marathon time = half-marathon time × a single average ratio, providing a performance baseline
2. **Linear Regression**: MATLAB `fitlm`; highly interpretable, since the coefficients show each feature's impact
3. **Neural Network**: Three layers with 64, 32, and 16 units and ReLU activation, intended to capture non-linear interactions
4. **Bagged Trees Ensemble**: MATLAB `fitrensemble` with 150 learners and a minimum leaf size of 10, which reduces variance and improves generalization.
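
To make the baseline concrete, here is a minimal Python sketch of the ratio method. The source does not say how the "average coefficient" is computed, so taking the mean of per-runner ratios on the training set is an assumption, as are the sample times and the function names `fit_ratio` and `predict_full`.

```python
import statistics

# Hypothetical training pairs: (half_seconds, full_seconds).
train_pairs = [(5400.0, 11200.0), (6300.0, 13500.0), (7200.0, 15000.0)]

def fit_ratio(pairs):
    """Mean full/half ratio over the training runners (assumed definition)."""
    return statistics.mean([full / half for half, full in pairs])

def predict_full(half_seconds, ratio):
    """Baseline prediction: scale the half-marathon time by one constant."""
    return half_seconds * ratio

ratio = fit_ratio(train_pairs)
predicted_seconds = predict_full(6000.0, ratio)
print(f"ratio={ratio:.3f}, predicted finish={predicted_seconds / 60:.1f} min")
```

The baseline has a single fitted parameter and ignores age group and gender, so any gain from the other three models can be read as the value added by those features and by extra model flexibility.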

## Experimental Results and Key Findings

Test set performance (partial metrics):

| Rank | Model | RMSE (minutes) | R² | ±10-minute Accuracy |
|---|---|---|---|---|
| 1 | Bagged Trees Ensemble | 9.42 | 0.953 | 78% |
| 2 | Linear Regression | 9.70 | 0.951 | 77% |
| 3 | Neural Network | 9.75 | 0.950 | 77% |
| 4 | Baseline Ratio Method | 10.11 | 0.947 | 72% |

Key Findings:
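- The ensemble model leads only narrowly: its RMSE of 9.42 minutes is 0.28 minutes lower than linear regression's 9.70, while the linear model is simpler and easier to interpret
- Half-marathon time is the dominant feature and explains most of the variance on its own; the baseline ratio method, which uses nothing else, already reaches R²=0.947
- Simple models (linear regression) perform comparably to more complex ones, which suggests the relationship between half-marathon and full-marathon times is largely linear.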
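
The three metrics reported in the table can be computed as follows. This is a Python sketch with made-up finish times in minutes; the function names are illustrative, and reading "±10-minute accuracy" as the share of predictions whose absolute error is at most 10 minutes is an assumption about the source's definition.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the inputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def within_band(y_true, y_pred, band=10.0):
    """Share of predictions whose absolute error is at most `band`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= band)
    return hits / len(y_true)

# Hypothetical finish times in minutes.
y_true = [180.0, 210.0, 240.0, 270.0]
y_pred = [185.0, 205.0, 255.0, 268.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f} min")
print(f"R^2: {r_squared(y_true, y_pred):.3f}")
print(f"Within 10 min: {within_band(y_true, y_pred):.0%}")
```

RMSE penalizes large misses heavily, while the band accuracy counts only whether each prediction lands close enough, so the two can rank models differently when a few runners are badly mispredicted.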

## Extended Applications and Technical Implementation

**Binary Classification Extension**: Setting a finish-time threshold such as 3:00 or 3:30 turns the task into a binary decision ("can the runner finish within X hours?"), with accuracy ranging from 94% to 98%.

**Technical Implementation**:
- MATLAB script: `marathon_models.m`, an end-to-end pipeline
- Open-source resources: cleaned dataset, model packages, result tables
- Interactive website: visualizations of model rankings, precision band analysis, residual plots, etc.
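
One way to read the binary classification extension is to threshold the regression output and compare the resulting label with the label derived from the actual finish time. The source does not state whether the project does this or trains a separate classifier, so the Python sketch below is an assumption, as are the sample times, the strict "faster than the threshold" rule, and the name `threshold_accuracy`.

```python
def threshold_accuracy(y_true, y_pred, threshold_minutes):
    """Share of runners whose predicted and actual times fall on the
    same side of the threshold (strictly faster counts as 'within')."""
    matches = sum(
        1
        for t, p in zip(y_true, y_pred)
        if (t < threshold_minutes) == (p < threshold_minutes)
    )
    return matches / len(y_true)

# Hypothetical actual and predicted finish times in minutes.
y_true = [175.0, 182.0, 205.0, 215.0, 240.0]
y_pred = [178.0, 179.0, 208.0, 209.0, 236.0]

acc_3h00 = threshold_accuracy(y_true, y_pred, 180.0)  # 3:00 threshold
acc_3h30 = threshold_accuracy(y_true, y_pred, 210.0)  # 3:30 threshold
print(f"3:00 -> {acc_3h00:.0%}, 3:30 -> {acc_3h30:.0%}")
```

Errors under this scheme concentrate among runners whose actual time is close to the threshold, since a prediction only needs to be off by a few minutes to flip the label.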

## Practical Insights and Project Summary

**Insights**:
- Unit selection: Reporting errors in minutes is more intuitive than in seconds, and the share of predictions within ±X minutes is a more practical measure
- Reproducibility: Fixed random seeds and a standardized pipeline are key to sound engineering practice
- Feature understanding: In-depth analysis of the relationship between half-marathon and full-marathon times matters more than model tuning

**Summary**: The project supports the "simple model first" principle: linear or lightweight ensemble models are the most cost-effective choice for this task. The open-source resources offer a practical reference for sports data science.
