Zing Forum

Machine Learning for Predicting Marathon Finish Time: A Data Analysis Practice on the 2023 Boston Marathon

This article introduces a machine learning project based on the 2023 Boston Marathon dataset (26,598 runners), which predicts the full marathon finish time using age group, gender, and half-marathon performance. It compares linear regression, neural networks, and ensemble tree models, ultimately achieving a high-precision prediction with an RMSE of only 9.42 minutes.

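Tags: machine learning, marathon prediction, regression analysis, Boston Marathon, data cleaning, model comparison, ensemble learning, sports data analysis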
Published 2026-05-03 06:12 | Recent activity 2026-05-03 09:41 | Estimated read 7 min

Section 01

[Introduction] Machine Learning for Predicting Boston Marathon Finish Time: A High-Precision Model Practice

This article is based on real data from 26,598 runners in the 2023 Boston Marathon. It predicts the full marathon finish time from age group, gender, and half-marathon performance, and compares four approaches: a baseline ratio method, linear regression, a neural network, and a Bagged Trees ensemble. The Bagged Trees ensemble performs best, with a test RMSE of 9.42 minutes (R² = 0.953), yet linear regression trails it by less than 0.3 minutes of RMSE, which shows how well simple models perform on this task.


Section 02

Project Background and Data Source

The Boston Marathon is one of the World Marathon Majors, and its results data has high research value. This project uses data from 26,598 finishers of the 2023 event. Core features include:

  • Age group: 11 age brackets (18-24 to 70+), encoded as a categorical variable
  • Gender: binary category (male/female)
  • Half-marathon time: half-marathon performance in seconds (the strongest predictive signal)

The target variable is the net finish time, that is, the time from crossing the start line to crossing the finish line.

Section 03

Data Cleaning and Preprocessing Process

  1. Missing Value Handling: Use MATLAB rmmissing to remove records with missing values
  2. Outlier Filtering:
    • Positive time validation: Exclude zero values from Did Not Finish (DNF) cases
    • Pace ratio constraint: Remove abnormal data where the ratio of full marathon time to half-marathon time is outside the range 1.80-3.20
  3. Feature Engineering:
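    • One-hot encoding for categorical variables, using a fixed category list so that the training and test sets share the same columns
    • Standardization of half-marathon time (using only the training set mean and standard deviation to prevent data leakage)
  4. Train/Test Split: The dataset is split into training and test sets in an 80/20 ratio, with a random seed of 42 to ensure reproducibility. The split has to come before standardization, because the scaling statistics are taken from the training set only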
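
The cleaning and preprocessing steps above can be sketched in a few lines. The project itself uses MATLAB (rmmissing and related functions); the block below is only an illustrative Python sketch using the standard library. The toy records, the tuple layout, and the helper names (`clean`, `train_test_split`, `fit_scaler`, `one_hot`) are assumptions made for this example and do not come from the original script.

```python
import random
import statistics

# Hypothetical toy records: (age_group, gender, half_seconds, full_seconds).
# None marks a missing value; 0 marks a DNF placeholder.
records = [
    ("18-24", "M", 5400, 11200),
    ("40-44", "F", 6300, 13500),
    ("30-34", "M", 4800, None),    # missing finish time
    ("50-54", "F", 7200, 0),       # DNF recorded as zero
    ("35-39", "M", 6000, 25000),   # full/half ratio 4.17, outside 1.80-3.20
    ("25-29", "F", 5700, 11900),
    ("45-49", "M", 6600, 14100),
    ("60-64", "F", 8100, 17800),
]

def clean(rows, lo=1.80, hi=3.20):
    """Drop rows with missing values, non-positive times, or an implausible ratio."""
    kept = []
    for row in rows:
        if any(value is None for value in row):
            continue
        half, full = row[2], row[3]
        if half <= 0 or full <= 0:
            continue
        if not (lo <= full / half <= hi):
            continue
        kept.append(row)
    return kept

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the first test_fraction of rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_scaler(values):
    """Return mean and sample standard deviation of the training values."""
    return statistics.mean(values), statistics.stdev(values)

def one_hot(value, categories):
    """Encode value against a fixed category list (same columns for train and test)."""
    return [1 if value == c else 0 for c in categories]

cleaned = clean(records)
train, test = train_test_split(cleaned)

# Scaling statistics come from the training rows only, then get reused on the test rows.
mu, sigma = fit_scaler([row[2] for row in train])
train_z = [(row[2] - mu) / sigma for row in train]
test_z = [(row[2] - mu) / sigma for row in test]

GENDERS = ["F", "M"]  # fixed list, declared up front rather than inferred per split
train_gender = [one_hot(row[1], GENDERS) for row in train]
```

Fitting the scaler on `train` and reusing `mu` and `sigma` for `test` is what keeps test-set information out of the features.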

Section 04

Model Selection and Training Strategy

Four models are compared:

  1. Baseline Ratio Method: Assumes full marathon time = half-marathon time × average full-to-half ratio, providing a performance baseline
  2. Linear Regression: MATLAB fitlm; highly interpretable, since the coefficients reflect each feature's impact
  3. Neural Network: Three hidden layers of sizes [64, 32, 16] with ReLU activation; can capture non-linear interactions
  4. Bagged Trees Ensemble: MATLAB fitrensemble with 150 learners and a minimum leaf size of 10; reduces variance and improves generalization
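
To make the two simplest models concrete, here is a minimal Python sketch (the project itself fits its models in MATLAB). It assumes the "average coefficient" is the mean of the per-runner full/half ratios in the training set, which the source does not spell out, and it fits the linear model on half-marathon time alone, whereas the project's regression also uses age group and gender. The training pairs and function names are hypothetical.

```python
import statistics

# Hypothetical (half_seconds, full_seconds) training pairs.
train = [(5400, 11200), (6300, 13500), (5700, 11900), (6600, 14100)]

def fit_ratio(pairs):
    """Baseline: mean full/half ratio over the training set (assumed definition)."""
    return statistics.mean(full / half for half, full in pairs)

def fit_line(pairs):
    """Ordinary least squares for full = intercept + slope * half."""
    xs = [half for half, _ in pairs]
    ys = [full for _, full in pairs]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

ratio = fit_ratio(train)
intercept, slope = fit_line(train)

half = 6000  # a 1:40:00 half marathon, in seconds
baseline_minutes = ratio * half / 60
linear_minutes = (intercept + slope * half) / 60
print(f"baseline: {baseline_minutes:.1f} min, linear: {linear_minutes:.1f} min")
```

The baseline forces the prediction line through the origin, while the linear fit also learns an intercept; that extra degree of freedom is one plausible reason the regression scores slightly better.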

Section 05

Experimental Results and Key Findings

Test set performance (partial metrics):

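| Rank | Model | RMSE (minutes) | R² | ±10-minute Accuracy |
|------|-------|----------------|-----|---------------------|
| 1 | Bagged Trees Ensemble | 9.42 | 0.953 | 78% |
| 2 | Linear Regression | 9.70 | 0.951 | 77% |
| 3 | Neural Network | 9.75 | 0.950 | 77% |
| 4 | Baseline Ratio Method | 10.11 | 0.947 | 72% |

Key Findings:
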
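  • The Bagged Trees ensemble leads by a small margin (RMSE 9.42 vs. 9.70 minutes for linear regression), but the linear model is simpler and easier to interpret
  • Half-marathon time is the dominant feature: the baseline ratio method, which uses nothing else, already reaches R² = 0.947
  • Simple models (linear regression) perform comparably to complex models, because the relationship between half-marathon and full-marathon time is close to linear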
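
The three metrics reported for each model can be computed as in the Python sketch below. It assumes times are stored in seconds and converted to minutes for reporting, and it reads "±10-minute accuracy" as the share of predictions that land within 10 minutes of the actual finish time; the source does not define the metric more precisely. The sample values are made up for illustration.

```python
import math

def rmse_minutes(y_true_s, y_pred_s):
    """Root mean squared error, converted from seconds to minutes."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true_s, y_pred_s)) / len(y_true_s)
    return math.sqrt(mse) / 60

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def within_band(y_true_s, y_pred_s, band_minutes=10):
    """Fraction of predictions within +/- band_minutes of the actual time."""
    limit = band_minutes * 60
    hits = sum(abs(t - p) <= limit for t, p in zip(y_true_s, y_pred_s))
    return hits / len(y_true_s)

# Hypothetical finish times and predictions, in seconds.
y_true = [12000, 13200, 15000, 16800]
y_pred = [12300, 13200, 14100, 16800]

print(round(rmse_minutes(y_true, y_pred), 2))  # 7.91
print(round(r_squared(y_true, y_pred), 3))     # 0.932
print(within_band(y_true, y_pred))             # 0.75
```

RMSE is dominated by the single 15-minute miss, while the band accuracy counts it as just one failure out of four, which is why the two metrics are worth reporting side by side.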

Section 06

Extended Applications and Technical Implementation

Binary Classification Extension: Setting a finish-time threshold such as 3:00 or 3:30 turns the task into a binary decision ("Can this runner finish within X hours?"), with accuracy ranging from 94% to 98% depending on the threshold.

Technical Implementation:

  • MATLAB script: marathon_models.m, an end-to-end pipeline
  • Open-source resources: cleaned dataset, model packages, result tables
  • Interactive website: visualizations of model rankings, precision band analysis, residual plots, etc.
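
The binary classification extension described in this section can be sketched as follows. This Python example assumes the yes/no decision is obtained by thresholding the regression model's predicted time, which is one plausible reading; the source does not say whether a separate classifier was trained. The helper names and sample values are hypothetical.

```python
def parse_hms(text):
    """Convert 'H:MM' or 'H:MM:SS' to seconds."""
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.append(0)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds

def threshold_accuracy(y_true_s, y_pred_s, threshold):
    """Share of runners for whom actual and predicted times fall on the same side."""
    limit = parse_hms(threshold)
    agree = sum((t <= limit) == (p <= limit) for t, p in zip(y_true_s, y_pred_s))
    return agree / len(y_true_s)

# Hypothetical actual and predicted finish times, in seconds.
y_true = [10500, 10900, 12400, 12700, 14000]
y_pred = [10600, 10700, 12500, 12500, 13800]

for mark in ("3:00", "3:30"):
    print(mark, threshold_accuracy(y_true, y_pred, mark))
```

Misclassifications can only occur for runners whose actual time is close to the threshold, so accuracy depends heavily on how many runners finish near each mark.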

Section 07

Practical Insights and Project Summary

Insights:

  • Unit selection: Minutes are more intuitive than seconds, and ±X-minute accuracy is a more practical measure for runners
  • Reproducibility: Fixed random seeds and a standardized pipeline are key to sound engineering practice
  • Feature understanding: In-depth analysis of the relationship between half-marathon and full-marathon times matters more than model tuning

Summary: The project supports the "simple model first" principle: linear regression and a lightweight ensemble are the most cost-effective choices for this task. The open-source resources provide a practical reference for sports data science.