Machine Learning for Predicting Marathon Finish Time: A Data Analysis Practice on the 2023 Boston Marathon

This article introduces a machine learning project based on the 2023 Boston Marathon dataset (26,598 runners), which predicts the full marathon finish time using age group, gender, and half-marathon performance. It compares linear regression, neural networks, and ensemble tree models, ultimately achieving a high-precision prediction with an RMSE of only 9.42 minutes.

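Tags: machine learning, marathon prediction, regression analysis, Boston Marathon, data cleaning, model comparison, ensemble learning, sports data analytics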

Published 2026-05-03 06:12 | Recent activity 2026-05-03 09:41 | Estimated read 7 min

Section 01

[Introduction] Machine Learning for Predicting Boston Marathon Finish Time: A High-Precision Model Practice

This article is based on real data from 26,598 runners in the 2023 Boston Marathon. It predicts the full marathon finish time from age group, gender, and half-marathon performance, and compares four approaches: a baseline ratio method, linear regression, a neural network, and a Bagged Trees ensemble. The Bagged Trees ensemble achieves the lowest error, with an RMSE of 9.42 minutes (R² = 0.953). Linear regression comes within 0.3 minutes of that figure, which shows that simple models perform well on this task.

Section 02

Project Background and Data Source

The Boston Marathon is one of the World Marathon Majors, and its results data has high research value. This project uses data from 26,598 finishers of the 2023 event. Core features include:

Age group: 11 age brackets (18-24 to 70+), encoded as a categorical variable
Gender: binary category (male/female)
Half-marathon time: half-marathon performance in seconds (the strongest predictive signal)

The target variable is the net finish time (actual time from start to finish).

Section 03

Data Cleaning and Preprocessing Process

Missing Value Handling: Use MATLAB rmmissing to remove records with missing values
Outlier Filtering:
- Positive time validation: Exclude zero values from Did Not Finish (DNF) cases
- Pace ratio constraint: Remove abnormal data where the ratio of full marathon time to half-marathon time is outside the range 1.80-3.20
Feature Engineering:
- One-hot encoding for categorical variables (with a fixed category list, so the test set cannot introduce categories the model has not seen)
- Standardization of half-marathon time (using only training set mean/standard deviation to prevent data leakage)
The dataset is split into training/test sets in an 80/20 ratio, with a random seed of 42 to ensure reproducibility.
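
The cleaning, encoding, and splitting steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's MATLAB script: the field names half_s, full_s, and gender, all helper names, and the sample rows are hypothetical, and the shuffle-based split is only one way to realize a seeded 80/20 split.

```python
import random
import statistics

def clean_records(records, lo=1.80, hi=3.20):
    """Drop rows with missing or non-positive times or an implausible full/half ratio."""
    kept = []
    for r in records:
        half, full = r.get("half_s"), r.get("full_s")
        if half is None or full is None:      # missing values
            continue
        if half <= 0 or full <= 0:            # zero times from DNF cases
            continue
        if not (lo <= full / half <= hi):     # pace ratio constraint
            continue
        kept.append(r)
    return kept

def one_hot(value, categories):
    """Encode a value against a fixed category list (one column per category)."""
    return [1.0 if value == c else 0.0 for c in categories]

def split_train_test(records, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed and return (train, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_standardizer(train, key="half_s"):
    """Mean and sample standard deviation computed on the training set only."""
    values = [r[key] for r in train]
    return statistics.fmean(values), statistics.stdev(values)

def standardize(records, mean, std, key="half_s"):
    """Apply training-set statistics to any set of rows (train or test)."""
    return [(r[key] - mean) / std for r in records]

# Made-up rows for illustration only; these are not taken from the dataset.
rows = [
    {"half_s": 5400, "full_s": 11100, "gender": "F"},
    {"half_s": 6000, "full_s": 12900, "gender": "M"},
    {"half_s": 4800, "full_s": 9900, "gender": "M"},
    {"half_s": 7200, "full_s": 15600, "gender": "F"},
    {"half_s": 6600, "full_s": 14100, "gender": "F"},
    {"half_s": 0, "full_s": 0, "gender": "M"},         # DNF, removed
    {"half_s": 5400, "full_s": 19800, "gender": "M"},  # ratio above 3.20, removed
]
train, test = split_train_test(clean_records(rows))
mean, std = fit_standardizer(train)
print(standardize(test, mean, std), one_hot(test[0]["gender"], ["F", "M"]))
```

The test rows are scaled with the mean and standard deviation fitted on the training rows, which is the leakage precaution described above. Age bracket would be encoded the same way as gender, using the full list of 11 brackets as the category list.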

Section 04

Model Selection and Training Strategy

Four models are compared:

Baseline Ratio Method: Assumes full marathon time = half-marathon time × average coefficient, providing a performance baseline
Linear Regression: MATLAB fitlm, high interpretability, coefficients reflect feature impact
Neural Network: Three-layer structure [64,32,16], ReLU activation, captures non-linear interactions
Bagged Trees Ensemble: MATLAB fitrensemble, 150 learners + minimum leaf node size of 10, reduces variance and improves generalization.
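
As an illustration of the simplest of these models, the baseline ratio method can be sketched in a few lines of Python. The article does not say how the average coefficient is computed; the sketch assumes it is the mean of the per-runner full/half ratios on the training set, and the function names are hypothetical.

```python
import statistics

def fit_ratio(train_half_s, train_full_s):
    """Average full/half ratio over the training runners (assumed definition)."""
    ratios = [full / half for half, full in zip(train_half_s, train_full_s)]
    return statistics.fmean(ratios)

def predict_ratio(coef, half_times_s):
    """Predicted full marathon time = half-marathon time x average coefficient."""
    return [coef * half for half in half_times_s]

# Made-up training values, in seconds, for illustration only.
coef = fit_ratio([5400, 6000, 6600], [11100, 12900, 14100])
print(coef, predict_ratio(coef, [5700]))
```

Because the baseline has a single parameter and ignores age group and gender, it gives a floor against which the other three models can be judged.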

Section 05

Experimental Results and Key Findings

Test set performance (partial metrics):

Rank	Model	RMSE (minutes)	R²	±10-minute Accuracy
1	Bagged Trees Ensemble	9.42	0.953	78%
2	Linear Regression	9.70	0.951	77%
3	Neural Network	9.75	0.950	77%
4	Baseline Ratio Method	10.11	0.947	72%
Key Findings:

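The ensemble model is slightly ahead (RMSE 9.42 vs. 9.70 minutes), but the linear model is simpler and easier to interpret
Half-marathon time is the dominant feature, explaining most of the variance alone (the ratio baseline, which uses no other feature, already reaches R² = 0.947)
Simple models (linear regression) perform comparably to complex models, because the relationship between half-marathon time and finish time is largely linear.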
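
The three metrics reported in the table can be computed as in the following Python sketch. It assumes times are stored in seconds and reported in minutes, and that "±10-minute accuracy" means the share of runners whose absolute prediction error is at most 10 minutes; the article does not spell out either detail, and the function names are hypothetical.

```python
import math

def rmse_minutes(actual_s, predicted_s):
    """Root mean squared error, converted from seconds to minutes."""
    errors = [(p - a) / 60.0 for a, p in zip(actual_s, predicted_s)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def within_band_accuracy(actual_s, predicted_s, band_minutes=10.0):
    """Fraction of predictions whose absolute error is within the band."""
    limit = band_minutes * 60.0
    hits = sum(abs(p - a) <= limit for a, p in zip(actual_s, predicted_s))
    return hits / len(actual_s)

# Made-up values in seconds, for illustration only.
actual = [10800, 12600, 14400]
predicted = [11100, 11700, 15000]
print(rmse_minutes(actual, predicted),
      r_squared(actual, predicted),
      within_band_accuracy(actual, predicted))
```

RMSE penalizes large misses more heavily, while the band accuracy answers the more practical question of how often a prediction is close enough to be useful.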

Section 06

Extended Applications and Technical Implementation

Binary Classification Extension: Set finish time thresholds such as 3:00/3:30, converting to a binary decision of "Can finish within X hours", with accuracy ranging from 94% to 98%.

Technical Implementation:

MATLAB script: marathon_models.m end-to-end pipeline
Open-source resources: cleaned dataset, model packages, result tables
Interactive website: visualizations of model rankings, precision band analysis, residual plots, etc.
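
The binary classification extension described in this section can be sketched in Python as follows. The sketch assumes the decision is obtained by thresholding the regression prediction rather than by training a separate classifier, and that "within" includes the threshold itself; the article confirms neither, and the function names are hypothetical.

```python
def hm_to_seconds(text):
    """Convert an 'H:MM' threshold such as '3:30' to seconds."""
    hours, minutes = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60

def finishes_within(time_s, threshold_s):
    """Binary label: True if the time is at or under the threshold."""
    return time_s <= threshold_s

def threshold_accuracy(actual_s, predicted_s, threshold_s):
    """Share of runners whose predicted label matches the actual label."""
    agree = sum(
        finishes_within(a, threshold_s) == finishes_within(p, threshold_s)
        for a, p in zip(actual_s, predicted_s)
    )
    return agree / len(actual_s)

# Made-up values in seconds, for illustration only.
actual = [12000, 13000, 12500, 12700]
predicted = [12100, 12900, 12650, 12800]
print(threshold_accuracy(actual, predicted, hm_to_seconds("3:30")))
```

Misclassifications can only occur for runners whose actual and predicted times fall on opposite sides of the threshold, which is consistent with the accuracy being much higher than the ±10-minute band figures.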

Section 07

Practical Insights and Project Summary

Insights:

Unit selection: Minutes are more intuitive than seconds, and ±X-minute accuracy is a more practical metric
Reproducibility: Fixed random seeds and a standardized pipeline are key to reproducible engineering
Feature understanding: In-depth analysis of the relationship between half-marathon and full-marathon times matters more than model tuning

Summary: The project supports the "simple model first" principle; linear or lightweight ensemble models are the most cost-effective choice. The open-source resources provide a practical reference for sports data science.
