Cross-League Football Data Conversion: How Machine Learning Solves the Comparability Challenge of Player Statistics in the Big Five Leagues

This article introduces a machine learning framework that converts player statistics across Europe's Big Five leagues and uses conformal prediction to quantify the uncertainty of each prediction.

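Tags: football data analysis, machine learning, conformal prediction, cross-league comparison, sports analytics, CatBoost, player evaluation, uncertainty quantification, Europe's Big Five leagues, transfer analysis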

Published 2026-06-17 04:45 | Recent activity 2026-06-17 04:48 | Estimated read 6 min

Section 01

[Introduction] Machine Learning Solves the Cross-League Comparability Challenge of Player Statistics in the Big Five Leagues

This project, developed by Mohammad Arshan Shaikh and open-sourced on GitHub, uses a machine learning framework to convert per-90-minute player statistics between Europe's Big Five leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1). It also applies conformal prediction to quantify prediction uncertainty, addressing the "league barrier" that has long made cross-league comparison unreliable for analysts.

Section 02

Research Background: Core Challenges in Cross-League Player Comparison

League Barrier Issue

Differences in competitive level, playing style, and refereeing standards across leagues make direct comparisons of raw statistics misleading. Traditional methods rely on subjective judgment or simple coefficient adjustments, which are neither systematic nor accompanied by any measure of uncertainty.

Core Problem

The study focuses on predicting how a player's statistics change after a transfer between Big Five leagues. It uses data from the 2017/18 to 2023/24 seasons and covers 25 league conversion directions.
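As a quick check on that count: five leagues give 25 ordered (source, target) pairs when same-league pairs are included, and 20 when they are excluded. Which convention the project uses is not stated here, so the enumeration below is only an illustration of the arithmetic, not a description of the dataset.

```python
from itertools import product

LEAGUES = ["Premier League", "La Liga", "Bundesliga", "Serie A", "Ligue 1"]

# Every ordered (source, target) pair, including same-league pairs.
all_pairs = list(product(LEAGUES, repeat=2))

# Only the pairs in which the league actually changes.
cross_pairs = [(src, dst) for src, dst in all_pairs if src != dst]

print(len(all_pairs), len(cross_pairs))
```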

Section 03

Methodology: Three-Stage Conformal Prediction Framework

1. Transfer Bridge Strategy

Track cross-league transferred players and pair their pre- and post-transfer data (requiring ≥5 full 90-minute appearances). The final dataset includes samples for forwards (202 training / 82 test), midfielders (255 / 109), and defenders (257 / 135).
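The pairing step might look roughly like the sketch below. It is not the project's pipeline: the row layout, the player names, and the numbers are invented for illustration, and applying the five-90s threshold to both seasons is an assumption.

```python
from collections import defaultdict

# Hypothetical per-season rows: (player, season, league, nineties, goals_per90).
# All values are made up for illustration.
ROWS = [
    ("Player A", "2019/20", "Ligue 1", 24.0, 0.50),
    ("Player A", "2020/21", "Premier League", 18.5, 0.35),
    ("Player B", "2019/20", "Serie A", 30.0, 0.20),
    ("Player B", "2020/21", "Serie A", 28.0, 0.25),   # no league change
    ("Player C", "2021/22", "Bundesliga", 12.0, 0.40),
    ("Player C", "2022/23", "La Liga", 3.0, 0.10),    # too few 90s after the move
]

MIN_NINETIES = 5.0  # threshold quoted in the text


def build_bridge(rows, min_nineties=MIN_NINETIES):
    """Pair adjacent season records of a player that span two leagues."""
    by_player = defaultdict(list)
    for player, season, league, nineties, stat in rows:
        by_player[player].append((season, league, nineties, stat))

    pairs = []
    for player, seasons in by_player.items():
        seasons.sort()  # "2019/20" < "2020/21" also sorts chronologically as text
        for before, after in zip(seasons, seasons[1:]):
            changed_league = before[1] != after[1]
            enough_minutes = min(before[2], after[2]) >= min_nineties
            if changed_league and enough_minutes:
                pairs.append({
                    "player": player,
                    "source_league": before[1],
                    "target_league": after[1],
                    "stat_before": before[3],
                    "stat_after": after[3],
                })
    return pairs


bridge = build_bridge(ROWS)
```

With these toy rows only Player A survives both filters, which is why the number of usable pairs per position ends up in the low hundreds rather than the thousands.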

2. Model Zoo

Compare 12 algorithms: linear models (Ridge/Lasso, etc.), tree ensembles (RandomForest/CatBoost, etc.), robust regression (GBM-Huber, etc.), and neural networks (MLP).

3. Conformal Prediction Framework

V1: 20% validation, coverage rate of 52.5%
V2: 30% validation + interaction features, coverage rate of 83.3%
Mondrian: League-stratified calibration, average coverage rate of 93.2% (meeting the 90% nominal requirement)

Conformal prediction is distribution-free and provides finite-sample guarantees.
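The difference between a single calibration set and league-stratified (Mondrian) calibration can be sketched as follows. This is a minimal illustration of split conformal prediction, assuming absolute residuals as the nonconformity score; the project's actual score, its split, and the league names and residual values used here are not taken from the source.

```python
import math
from collections import defaultdict


def conformal_quantile(residuals, alpha=0.10):
    """Finite-sample (1 - alpha) quantile of absolute calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank in the sorted list
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(residuals)[rank - 1]


def mondrian_quantiles(residuals, groups, alpha=0.10):
    """One conformal quantile per group (here: per league)."""
    by_group = defaultdict(list)
    for residual, group in zip(residuals, groups):
        by_group[group].append(residual)
    return {g: conformal_quantile(rs, alpha) for g, rs in by_group.items()}


def interval(prediction, q):
    """Symmetric prediction interval around a point prediction."""
    return prediction - q, prediction + q


# Made-up absolute residuals |y - y_hat| on a calibration set, nine per league.
residuals = [i / 10 for i in range(1, 10)] + [i / 100 for i in range(1, 10)]
groups = ["League X"] * 9 + ["League Y"] * 9

q_global = conformal_quantile(residuals)             # one width for every league
q_by_league = mondrian_quantiles(residuals, groups)  # one width per league
low, high = interval(0.0, q_by_league["League Y"])
```

In this toy data the pooled width is set by the noisier league, while the stratified version gives each league an interval matched to its own residuals. The `math.inf` branch also shows why small groups are a problem: at a 90% target, a group with fewer than nine calibration points cannot produce a finite interval under this rule.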

Section 04

Key Findings: Model Performance and Feature Analysis

Overall Performance

The average MAE is 0.3325 on the log-ratio scale, a 6.1% improvement over the mean baseline and a 7.6% improvement over the paired-mean baseline.
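To make the metric concrete, here is a small sketch of how an MAE on a log-ratio target and a percentage improvement over a baseline could be computed. The exact target definition is an assumption (log of the post-transfer value over the pre-transfer value, with a small `eps` to guard against zeros), and the numbers are toy values, not the project's results.

```python
import math


def log_ratio(before, after, eps=1e-6):
    """Target on the log-ratio scale; eps guards against zero per-90 values."""
    return math.log((after + eps) / (before + eps))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def improvement(model_mae, baseline_mae):
    """Relative MAE reduction versus a baseline, in percent."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Toy log-ratio targets and predictions (made up).
y_true = [0.0, 0.5, -0.5, 1.0]
model_pred = [0.1, 0.4, -0.3, 0.6]
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)  # constant mean baseline

model_mae = mae(y_true, model_pred)
baseline_mae = mae(y_true, mean_pred)
gain = improvement(model_mae, baseline_mae)
```

A log-ratio target of 0 means the per-90 value is unchanged after the move, and positive values mean it increased, so errors are measured in relative rather than absolute terms.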

Model Ranking

CatBoost dominates 5 metrics with balanced performance
Progressive passes for midfielders: GBM-Huber is optimal
Interceptions for defenders: LightGBM with interaction features is optimal

Feature Importance

A player's original statistics in the source league and the difference in UEFA coefficients between the two leagues are the key predictors.
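A feature row built from those two ingredients might look like the sketch below. The league names and strength numbers are placeholders, not real UEFA coefficients, and the interaction term is only a guess at what an "interaction feature" could be; the source does not define the project's feature set.

```python
# Placeholder league-strength numbers, NOT real UEFA coefficients.
LEAGUE_COEFF = {"League X": 90.0, "League Y": 75.0}


def make_features(stat_before, source_league, target_league, coeff=LEAGUE_COEFF):
    """Feature row: source-league per-90 stat plus the coefficient gap."""
    diff = coeff[target_league] - coeff[source_league]
    return {
        "stat_before": stat_before,
        "coeff_diff": diff,
        # Hypothetical interaction term (an assumption, not from the source).
        "stat_x_coeff_diff": stat_before * diff,
    }


features = make_features(0.40, "League Y", "League X")
```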

Section 05

Practical Application Value

Transfer Decisions: Scientifically evaluate the expected performance and confidence level of overseas players
Gambling/Fantasy Football: Build fair valuation models
Academic Research: Submitted to the Journal of Quantitative Analysis in Sports and under review

Section 06

Technical Implementation and Reproducibility

Code structure: MLCLSTData.ipynb (data pipeline), MLCLST.ipynb (modeling process)
Reproducibility: Fixed random seed (42), dependencies managed via requirements.txt
Data: Raw FBref data is not included, but the bridge CSV can be used directly
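What a fixed seed buys in practice can be shown with a seeded split. This is a generic sketch rather than the notebooks' code: the helper name and the 30% test fraction are assumptions, and only the seed value 42 comes from the text.

```python
import random

SEED = 42  # the seed value quoted in the text


def train_test_split(items, test_fraction=0.3, seed=SEED):
    """Shuffle with a private, seeded generator and split deterministically."""
    rng = random.Random(seed)  # leaves the global random state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = train_test_split(range(10))
train_b, test_b = train_test_split(range(10))
```

Both calls return the same partition, so anyone re-running the analysis on the bridge CSV evaluates on the same held-out players.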

Section 07

Limitations and Future Directions

Limitations

Small sample size for some league combinations
Simplified position classification (forward/midfielder/defender)
Changes in league strength over time are not modeled

Future Directions

Incorporate physical fitness data
Refine position classification
Explore deep learning architectures
Regularly update the model

Section 08

Conclusion: Methodological Progress in Sports Data Science

This study addresses the practical problem of cross-league player comparison and introduces conformal prediction to quantify uncertainty, which is rare in sports analytics. The open-source code gives industry practitioners a usable tool and shows machine learning researchers a concrete application of conformal prediction.