# Cross-League Football Data Conversion: How Machine Learning Solves the Comparability Challenge of Player Statistics in the Big Five Leagues

> This article introduces an innovative machine learning framework for cross-league conversion of player statistics in Europe's Big Five Leagues, and incorporates conformal prediction methods to quantify prediction uncertainty.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-16T20:45:24.000Z
- Last activity: 2026-06-16T20:48:54.150Z
- Popularity: 163.9
- Keywords: football data analysis, machine learning, conformal prediction, cross-league comparison, sports analytics, CatBoost, player evaluation, uncertainty quantification, Europe's Big Five leagues, transfer analysis
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-mohammadarshan-ml-football-translation
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-mohammadarshan-ml-football-translation
- Markdown source: floors_fallback

---

## [Introduction] Machine Learning Solves the Cross-League Comparability Challenge of Player Statistics in the Big Five Leagues

This project, developed by Mohammad Arshan Shaikh and open-sourced on GitHub, uses a machine learning framework to convert per-90-minute player statistics between Europe's Big Five leagues (Premier League, La Liga, Bundesliga, Serie A, Ligue 1). It also applies conformal prediction to quantify prediction uncertainty, addressing the long-standing "league barrier" that makes raw statistics hard for analysts to compare across leagues.

## Research Background: Core Challenges in Cross-League Player Comparison

### League Barrier Issue
Differences in competitive level, playing style, and refereeing standards across leagues make direct comparisons of raw statistics misleading. Traditional methods rely on subjective judgment or simple coefficient adjustments; they are not systematic and give no estimate of uncertainty.

### Core Problem
The project focuses on predicting how a player's statistics change after a transfer between Big Five leagues. It uses data from the 2017/18 to 2023/24 seasons and covers 25 league conversion directions.

## Methodology: Three-Stage Conformal Prediction Framework

### 1. Transfer Bridge Strategy
Players who transferred between leagues are tracked, and their pre- and post-transfer statistics are paired (requiring ≥5 full 90-minute appearances). The final dataset contains samples for forwards (202 training / 82 test), midfielders (255 / 109), and defenders (257 / 135).
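The pairing step can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the `SeasonRow` fields, the `build_bridge` helper, and the toy rows are hypothetical, the threshold is assumed to apply to both seasons, and the log-ratio target is assumed from the "log ratio scale" on which the article reports errors.

```python
import math
from dataclasses import dataclass

MIN_NINETIES = 5.0  # article's threshold; applying it to both seasons is an assumption


@dataclass(frozen=True)
class SeasonRow:  # hypothetical schema for one player-season
    player: str
    season: int        # starting year, e.g. 2017 for 2017/18
    league: str
    nineties: float    # minutes played / 90
    stat_per90: float  # any per-90 statistic


def build_bridge(rows):
    """Pair consecutive seasons in which a player changed league."""
    by_player = {}
    for row in rows:
        by_player.setdefault(row.player, []).append(row)

    pairs = []
    for player, seasons in by_player.items():
        seasons.sort(key=lambda r: r.season)
        for pre, post in zip(seasons, seasons[1:]):
            if post.season != pre.season + 1:
                continue  # not back-to-back seasons (also skips mid-season duplicates)
            if pre.league == post.league:
                continue  # no cross-league move
            if min(pre.nineties, post.nineties) < MIN_NINETIES:
                continue  # too little playing time on one side
            if pre.stat_per90 <= 0 or post.stat_per90 <= 0:
                continue  # log ratio undefined
            pairs.append({
                "player": player,
                "source": pre.league,
                "target": post.league,
                "pre_per90": pre.stat_per90,
                "log_ratio": math.log(post.stat_per90 / pre.stat_per90),
            })
    return pairs


rows = [
    SeasonRow("Player A", 2018, "Bundesliga", 24.0, 0.40),
    SeasonRow("Player A", 2019, "Premier League", 20.0, 0.30),
    SeasonRow("Player B", 2018, "Serie A", 3.0, 0.10),   # too few 90s, dropped
    SeasonRow("Player B", 2019, "La Liga", 18.0, 0.20),
    SeasonRow("Player C", 2018, "Ligue 1", 30.0, 0.50),  # no league change, dropped
    SeasonRow("Player C", 2019, "Ligue 1", 28.0, 0.55),
]
bridge = build_bridge(rows)
print(bridge)
```

Only Player A survives the filters here, which illustrates why the resulting samples are small relative to the full player pool.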

### 2. Model Zoo
Twelve algorithms are compared across four families: linear models (Ridge, Lasso, etc.), tree ensembles (RandomForest, CatBoost, etc.), robust regression (GBM-Huber, etc.), and neural networks (MLP).
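A comparison of this kind might look like the sketch below. It assumes scikit-learn and NumPy are installed, uses synthetic data in place of the bridge dataset, and covers only five of the model families (CatBoost and LightGBM are left out to avoid extra dependencies). The hyperparameters and the `compare` helper are illustrative choices, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (features, log-ratio target); not real football data.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.3, size=200)

models = {
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.01)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    # Gradient boosting with a Huber loss, one way to get a "GBM-Huber" model
    "GBM-Huber": GradientBoostingRegressor(loss="huber", random_state=42),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=42),
    ),
}


def compare(models, X, y, folds=5):
    """Return {model name: cross-validated MAE}, best model first."""
    scores = {}
    for name, model in models.items():
        neg_mae = cross_val_score(
            model, X, y, cv=folds, scoring="neg_mean_absolute_error"
        )
        scores[name] = float(-neg_mae.mean())
    return dict(sorted(scores.items(), key=lambda kv: kv[1]))


ranking = compare(models, X, y)
for name, mae in ranking.items():
    print(f"{name:>12}: MAE = {mae:.3f}")
```

Running the same loop once per statistic and position is one way to arrive at a per-metric ranking like the one reported later in the article.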

### 3. Conformal Prediction Framework
- V1: 20% validation split, empirical coverage of 52.5%
- V2: 30% validation split plus interaction features, empirical coverage of 83.3%
- Mondrian: league-stratified calibration, average coverage of 93.2% (meeting the 90% nominal target)

Conformal prediction is distribution-free and provides finite-sample coverage guarantees, provided the calibration and test data are exchangeable.
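The difference between pooled calibration and Mondrian (group-wise) calibration can be illustrated with a small split-conformal sketch. It uses absolute residuals as the nonconformity score, which is a common choice but an assumption here, and synthetic data with two hypothetical groups, `quiet` and `noisy`, standing in for leagues. The article does not say whether stratification is by source league, target league, or the pair.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def split_conformal(y_cal, pred_cal, alpha=0.10):
    """One interval half-width shared by all predictions."""
    return conformal_quantile([abs(y - p) for y, p in zip(y_cal, pred_cal)], alpha)


def mondrian_conformal(y_cal, pred_cal, groups, alpha=0.10):
    """One interval half-width per group, calibrated on that group only."""
    by_group = {}
    for y, p, g in zip(y_cal, pred_cal, groups):
        by_group.setdefault(g, []).append(abs(y - p))
    return {g: conformal_quantile(s, alpha) for g, s in by_group.items()}


def coverage(y, pred, half_width):
    """Share of true values inside [pred - half_width, pred + half_width]."""
    return sum(abs(t - p) <= half_width for t, p in zip(y, pred)) / len(y)


# Synthetic demo: the model predicts 0.0 and each group has its own noise level.
random.seed(42)
NOISE = {"quiet": 0.1, "noisy": 0.6}  # hypothetical groups


def simulate(n_per_group):
    groups = [g for g in NOISE for _ in range(n_per_group)]
    y = [random.gauss(0.0, NOISE[g]) for g in groups]
    return y, [0.0] * len(y), groups


y_cal, pred_cal, g_cal = simulate(500)
y_test, pred_test, g_test = simulate(500)

pooled = split_conformal(y_cal, pred_cal)
per_group = mondrian_conformal(y_cal, pred_cal, g_cal)

for g in NOISE:
    idx = [i for i, gi in enumerate(g_test) if gi == g]
    yt = [y_test[i] for i in idx]
    pt = [pred_test[i] for i in idx]
    print(g,
          "pooled coverage:", round(coverage(yt, pt, pooled), 3),
          "Mondrian coverage:", round(coverage(yt, pt, per_group[g]), 3))
```

With a single pooled half-width, the noisier group tends to be under-covered and the quieter group over-covered. Calibrating per group targets the nominal level within each group, at the cost of fewer calibration points per group.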

## Key Findings: Model Performance and Feature Analysis

### Overall Performance
The average MAE is 0.3325 on the log-ratio scale, a 6.1% improvement over the mean baseline and a 7.6% improvement over the paired-mean baseline.
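To make these numbers concrete, the sketch below shows how an MAE on the log-ratio scale and a relative improvement can be computed. It assumes the target is `log(post / pre)` and that "improvement" means the relative reduction in MAE; neither definition is spelled out in the article. The toy values and helper names are hypothetical, and the baseline MAEs are back-calculated under those assumptions rather than reported figures.

```python
import math


def log_ratio(pre_per90, post_per90):
    """Assumed target: log of post-transfer over pre-transfer per-90 value."""
    return math.log(post_per90 / pre_per90)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def improvement_pct(model_mae, baseline_mae):
    """Relative MAE reduction versus a baseline, in percent (assumed definition)."""
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


def implied_baseline(model_mae, improvement):
    """Invert improvement_pct to recover the baseline MAE."""
    return model_mae / (1.0 - improvement / 100.0)


# Toy example with made-up values
y_true = [log_ratio(0.40, 0.30), log_ratio(0.20, 0.25), log_ratio(1.0, 1.0)]
y_model = [-0.20, 0.10, 0.05]      # hypothetical model predictions
y_baseline = [0.0] * len(y_true)   # hypothetical mean baseline: predict "no change"
print("toy improvement (%):",
      improvement_pct(mae(y_true, y_model), mae(y_true, y_baseline)))

# Reading the article's figures under the assumptions above
typical_factor = math.exp(0.3325)                     # multiplicative size of the error
mean_baseline_mae = implied_baseline(0.3325, 6.1)     # implied, not reported
paired_baseline_mae = implied_baseline(0.3325, 7.6)   # implied, not reported
print(typical_factor, mean_baseline_mae, paired_baseline_mae)
```

Under these assumptions, an absolute log-ratio error of 0.3325 corresponds to predictions that are typically off by a factor of roughly 1.39 in either direction, and the implied baseline MAEs are about 0.354 and 0.360.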

### Model Ranking
- CatBoost ranks first on 5 metrics, with balanced performance
- Progressive passes for midfielders: GBM-Huber performs best
- Interceptions for defenders: LightGBM with interaction features performs best

### Feature Importance
The player's original statistics in the source league and the difference in UEFA coefficients between the two leagues are the key predictors.

## Practical Application Value

- **Transfer Decisions**: Estimate the expected performance of players from other leagues, together with a confidence level
- **Betting and Fantasy Football**: Build fair valuation models
- **Academic Research**: The work has been submitted to the *Journal of Quantitative Analysis in Sports* and is under review

## Technical Implementation and Reproducibility

- Code structure: `MLCLSTData.ipynb` (data pipeline), `MLCLST.ipynb` (modeling process)
- Reproducibility: fixed random seed (42), dependencies managed via `requirements.txt`
- Data: raw FBref data is not included, but the bridge CSV can be used directly

## Limitations and Future Directions

### Limitations
- Small sample size for some league combinations
- Simplified position classification (forward/midfielder/defender)
- Changes in league strength over time are not modeled

### Future Directions
- Incorporate physical fitness data
- Refine position classification
- Explore deep learning architectures
- Regularly update the model

## Conclusion: Methodological Progress in Sports Data Science

This study addresses the practical problem of cross-league player comparison and introduces conformal prediction to quantify uncertainty, which is rare in sports analytics. The open-source code gives industry practitioners a principled tool and shows machine learning researchers a practical application of conformal prediction.
