Zing Forum

AeroScrape: Building a Complete Flight Delay Prediction MLOps Pipeline from Scratch

AeroScrape is an end-to-end MLOps project that demonstrates how to build a complete machine learning pipeline, from data scraping and storage through model training to API deployment, using Apache Airflow for orchestration and MLflow for model management.

Tags: MLOps, Machine Learning, Flight Prediction, Airflow, MLflow, FastAPI, PostgreSQL
Published 2026-06-04 00:45 | Recent activity 2026-06-04 00:49 | Estimated read 6 min


Section 02

Original Author and Source


Section 03

Project Background and Motivation

Flight delays are a long-standing pain point in the aviation industry: they degrade the passenger experience and cause substantial economic losses for airlines. Traditional delay prediction often relies on simple statistical rules, which struggle to capture complex temporal patterns and interactions between multiple factors. AeroScrape addresses this gap as a complete MLOps case study, automating every step from raw data acquisition to production-grade API deployment.

The project's value lies not only in its technical implementation but also in the replicable machine learning engineering template it offers small and medium-sized teams. For developers who want to turn experimental models into reliable services, AeroScrape's architecture and engineering practices are a useful reference.


Section 04

System Architecture Overview

AeroScrape adopts a modular microservices architecture with clear separation of responsibilities for each component:

Section 05

Data Layer

  • Data Source: Real-time flight departure and arrival data scraped from fids.airport.ir
  • Storage: A PostgreSQL database that persists both raw and processed flight information
  • Data Validation: Pandera schemas that check the quality of the raw data

Section 06

Machine Learning Pipeline

The project uses Apache Airflow to orchestrate the complete ML workflow as a DAG with the following stages:

  1. Data Validation — Use Pandera to ensure input data conforms to the expected schema
  2. Data Cleaning and Feature Engineering — Extract derived features such as time periods, seasons, and holidays
  3. Data Preprocessing — Dataset splitting, standardization, One-Hot encoding
  4. Hyperparameter Tuning — Use Optuna for automatic hyperparameter tuning of LightGBM (regression) and logistic regression (classification) models
  5. Model Training and Evaluation — Train the optimized models and calculate performance metrics
  6. Model Registration — Register the best model version to the MLflow Model Registry
  7. Cleanup — Remove intermediate data to keep the environment clean
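
To make stage 2 more concrete, the sketch below derives the three kinds of features the list mentions (time period, season, holiday flag) from a scheduled timestamp. The hour buckets, the Northern Hemisphere season mapping, and the `HOLIDAYS` set are assumptions for illustration, not the project's actual rules.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical holiday calendar; a real pipeline would load the relevant
# national calendar instead of hard-coding dates.
HOLIDAYS = frozenset({date(2026, 3, 21), date(2026, 3, 22)})


@dataclass(frozen=True)
class TimeFeatures:
    hour: int
    time_period: str
    season: str
    is_holiday: bool


def time_period(hour: int) -> str:
    """Bucket an hour (0-23) into a coarse part of the day."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def season(month: int) -> str:
    """Map a month (1-12) to a Northern Hemisphere meteorological season."""
    # December wraps to index 0 so that Dec, Jan, and Feb all map to winter.
    return ("winter", "spring", "summer", "autumn")[(month % 12) // 3]


def derive_features(scheduled: datetime) -> TimeFeatures:
    """Derive calendar features from a flight's scheduled time."""
    return TimeFeatures(
        hour=scheduled.hour,
        time_period=time_period(scheduled.hour),
        season=season(scheduled.month),
        is_holiday=scheduled.date() in HOLIDAYS,
    )
```

Categorical outputs such as `time_period` and `season` are the kind of columns that the one-hot encoding in stage 3 would then expand.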

Section 07

Model Management Layer

MLflow handles experiment tracking and model version management:

  • Automatically records parameters, metrics, and artifacts for each training run
  • Provides model version control and lifecycle management through the Model Registry
  • Supports aliases (e.g., "production"), which make it easy to switch model versions in production environments
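
The alias mechanism can be sketched as follows, assuming a recent MLflow 2.x release with registry alias support. The model name `flight_delay_regressor` and the three helper functions are hypothetical; only `MlflowClient.set_registered_model_alias`, `mlflow.pyfunc.load_model`, and the `models:/<name>@<alias>` URI form come from MLflow itself.

```python
def alias_uri(model_name: str, alias: str = "production") -> str:
    """Build the registry URI that resolves an alias to a model version."""
    return f"models:/{model_name}@{alias}"


def promote(model_name: str, version: int, alias: str = "production") -> None:
    """Point the alias at the given registered model version."""
    # Imported lazily so the URI helper can be used without MLflow installed.
    from mlflow import MlflowClient

    MlflowClient().set_registered_model_alias(model_name, alias, str(version))


def load_by_alias(model_name: str, alias: str = "production"):
    """Load whichever version the alias currently points to."""
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(alias_uri(model_name, alias))
```

Because consumers only reference the alias, calling something like `promote("flight_delay_regressor", 7)` would change what `load_by_alias("flight_delay_regressor")` returns without any change on the serving side.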

Section 08

Inference Service Layer

A RESTful API built with FastAPI serves real-time predictions:

  • Dynamically loads the registered best model from the MLflow Model Registry
  • Supports hot reloading, so the model can be switched without restarting the service
  • Provides health check endpoints for monitoring
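
The post does not describe how the hot reload is implemented. One common approach is to keep the loaded model behind a small holder object and swap the reference when the registry points at a new version. The sketch below uses only the standard library; `HotSwappableModel`, `resolve_version`, and `load` are hypothetical names, with the two callables standing in for a registry lookup and a model loader.

```python
import threading
from typing import Any, Callable, Optional, Tuple


class HotSwappableModel:
    """Holds the active model and swaps it when a newer version appears."""

    def __init__(
        self,
        resolve_version: Callable[[], str],
        load: Callable[[str], Any],
    ) -> None:
        self._resolve_version = resolve_version  # e.g. a registry alias lookup
        self._load = load  # e.g. a function that loads a model by version
        self._lock = threading.Lock()
        self._version: Optional[str] = None
        self._model: Any = None

    def refresh(self) -> bool:
        """Reload if the target version changed; return True when swapped."""
        latest = self._resolve_version()
        with self._lock:
            if latest == self._version:
                return False
        # Load outside the lock so in-flight predictions are not blocked.
        model = self._load(latest)
        with self._lock:
            self._model, self._version = model, latest
        return True

    def current(self) -> Tuple[Optional[str], Any]:
        """Return the (version, model) pair currently being served."""
        with self._lock:
            return self._version, self._model
```

In a FastAPI service, a prediction route would read `current()` on each request, a reload route or background task would call `refresh()`, and a health check could report the version being served. Two concurrent `refresh()` calls may both load the model; that is harmless here, but a production service might serialize reloads.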