Zing Forum


Full-Stack Machine Learning Practice: Building a Taxi Trip Duration Prediction System

A detailed explanation of a full-stack ML project based on FastAPI, React, and Apache Spark, showing how to build a scalable trip duration prediction service from scratch and covering the complete workflow from data processing and model training to deployment.

Tags: Machine Learning · Full-Stack Development · FastAPI · React · Apache Spark · Trip Prediction · Engineering Practice · MLOps
Published 2026-05-02 01:16 · Recent activity 2026-05-02 01:21 · Estimated read 7 min

Section 01

Full-Stack Machine Learning Practice: Guide to Taxi Trip Duration Prediction System

This article introduces the "taxi-ml-predictor" project created by maryhansabry, which builds a taxi trip duration prediction system on a full-stack architecture: Apache Spark for data processing, the Python ML ecosystem for model training, FastAPI for the backend, and React for the frontend. It demonstrates the complete engineering path from raw data to a working product and serves as a reference for similar projects.


Section 02

Business Background and Technical Challenges

Taxi trip duration prediction is a core problem in the ride-hailing domain, affecting passenger experience, driver decisions, and platform dispatching. The main challenges are: 1. Complex spatio-temporal factors: non-linear interactions among departure time, weather, traffic congestion, and similar variables; 2. Data scale and latency: processing massive order volumes while serving predictions with low latency; 3. Concept drift: model performance degrades as traffic patterns shift over time.
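The concept-drift concern above can be monitored with a simple sketch: track the serving error over a recent window of predictions and compare it to the offline baseline. The 20% tolerance below is an illustrative assumption, not a value from the project.

```python
import math

def batch_rmse(y_true, y_pred):
    """RMSE over one evaluation window of served predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def drift_detected(baseline_rmse, recent_rmse, tolerance=0.2):
    """Flag drift when the recent window's RMSE exceeds the
    offline baseline by more than `tolerance` (default 20%)."""
    return recent_rmse > baseline_rmse * (1 + tolerance)
```

When the flag fires, a typical response is to trigger retraining on fresher data rather than to keep serving the stale model.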


Section 03

In-depth Analysis of Technical Architecture

Data Processing Layer: Apache Spark distributed computing processes the massive dataset: cleaning, missing-value handling, and feature transformation (geospatial coordinates into Manhattan distance and bearing; timestamps into hour, weekday, and holiday flags).
Model Training Layer: the Python ML ecosystem (e.g., XGBoost, LightGBM), with cross-validation and hyperparameter tuning to ensure the model generalizes.
Service Layer: FastAPI provides asynchronous, high-concurrency RESTful APIs covering prediction, health checks, and model version management.
Presentation Layer: React builds the interface, supporting map point selection, prediction result visualization, and feature importance display.


Section 04

Core Machine Learning Methodology

Feature Engineering: build derived features in four groups: geographic (region encoding, Euclidean/Manhattan distance), temporal (time slot, weekday, holiday), statistical (historical average speed and duration), and interaction (time slot combined with region).
Model Selection: try linear models as a baseline, tree ensemble models to capture non-linearity, and deep learning models for high-dimensional sparse features.
Evaluation Metrics: RMSE and MAE for absolute error, MAPE for relative error, quantile loss for prediction intervals, and dedicated checks of long-tail performance (rare long trips).
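A minimal, dependency-free sketch of the geographic and temporal feature derivation plus the MAPE metric described above. The field names and the flat-earth distance approximation are illustrative assumptions, not the project's schema.

```python
import math
from datetime import datetime

def trip_features(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon, pickup_time):
    """Derive geographic and temporal features from one trip record.

    Uses a rough flat-earth approximation (~111 km per degree), which is
    adequate for intra-city distances.
    """
    dt = datetime.fromisoformat(pickup_time)
    return {
        "manhattan_km": (abs(pickup_lat - dropoff_lat)
                         + abs(pickup_lon - dropoff_lon)) * 111.0,
        "bearing_deg": math.degrees(math.atan2(dropoff_lon - pickup_lon,
                                               dropoff_lat - pickup_lat)),
        "hour": dt.hour,
        "weekday": dt.weekday(),      # 0 = Monday
        "is_weekend": dt.weekday() >= 5,
    }

def mape(y_true, y_pred):
    """Mean absolute percentage error: the relative-error metric named above."""
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)
```

In the actual pipeline these transformations would run as Spark column expressions over the full dataset; the per-record form above is just easier to read.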


Section 05

Highlights of Engineering Practice

1. Modular design: data processing, model training, and deployment each have clear responsibilities, which eases extension and testing; 2. Scalability: Spark and FastAPI both scale horizontally to absorb growth in data volume and request load; 3. Full-stack integration: data engineering, ML, backend, and frontend are combined into a coherent end-to-end system.

Section 06

Practical Application Scenarios and Value

Application scenarios include: 1. Ride-hailing platform optimization: dynamic pricing, ETA display, driver scheduling; 2. Logistics route planning: delivery time optimization; 3. Urban planning: spatio-temporal traffic data that supports infrastructure decisions; 4. Teaching and interviews: an ML engineering case that covers system design questions.


Section 07

Improvement Directions and Reflections

1. Real-time feature updates: incorporate live traffic data to improve accuracy, which requires an online learning mechanism; 2. Model interpretability: use tools such as SHAP to refine feature contribution analysis; 3. A/B testing framework: validate new models against the incumbent before full rollout; 4. Edge case handling: bring in external data sources such as extreme weather and large events to cover unusual situations.
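SHAP itself requires the `shap` package; as a lightweight, dependency-free stand-in that captures the same intuition (how much each feature contributes to accuracy), here is a generic permutation-importance sketch. Nothing here is from the project; the model and metric are passed in as plain callables.

```python
import random

def permutation_importance(model, X, y, feature_idx, metric, n_repeats=5, seed=0):
    """Estimate one feature's importance by shuffling its column and
    measuring how much the error metric degrades on average.

    model   -- callable mapping a feature row (list) to a prediction
    metric  -- callable(y_true, y_pred) -> error (lower is better)
    """
    rng = random.Random(seed)
    base_error = metric(y, [model(row) for row in X])
    deltas = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, col)]
        deltas.append(metric(y, [model(row) for row in X_perm]) - base_error)
    return sum(deltas) / n_repeats
```

Unlike SHAP, this gives only a global per-feature score rather than per-prediction attributions, but it is often enough for a first pass at "which features actually matter."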

Section 08

Project Insights and Conclusion

Insights: ML engineers need end-to-end thinking (understanding the whole chain from data to product), pragmatic technology selection (choosing tools that fit the problem), and attention to reproducibility and maintainability. Conclusion: this project is a strong example of ML engineering, covering every stage of modern ML system development, and is a valuable reference for building ML engineering skills. The ability to turn technology into products is the core competency.