Zing Forum

E-commerce Logistics Delay Prediction: An End-to-End Solution Based on XGBoost and SHAP Explainable AI

This article presents a complete machine learning project that uses XGBoost and SHAP explanations to build a delay prediction system for e-commerce logistics. The project covers data exploration, feature engineering, and model training and evaluation, and provides an interactive Streamlit dashboard that helps supply chain teams identify high-risk orders early and understand the operational drivers behind delays.

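Machine Learning, XGBoost, SHAP, Explainable AI, E-commerce Logistics, Delay Prediction, Supply Chain Optimization, Streamlit, Feature Engineering, Data Science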
Published 2026-05-23 21:45 | Recent activity 2026-05-23 21:48 | Estimated read 7 min

Section 02

Project Background and Business Challenges

In the fast-growing e-commerce industry, on-time delivery directly affects both user experience and operating costs. Late deliveries lower customer satisfaction, raise return and refund costs, and damage brand reputation. Traditional logistics management relies on experience-based judgment, which makes it hard to pick out high-risk shipments from a large volume of orders.

This project addresses that pain point with an end-to-end machine learning system that estimates delay risk before an order is dispatched, so supply chain teams can intervene early. It is built on the Kaggle e-commerce logistics dataset (10,999 real records) and covers the full path from data analysis to modeling.

Section 03

Data Exploration and Feature Engineering

Data Overview

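The dataset contains 12 original features. The target variable, Reached.on.Time_Y.N, records whether the delivery arrived on time; despite its name, a value of 1 in this column marks a late delivery and 0 an on-time one. The classes are mildly imbalanced (about 60% delayed, 40% on time).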
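
As a reference point for the 60/40 split, the following minimal sketch computes what a trivial classifier that always predicts "delayed" would score. It assumes that "delayed" is the positive class for F1, which the source does not state; the function name is illustrative.

```python
def always_positive_baseline(pos_rate: float) -> dict:
    """Metrics for a classifier that predicts the positive class for every order.

    pos_rate is the share of positive examples (assumed here: delayed orders).
    """
    precision = pos_rate  # every prediction is positive, so precision equals prevalence
    recall = 1.0          # no positive example is ever missed
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": pos_rate, "f1": f1}


# 60% delayed, as reported in the data overview.
always_delayed = always_positive_baseline(0.60)
print(always_delayed)
```

Under these assumptions the baseline reaches 60% accuracy and an F1 of about 0.75. A constant predictor has a ROC-AUC of 0.5, so ROC-AUC is the metric on which a trivial model cannot score well.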

Key Insights from Exploratory Analysis

  • The delay rate varies little across warehouse blocks (58.6%-60.2%), which points to a systemic issue rather than a problem at a single site;
  • The shipping mode (air, land, sea) has limited impact (delay rate 58.8%-60.2%);
  • Cargo weight is related to delays: the delay rate is higher in the 2-4 kg range and lower above 4 kg.
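
The group-level delay rates above can be computed along the following lines. This is a minimal standard-library sketch, not the project's code: the column name Mode_of_Shipment and the sample rows are assumptions for illustration, and the target is treated as 1 = delayed.

```python
from collections import defaultdict


def delay_rate_by(records, key, target="Reached.on.Time_Y.N"):
    """Return {group value: share of delayed orders}, treating target == 1 as delayed."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for row in records:
        totals[row[key]] += 1
        delayed[row[key]] += row[target]
    return {group: delayed[group] / totals[group] for group in totals}


# Tiny made-up sample, only to show the expected input shape.
sample = [
    {"Mode_of_Shipment": "Ship", "Reached.on.Time_Y.N": 1},
    {"Mode_of_Shipment": "Ship", "Reached.on.Time_Y.N": 0},
    {"Mode_of_Shipment": "Flight", "Reached.on.Time_Y.N": 1},
    {"Mode_of_Shipment": "Road", "Reached.on.Time_Y.N": 0},
]
rates = delay_rate_by(sample, "Mode_of_Shipment")
print(rates)
```

With pandas, the same aggregation is typically written as `df.groupby(key)[target].mean()`.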

Feature Engineering

Derived features are designed to enhance model performance:

  • Discount-weight ratio;
  • Cost per gram;
  • Weight binning;
  • Discount category;
  • Customer value segmentation;
  • Shipment risk labeling, among others.
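
A few of these derived features could be computed per order as sketched below. The input column names (Weight_in_gms, Discount_offered, Cost_of_the_Product), the bin labels, and the discount cutoff of 10 are assumptions for illustration; only the 2-4 kg band comes from the analysis above.

```python
def derive_features(order: dict) -> dict:
    """Compute a handful of derived features for a single order record."""
    weight_g = order["Weight_in_gms"]
    discount = order["Discount_offered"]
    cost = order["Cost_of_the_Product"]
    if weight_g <= 0:
        raise ValueError("weight must be positive to compute ratios")

    # Weight binning around the 2-4 kg band highlighted in the exploratory analysis.
    if weight_g < 2000:
        weight_bin = "light"
    elif weight_g <= 4000:
        weight_bin = "medium"
    else:
        weight_bin = "heavy"

    return {
        "discount_weight_ratio": discount / weight_g,
        "cost_per_gram": cost / weight_g,
        "weight_bin": weight_bin,
        "high_discount": discount > 10,  # hypothetical cutoff, not from the source
    }


features = derive_features(
    {"Weight_in_gms": 2500, "Discount_offered": 20, "Cost_of_the_Product": 200}
)
print(features)
```

Keeping the derivation in one function makes it easier to apply the same transformation at training time and at prediction time.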

Section 04

Model Construction and Performance Comparison

Three mainstream models were compared:

Model                  Accuracy   F1 Score   ROC-AUC
Logistic Regression    62.6%      0.678      0.618
Random Forest          65.6%      0.679      0.668
XGBoost                68.0%      0.665      0.716

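XGBoost was selected as the final model because it has the highest ROC-AUC (0.716) and accuracy (68.0%) and handles non-linear relationships well, even though its F1 score (0.665) is slightly below that of the other two models.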
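
The three metrics measure different things, which is why a model can lead on one and trail on another. The sketch below implements them from scratch on made-up labels and scores; it is illustrative only and does not reproduce the project's evaluation. The 0.5 decision threshold is an assumption.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = delayed, scores are predicted delay probabilities.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.2]
y_pred = [int(s >= 0.5) for s in scores]  # assumed decision threshold of 0.5

print(accuracy(y_true, y_pred), f1(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, while ROC-AUC only depends on how the scores rank the orders. In practice, `accuracy_score`, `f1_score`, and `roc_auc_score` from `sklearn.metrics` cover the same calculations.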

Section 05

SHAP Explainability: Analysis of Model Decision Logic

SHAP analysis highlights the main drivers of the model's predictions:

  • Discount intensity: High-discount orders have a higher predicted delay probability, plausibly because order volume surges during promotions and strains resources;
  • Medium-weight cargo: The 2-4 kg range contributes strongly to predicted delays, possibly because these parcels fit neither small-parcel nor bulk handling well;
  • Transportation method: Relatively low impact, which suggests that warehousing and handling matter more;
  • Engineered features: Derived features such as cost per gram rank high in importance.
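
SHAP values are built on Shapley values, which attribute the difference between one prediction and a baseline prediction to the individual features. The sketch below computes exact Shapley values by brute force for a hypothetical two-feature risk score. It is not the project's XGBoost model and does not use the shap library; all names and coefficients are made up for illustration.

```python
from itertools import permutations


def shapley_values(predict, x, reference):
    """Exact Shapley values of predict(x) relative to predict(reference).

    Features not yet revealed keep their reference value. The cost grows
    factorially with the number of features, so this suits tiny examples only.
    """
    names = list(x)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(reference)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            totals[name] += updated - previous  # marginal contribution in this ordering
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_risk(f):
    """Hypothetical delay-risk score with a discount/weight interaction."""
    score = 0.3 + 0.02 * f["discount"]
    if 2000 <= f["weight_g"] <= 4000:
        score += 0.2
        if f["discount"] > 10:
            score += 0.1
    return score


order = {"discount": 15, "weight_g": 3000}
reference = {"discount": 0, "weight_g": 5000}
phi = shapley_values(toy_risk, order, reference)
print(phi)
```

The contributions always sum to the gap between the order's score and the reference score, which is what makes per-order explanations additive. For tree ensembles, the shap library's `TreeExplainer` computes such values far more efficiently than this brute-force loop.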

Section 06

Interactive Streamlit Dashboard: A Tool for Business Implementation

The dashboard includes four functional modules:

  • Overview Panel: Displays dataset statistics, delay rate trends, and key metrics;
  • Exploratory Analysis: Visualizes the relationship between features and delays, supporting self-service exploration;
  • Real-time Prediction: Enter order parameters to get a delay-risk estimate together with its SHAP explanation;
  • SHAP Insights: Global feature importance, individual explanations, and dependence plots.

Together, these modules serve different roles: executives, operations staff, and data teams.

Section 07

Business Value and Application Prospects

System Value:

  • Proactive Intervention: Take measures such as expedited processing and changing logistics providers for high-risk orders;
  • Resource Optimization: Dynamically allocate warehousing and transportation resources;
  • Customer Experience: Proactively set expectations to reduce negative experiences;
  • Root Cause Analysis: Continuously optimize operational processes through SHAP.
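
One way to turn a predicted delay probability into the interventions listed above is a simple tiering rule. The thresholds and action names in this sketch are hypothetical and would need to be tuned against real operating costs; nothing here is taken from the project itself.

```python
def triage(delay_probability: float, high: float = 0.7, medium: float = 0.5) -> str:
    """Map a predicted delay probability to a suggested action (hypothetical tiers)."""
    if not 0.0 <= delay_probability <= 1.0:
        raise ValueError("delay_probability must be in [0, 1]")
    if delay_probability >= high:
        return "expedite"          # e.g. priority handling or a different carrier
    if delay_probability >= medium:
        return "notify_customer"   # proactively set delivery expectations
    return "standard"


actions = [triage(p) for p in (0.82, 0.55, 0.30)]
print(actions)
```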

Because the project is open source, other e-commerce businesses can study and adapt it, and tools of this kind could become a standard part of e-commerce operations.