Zing Forum

End-to-End E-Commerce Logistics Prediction System: Intelligent Delivery Analysis Based on Brazil's Olist Dataset

This article introduces a complete e-commerce logistics prediction system that integrates 9 relational data tables, uses models like XGBoost to predict delivery time, and builds a Streamlit interactive dashboard covering NLP sentiment analysis and multi-dimensional business insights.

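Tags: e-commerce logistics prediction, XGBoost, Streamlit, data engineering, feature engineering, NLP sentiment analysis, Olist dataset, interactive dashboard, machine learning, delivery time prediction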
Published 2026-05-08 22:10 | Recent activity 2026-05-08 22:14 | Estimated read 7 min

Section 01

[Main Floor/Introduction] End-to-End E-Commerce Logistics Prediction System: Intelligent Delivery Analysis Based on Brazil's Olist Dataset

This project builds an end-to-end e-commerce logistics prediction system on Brazil's Olist dataset, which contains over 100,000 real orders placed between 2016 and 2018. It integrates 9 relational data tables and delivers three core functions: delivery time prediction (using models like XGBoost), NLP sentiment analysis of customer satisfaction, and an interactive Streamlit business dashboard. The system covers the entire workflow of data engineering, feature engineering, model training, NLP analysis, and visualization deployment, providing data-driven decision support for e-commerce operations.

Section 02

Project Background and Business Value

In e-commerce, delivery time prediction directly affects user experience and operational efficiency: delayed delivery reduces customer satisfaction, increases customer service costs, and raises return rates. Using Brazil's public Olist dataset (over 100,000 real orders), this project builds an end-to-end system that addresses delivery time prediction, customer satisfaction analysis, and business insight generation.

Section 03

Data Engineering and Feature Engineering

Data Engineering: Integrate 9 relational data tables (customer, seller, product, order, order item, payment, review, geography, etc.); handle missing values, invalid records, and data type issues; exclude features that leak the target (e.g., the actual delivery date, which is known only after delivery); and keep only delivered orders, which are the ones with an observed delivery time to learn from.
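
The cleaning and merging steps above can be sketched with pandas roughly as follows. This is a minimal illustration on toy data, not the project's actual pipeline: the three small tables stand in for the nine real ones, the column names are assumed to follow the public Olist CSVs, and the target name `delivery_days` is made up for this example.

```python
import pandas as pd

# Toy stand-ins for three of the nine tables (values are invented).
orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
    "order_status": ["delivered", "canceled", "delivered"],
    "order_purchase_timestamp": [
        "2017-01-02 10:00:00", "2017-01-03 11:00:00", "2017-01-05 09:30:00"],
    "order_delivered_customer_date": [
        "2017-01-10 15:00:00", None, "2017-01-12 09:30:00"],
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_state": ["SP", "RJ", "MG"],
})
items = pd.DataFrame({
    "order_id": ["o1", "o1", "o3"],
    "seller_id": ["s1", "s2", "s1"],
    "price": [50.0, 20.0, 80.0],
})

# Fix data types: timestamps arrive as strings.
for col in ["order_purchase_timestamp", "order_delivered_customer_date"]:
    orders[col] = pd.to_datetime(orders[col])

# Keep only delivered orders that actually have a delivery date.
mask = (orders["order_status"] == "delivered") & \
    orders["order_delivered_customer_date"].notna()
delivered = orders[mask].copy()

# Build the target (days from purchase to delivery) before dropping the date.
elapsed = (delivered["order_delivered_customer_date"]
           - delivered["order_purchase_timestamp"])
delivered["delivery_days"] = elapsed.dt.total_seconds() / 86400

# Collapse order items to one row per order so the merge cannot duplicate orders.
order_items = items.groupby("order_id", as_index=False).agg(
    n_items=("price", "size"),
    total_price=("price", "sum"),
    n_sellers=("seller_id", "nunique"),
)

dataset = (delivered
           .merge(customers, on="customer_id", how="left")
           .merge(order_items, on="order_id", how="left"))

# The actual delivery date is only known after the fact, so it must not be a feature.
dataset = dataset.drop(columns=["order_delivered_customer_date"])
```

Aggregating the item table before merging is one way to keep a one-row-per-order grain when an order involves several sellers; other designs (for example, predicting per order item) are equally possible.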

Feature Engineering: Extract features along several dimensions: geographic (customer-seller distance, in-state/cross-state indicator), product (volume, weight), time (month/season/holiday), seller performance (average delivery days), and payment (method, number of installments). Use target encoding for high-cardinality categorical features and log transformation for skewed numerical features.
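
A few of these features can be sketched in plain Python. The post does not say how the distance or the target encoding was computed, so the haversine formula and the additive smoothing used here are assumptions, and the function names and the `smoothing` parameter are hypothetical.

```python
import math
from collections import defaultdict


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def same_state(customer_state, seller_state):
    """In-state (1) vs cross-state (0) indicator."""
    return int(customer_state == seller_state)


def log_feature(value):
    """log(1 + x) compresses the long right tail of weight or volume; safe at 0."""
    return math.log1p(value)


def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a mean target shrunk toward the global mean.

    Rare categories get pulled toward the global mean, which limits
    overfitting. Fit this on the training split only; fitting on all rows
    would leak the target into the features.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
```

For example, `haversine_km(-23.55, -46.63, -22.91, -43.17)` gives the approximate distance between São Paulo and Rio de Janeiro, and categories unseen at prediction time can fall back to the global mean.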

Section 04

Model Training and NLP Sentiment Analysis

Model Training: Use time-based cross-validation, in which each fold trains on earlier orders and validates on later ones, to avoid information leakage. Compare linear regression, random forest, and XGBoost models, and select XGBoost for its built-in regularization against overfitting and its parallel training. Evaluate accuracy using MAE, RMSE, and R².
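
The time-split idea and the three metrics can be sketched without any ML library. The expanding-window scheme below is an assumption (the post only says the data is split by time), and both function names are hypothetical; in practice a regressor such as xgboost's `XGBRegressor` would be fit on each training window and scored on the following test window.

```python
import math


def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs for rows sorted by purchase time.

    Each fold trains on everything before a cut-off and tests on the next
    chunk, so the model never sees orders from the future.
    """
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for k in range(1, n_splits + 1):
        train_end = fold * k
        # The last fold absorbs any remainder rows.
        test_end = fold * (k + 1) if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the true values are constant.
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}
```

MAE is the easiest of the three to explain to operations staff (average error in days), while RMSE penalizes the large misses that matter most for delayed orders.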

NLP Sentiment Analysis: Preprocess the Portuguese reviews (cleaning, tokenization, lemmatization), translate them to English via Google Translate, then use TextBlob to calculate sentiment polarity (positive/negative/neutral) and extract keywords. The analysis finds that delayed delivery is highly correlated with negative reviews.
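
TextBlob exposes polarity as a float in [-1, 1] via `TextBlob(text).sentiment.polarity`. How that score was turned into the three labels is not stated in the post, so the cut-off of 0.1 below is an assumption, as are the function names. The sketch takes polarity scores as input, so the translation and TextBlob steps are left out.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to positive / negative / neutral.

    The threshold is an illustrative choice, not a value from the project.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def negative_share_by_delay(records):
    """Share of negative reviews for delayed vs on-time orders.

    records: iterable of (is_delayed: bool, polarity: float) pairs.
    Returns a dict with keys "delayed" and "on_time"; a group with no
    reviews maps to None.
    """
    counts = {True: [0, 0], False: [0, 0]}  # [negative, total] per group
    for is_delayed, polarity in records:
        group = counts[bool(is_delayed)]
        group[1] += 1
        if label_polarity(polarity) == "negative":
            group[0] += 1
    return {
        name: (counts[key][0] / counts[key][1] if counts[key][1] else None)
        for key, name in ((True, "delayed"), (False, "on_time"))
    }
```

Comparing the two shares is a simple way to check the delay-versus-sentiment relationship before reaching for a formal correlation measure.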

Section 05

Streamlit Interactive Dashboard

Develop a 5-page dashboard:

  • Overview page: Total orders, revenue KPIs, monthly trends, state order heatmap, top product ranking;
  • Delivery analysis page: Average delivery days per state, on-time/delayed ratio, in-state vs cross-state efficiency;
  • Prediction page: Enter customer, seller, and product information and get the delivery days predicted by the XGBoost model in real time;
  • Seller performance page: Top sellers' revenue, delivery speed ranking, growth trend;
  • Customer analysis page: Rating distribution, relationship between delay and rating, state revenue contribution, payment method proportion.
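
A prediction page along these lines could look like the sketch below. The input fields, the feature names, and the `predict_days` callable (standing in for the trained XGBoost model) are all hypothetical; the post does not list the actual form fields. Streamlit is imported inside the render function so that the pure helper can be used and tested without it.

```python
def build_feature_row(distance_km, weight_g, same_state, purchase_month):
    """Turn raw form inputs into the feature dict passed to the model."""
    if not 1 <= purchase_month <= 12:
        raise ValueError("purchase_month must be between 1 and 12")
    return {
        "distance_km": float(distance_km),
        "weight_g": float(weight_g),
        "same_state": int(bool(same_state)),
        "purchase_month": int(purchase_month),
    }


def render_prediction_page(predict_days):
    """Render the form; predict_days maps a feature dict to a number of days."""
    import streamlit as st  # lazy import keeps build_feature_row standalone

    st.title("Delivery time prediction")
    distance_km = st.number_input(
        "Customer-seller distance (km)", min_value=0.0, value=300.0)
    weight_g = st.number_input("Product weight (g)", min_value=0.0, value=500.0)
    in_state = st.checkbox("Customer and seller in the same state", value=True)
    month = st.selectbox("Purchase month", list(range(1, 13)))

    if st.button("Predict"):
        row = build_feature_row(distance_km, weight_g, in_state, month)
        st.metric("Estimated delivery days", f"{predict_days(row):.1f}")
```

In an app script, `render_prediction_page` would be called with a function that wraps the loaded model, and the file would be started with `streamlit run`.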

Section 06

Challenges and Solutions

Key challenges solved in the project:

  • Complexity of multi-table association: Carefully design merging strategies to handle multi-seller order aggregation;
  • Geographic data noise: Collapse the multiple coordinate records of each zip code into a single point using the median latitude and longitude;
  • Time-series leakage risk: Strictly split the dataset by time;
  • Multilingual processing: Introduce a translation layer to unify review languages;
  • Skewed feature distribution: Log transformation and binning improve robustness.
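
The zip-code fix can be sketched with the standard library alone. The input layout of (zip prefix, latitude, longitude) tuples and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def median_coords_by_zip(rows):
    """Reduce noisy geolocation rows to one (lat, lng) point per zip code.

    rows: iterable of (zip_prefix, lat, lng) tuples. The median is used
    instead of the mean so that a few badly wrong coordinates cannot drag
    the point away from where most records agree.
    """
    lats = defaultdict(list)
    lngs = defaultdict(list)
    for zip_prefix, lat, lng in rows:
        lats[zip_prefix].append(lat)
        lngs[zip_prefix].append(lng)
    return {z: (median(lats[z]), median(lngs[z])) for z in lats}
```

With three records for one zip code, two near São Paulo and one far-off outlier, the result stays at the São Paulo coordinates, whereas a mean would land somewhere in between.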

Section 07

Application Scenarios and Summary

Application Scenarios: Display the estimated delivery time before order confirmation, monitor delayed orders for the operations team, rate sellers, and optimize inventory via demand prediction.

Summary: This project demonstrates a complete data science workflow from raw data to a production-level application. It ties machine learning closely to business scenarios, makes the results accessible to non-technical users through interactive dashboards, and serves as a comprehensive reference case covering data engineering, feature engineering, model training, NLP, and visualization. Future directions include deep learning models, Transformer-based review understanding, and recommendation systems.