End-to-End E-Commerce Logistics Prediction System: Intelligent Delivery Analysis Based on Brazil's Olist Dataset

This article introduces a complete e-commerce logistics prediction system that integrates 9 relational data tables, uses models like XGBoost to predict delivery time, and builds a Streamlit interactive dashboard covering NLP sentiment analysis and multi-dimensional business insights.

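Tags: e-commerce logistics prediction, XGBoost, Streamlit, data engineering, feature engineering, NLP sentiment analysis, Olist dataset, interactive dashboard, machine learning, delivery time prediction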

Published 2026-05-08 22:10 | Recent activity 2026-05-08 22:14 | Estimated read 7 min

Section 01

Introduction

This project builds a complete end-to-end e-commerce logistics prediction system based on Brazil's Olist dataset of over 100,000 real orders from 2016 to 2018. It integrates 9 relational data tables and achieves three core functions: delivery time prediction (using models like XGBoost), NLP sentiment analysis for customer satisfaction, and a Streamlit interactive business dashboard. The system covers the entire workflow of data engineering, feature engineering, model training, NLP analysis, and visualization deployment, providing data-driven decision support for e-commerce operations.

Section 02

Project Background and Business Value

In e-commerce, the accuracy of delivery time predictions directly affects user experience and operational efficiency. Late deliveries reduce customer satisfaction, increase customer service costs, and raise return rates. Using Brazil's public Olist dataset (over 100,000 real orders), this project builds an end-to-end system that addresses three problems: predicting delivery time, analyzing customer satisfaction, and surfacing business insights.

Section 03

Data Engineering and Feature Engineering

Data Engineering: The pipeline integrates 9 relational data tables (customer, seller, product, order, order item, payment, review, geography, etc.). It handles missing values, invalid records, and data type issues, and it removes features that would leak the target, such as the actual delivery date. Only delivered orders are retained, so every training row has a real, observed delivery time.
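The cleaning steps above can be sketched with plain Python records. This is a minimal illustration, not the project's actual pipeline: the column names follow the public Olist CSVs but should be treated as assumptions, and `build_training_rows` and `LEAKAGE_COLUMNS` are hypothetical names introduced here.

```python
from datetime import datetime

# Assumed column names, modeled on the public Olist CSV headers.
# Delivery timestamps are leakage: they are unknown at prediction time.
LEAKAGE_COLUMNS = {"order_delivered_customer_date", "order_delivered_carrier_date"}


def parse_ts(value):
    """Parse a timestamp in the 'YYYY-MM-DD HH:MM:SS' format."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


def build_training_rows(orders, order_items):
    """Join orders with per-order item aggregates; keep delivered orders only."""
    # Aggregate items first so a multi-item or multi-seller order yields one row.
    items_by_order = {}
    for item in order_items:
        agg = items_by_order.setdefault(
            item["order_id"], {"n_items": 0, "total_price": 0.0, "sellers": set()}
        )
        agg["n_items"] += 1
        agg["total_price"] += item["price"]
        agg["sellers"].add(item["seller_id"])

    rows = []
    for order in orders:
        if order["order_status"] != "delivered":
            continue  # undelivered orders have no target value
        delivered = order.get("order_delivered_customer_date")
        agg = items_by_order.get(order["order_id"])
        if not delivered or agg is None:
            continue  # missing delivery date or no matching items
        purchased = parse_ts(order["order_purchase_timestamp"])
        delivery_days = (parse_ts(delivered) - purchased).total_seconds() / 86400
        if delivery_days < 0:
            continue  # invalid record: delivered before purchase
        # Compute the target first, then drop the columns that would leak it.
        row = {k: v for k, v in order.items() if k not in LEAKAGE_COLUMNS}
        row.update(
            n_items=agg["n_items"],
            total_price=agg["total_price"],
            n_sellers=len(agg["sellers"]),
            delivery_days=delivery_days,
        )
        rows.append(row)
    return rows
```

The target `delivery_days` is derived from the delivery timestamp before that column is dropped, so the label survives while the leaking feature does not.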

Feature Engineering: Features are extracted along several dimensions: geographic (customer-seller distance, in-state/cross-state indicator), product (volume, weight), time (month, season, holiday), seller performance (average delivery days), and payment (method, number of installments). High-cardinality categorical features are target-encoded, and skewed numerical features are log-transformed.
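Three of these transformations can be sketched in a few lines. The article does not say how the customer-seller distance is computed, so the haversine formula below is an assumption, as are the smoothing scheme for target encoding and all function names.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def log_transform(value):
    """log(1 + x): compresses a right-skewed, non-negative feature such as weight."""
    return math.log1p(value)


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Smoothed mean of the target per category (hypothetical smoothing scheme).

    Rare categories are pulled toward the global mean. Returns the mapping and
    the global mean, which serves as the fallback for unseen categories.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    encoding = {
        c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
        for c in counts
    }
    return encoding, global_mean
```

Because target encoding uses the label, the mapping should be fitted on the training portion only and then applied to validation and test rows; fitting it on the full dataset would reintroduce leakage.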

Section 04

Model Training and NLP Sentiment Analysis

Model Training: Models are validated with time-series cross-validation, splitting by time so that no future information leaks into training. Linear regression, random forest, and XGBoost are compared, and XGBoost is selected for its built-in regularization against overfitting and its parallel training. Accuracy is evaluated with MAE, RMSE, and R².
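The validation scheme and the three metrics can be sketched without any ML library. The expanding-window split below is one common form of time-series cross-validation, similar in spirit to scikit-learn's `TimeSeriesSplit`; the article does not specify the exact fold layout, so treat the fold sizes and function names as assumptions.

```python
import math


def expanding_window_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs over time-sorted rows.

    Each fold trains on everything before a cutoff and tests on the block
    right after it, so the model never sees data from its own future.
    """
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold_size * k
        # The last fold absorbs any remainder rows.
        test_end = train_end + fold_size if k < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R² is undefined when the true values have zero variance.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}
```

The indices assume the rows are already sorted by purchase time. A model that always predicts the mean of the true values scores R² = 0, which makes it a useful baseline when comparing the three model families.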

NLP Sentiment Analysis: Portuguese reviews are preprocessed (cleaning, tokenization, lemmatization) and translated to English via Google Translate. TextBlob is then used to compute sentiment polarity (positive/negative/neutral) and extract keywords. The analysis finds that delayed delivery is highly correlated with negative reviews.
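A small sketch of the labeling step and the delay comparison follows. In the described pipeline the polarity score would come from TextBlob (`TextBlob(text).sentiment.polarity`, a float in [-1, 1]); here it is passed in directly so the example has no dependencies. The ±0.1 threshold and both function names are assumptions, since the article does not state its cutoffs.

```python
def polarity_to_label(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to positive / negative / neutral."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def negative_share_by_delay(reviews):
    """Share of negative reviews among delayed and on-time orders.

    `reviews` is an iterable of (was_delayed, polarity) pairs.
    A group with no reviews maps to None.
    """
    counts = {True: [0, 0], False: [0, 0]}  # [negative count, total count]
    for was_delayed, polarity in reviews:
        bucket = counts[bool(was_delayed)]
        bucket[1] += 1
        if polarity_to_label(polarity) == "negative":
            bucket[0] += 1
    return {
        ("delayed" if delayed else "on_time"): (neg / total if total else None)
        for delayed, (neg, total) in counts.items()
    }
```

Comparing the two shares is a simple way to check the reported link between delays and negative reviews on one's own data.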

Section 05

Streamlit Interactive Dashboard

Develop a 5-page dashboard:

Overview page: Total orders, revenue KPIs, monthly trends, state order heatmap, top product ranking;
Delivery analysis page: Average delivery days per state, on-time/delayed ratio, in-state vs cross-state efficiency;
Prediction page: Real-time input of customer/seller/product information, call XGBoost to predict delivery days;
Seller performance page: Top sellers' revenue, delivery speed ranking, growth trend;
Customer analysis page: Rating distribution, relationship between delay and rating, state revenue contribution, payment method proportion.
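As an illustration of the aggregation behind the delivery analysis page, the sketch below computes average delivery days and the on-time ratio per state. The field names (`customer_state`, `delivery_days`, `estimated_days`) and the function name are assumptions; the resulting dictionary could then be handed to Streamlit widgets such as `st.metric` or `st.bar_chart`.

```python
from collections import defaultdict


def delivery_stats_by_state(orders):
    """Per-state order count, average delivery days, and on-time ratio.

    An order counts as on time when its actual delivery days do not exceed
    the estimate shown to the customer.
    """
    grouped = defaultdict(list)
    for order in orders:
        grouped[order["customer_state"]].append(order)

    stats = {}
    for state, group in grouped.items():
        n = len(group)
        on_time = sum(1 for o in group if o["delivery_days"] <= o["estimated_days"])
        stats[state] = {
            "orders": n,
            "avg_delivery_days": sum(o["delivery_days"] for o in group) / n,
            "on_time_ratio": on_time / n,
        }
    return stats
```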

Section 06

Challenges and Solutions

Key challenges solved in the project:

Complexity of multi-table association: Carefully design merging strategies to handle multi-seller order aggregation;
Geographic data noise: Collapse the multiple coordinate entries recorded for the same zip code into a single median coordinate;
Time-series leakage risk: Strictly split the dataset by time;
Multilingual processing: Introduce a translation layer to unify review languages;
Skewed feature distribution: Log transformation and binning improve robustness.
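The zip code normalization can be sketched as follows. The input layout (zip code prefix, latitude, longitude) mirrors the Olist geolocation table, but the exact columns and the function name are assumptions.

```python
from collections import defaultdict
from statistics import median


def median_coordinates(points):
    """Collapse noisy (zip_prefix, lat, lng) entries to one coordinate per zip.

    The median is taken per axis, so a few wildly misplaced points do not
    drag the result the way a mean would.
    """
    lats = defaultdict(list)
    lngs = defaultdict(list)
    for zip_prefix, lat, lng in points:
        lats[zip_prefix].append(lat)
        lngs[zip_prefix].append(lng)
    return {z: (median(lats[z]), median(lngs[z])) for z in lats}
```

With three entries for one zip code, one of them far off, the median still returns a coordinate from the plausible cluster, which keeps the derived customer-seller distances stable.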

Section 07

Application Scenarios and Summary

Application Scenarios: Displaying an estimated delivery time before order confirmation, monitoring delayed orders for the operations team, rating sellers, and optimizing inventory through demand prediction.

Summary: This project demonstrates a complete data science workflow from raw data to a production-level application. It ties machine learning closely to business scenarios, makes the results accessible to non-technical users through an interactive dashboard, and serves as a reference case covering data engineering, feature engineering, model training, NLP, and visualization. Future directions include deep learning models, Transformer-based review understanding, and recommendation systems.
