Practical E-Commerce Data Analysis: From SQL Cleaning to Machine Learning for Customer Satisfaction Prediction

A complete Brazilian e-commerce data analysis project covering SQL data cleaning, Power BI visualization, Python exploratory analysis, and machine learning modeling. The core finding is the decisive impact of delivery timeliness on customer satisfaction.

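Tags: e-commerce data analysis, SQL data cleaning, machine learning, customer satisfaction prediction, Power BI visualization, Python data analysis, random forest, delivery optimization, Olist dataset, end-to-end data analysis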
Published 2026-05-15 05:56 | Recent activity 2026-05-15 06:00 | Estimated read 6 min

Section 01

[Introduction] Practical E-Commerce Data Analysis: Delivery Timeliness Determines Customer Satisfaction

This project uses real transaction data from the Brazilian e-commerce platform Olist and runs an end-to-end analysis: SQL data cleaning, Power BI visualization, Python exploratory analysis, and machine learning modeling. The core finding is that delivery performance is the strongest driver of customer satisfaction. The project spans the full business chain and gives e-commerce operations a data-driven basis for decisions.

Section 02

Project Background and Data Source

The analysis uses the Olist Brazilian e-commerce dataset, which covers orders, customers, sellers, products, payments, deliveries, and reviews. Key figures:

  • Total orders: 99,441
  • Paid orders: 99,440
  • Successfully delivered orders: 96,478
  • Total payment amount: 16,008,872.12 BRL
  • Average review score: 4.09

The completeness of the dataset makes it excellent material for analyzing the factors that affect customer satisfaction.

Section 03

Analysis Methods and Tech Stack

Data Cleaning (SQL)

Key decisions: aggregating review data, summarizing payment data, standardizing product categories, merging order-item rows, and handling missing values according to their business meaning. Together these produce a unified view, analysis_orders_master.
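The cleaning decisions above can be sketched roughly as follows. This is a minimal illustration, not the project's actual SQL: only the view name analysis_orders_master comes from the article, while the table names, column names, and sample rows are assumptions. It uses Python's built-in sqlite3 module so the SQL can be run as-is.

```python
import sqlite3

# Toy stand-ins for the raw tables. Table names, column names, and rows
# are assumptions made for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders   (order_id TEXT PRIMARY KEY, order_status TEXT);
CREATE TABLE reviews  (order_id TEXT, review_score INTEGER);
CREATE TABLE payments (order_id TEXT, payment_value REAL);
CREATE TABLE items    (order_id TEXT, product_category TEXT, price REAL);

INSERT INTO orders   VALUES ('o1', 'delivered'), ('o2', 'delivered'), ('o3', 'canceled');
INSERT INTO reviews  VALUES ('o1', 5), ('o1', 3), ('o2', 1);
INSERT INTO payments VALUES ('o1', 60.0), ('o1', 40.0), ('o2', 25.5);
INSERT INTO items    VALUES ('o1', ' Health_Beauty ', 45.0),
                            ('o1', 'health_beauty', 45.0),
                            ('o2', NULL, 20.0);

-- One row per order. Each child table is aggregated BEFORE the join,
-- so multiple reviews, payments, or items cannot multiply each other.
CREATE VIEW analysis_orders_master AS
SELECT o.order_id,
       o.order_status,
       r.avg_review_score,                            -- stays NULL when un-reviewed
       COALESCE(p.total_payment, 0.0)  AS total_payment,
       COALESCE(i.item_count, 0)       AS item_count,
       COALESCE(i.category, 'unknown') AS category
FROM orders o
LEFT JOIN (SELECT order_id, AVG(review_score) AS avg_review_score
           FROM reviews GROUP BY order_id) r ON r.order_id = o.order_id
LEFT JOIN (SELECT order_id, SUM(payment_value) AS total_payment
           FROM payments GROUP BY order_id) p ON p.order_id = o.order_id
LEFT JOIN (SELECT order_id,
                  COUNT(*) AS item_count,
                  -- standardize spelling, then pick one category per order
                  -- (MIN is an arbitrary but deterministic choice)
                  MIN(LOWER(TRIM(product_category))) AS category
           FROM items GROUP BY order_id) i ON i.order_id = o.order_id;
""")

rows = conn.execute(
    "SELECT * FROM analysis_orders_master ORDER BY order_id"
).fetchall()
for row in rows:
    print(row)
```

Aggregating before joining is the design choice worth noting: joining the raw review, payment, and item tables directly would duplicate rows and inflate totals. Leaving the review score as NULL for un-reviewed orders, rather than filling it, is one example of treating a missing value according to its business meaning.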

Exploratory Analysis

The exploratory analysis covers the order status distribution, monthly revenue trends, and the identification of high-revenue categories such as health & beauty and watches & gifts.
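As a rough sketch of how the monthly revenue trend and the category ranking might be computed on top of the SQLite store, the queries below group a toy table by month and by category. The table order_facts, its columns, and the sample values are hypothetical and not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical order-level table; names and values are illustrative only.
CREATE TABLE order_facts (order_id TEXT, purchased_at TEXT, category TEXT, revenue REAL);
INSERT INTO order_facts VALUES
  ('o1', '2018-01-05 10:00:00', 'health_beauty', 100.0),
  ('o2', '2018-01-20 09:30:00', 'watches_gifts', 250.0),
  ('o3', '2018-02-02 14:15:00', 'health_beauty',  80.0);
""")

# Monthly revenue trend: truncate the purchase timestamp to year-month.
monthly_revenue = conn.execute("""
    SELECT strftime('%Y-%m', purchased_at) AS month,
           SUM(revenue)                    AS total_revenue
    FROM order_facts
    GROUP BY month
    ORDER BY month
""").fetchall()

# Categories ranked by total revenue, highest first.
top_categories = conn.execute("""
    SELECT category, SUM(revenue) AS total_revenue
    FROM order_facts
    GROUP BY category
    ORDER BY total_revenue DESC
""").fetchall()

print(monthly_revenue)
print(top_categories)
```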

Tech Stack

Data storage: SQLite; Query: SQL; Processing: Python/Pandas; Visualization: Matplotlib/Power BI; Machine learning: Scikit-learn/XGBoost/LightGBM; Development environment: Jupyter/VS Code.

Section 04

Core Findings: Delivery Timeliness Dominates Satisfaction

  1. Negative correlation between delivery delay and reviews: Orders delayed by more than 7 days receive the lowest review scores; the overall delay rate is 8.11%.
  2. Significant regional differences: Some states have delay rates far above the average, which is related to factors such as logistics infrastructure.
  3. Risks in high-revenue categories: Some high-revenue categories have lower review scores, which poses a satisfaction risk.
  4. Seller risk assessment: High-risk sellers are identified by combining revenue, delivery performance, and review scores.
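The first finding rests on two quantities: the delay in days and the average review score per delay bucket. A minimal sketch of that calculation is shown below. It assumes that "delay" means delivery after the estimated delivery date. The 7-day cut-off comes from the finding itself, while the sample records and the other bucket edges are invented for illustration.

```python
from datetime import date
from statistics import mean

# Hypothetical records: (estimated delivery date, actual delivery date, review score).
orders = [
    (date(2018, 1, 10), date(2018, 1, 8), 5),   # 2 days early
    (date(2018, 1, 10), date(2018, 1, 10), 4),  # exactly on time
    (date(2018, 1, 10), date(2018, 1, 13), 3),  # 3 days late
    (date(2018, 1, 10), date(2018, 1, 20), 1),  # 10 days late
]


def delay_bucket(delay_days):
    """Map a delay in days (negative means early) to a coarse bucket."""
    if delay_days <= 0:
        return "on time"
    if delay_days <= 7:
        return "1-7 days late"
    return "more than 7 days late"


scores_by_bucket = {}
for estimated, delivered, score in orders:
    delay_days = (delivered - estimated).days
    scores_by_bucket.setdefault(delay_bucket(delay_days), []).append(score)

# Average review score per bucket, and the share of orders delivered late.
avg_score_by_bucket = {bucket: mean(scores) for bucket, scores in scores_by_bucket.items()}
delay_rate = sum((delivered - estimated).days > 0 for estimated, delivered, _ in orders) / len(orders)

print(avg_score_by_bucket)
print(delay_rate)
```

The same grouping, applied per state or per seller instead of per bucket, would be one way to approach the regional and seller-risk comparisons.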

Section 05

Machine Learning Modeling: Predicting Low-Rating Reviews

Objective

A binary classification task: low-rating review (1) versus non-low-rating review (0). Orders without a review are excluded.

Feature Engineering

Columns directly related to reviews are removed to avoid data leakage; the model relies on attributes such as delivery and payment.
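A minimal sketch of the labeling and leakage-removal step might look like the following. The article does not state where the "low rating" cut-off lies, so the threshold of 2 is an assumption, and the column names are hypothetical.

```python
# Assumption: scores of 1 or 2 count as "low"; the article does not give the cut-off.
LOW_RATING_MAX = 2

# Hypothetical names for columns derived from the review itself. Keeping any
# of them as a feature would leak the target into the model.
LEAKY_COLUMNS = {"review_score", "review_comment_length"}


def build_dataset(rows):
    """Return (features, labels), skipping orders that have no review."""
    features, labels = [], []
    for row in rows:
        score = row.get("review_score")
        if score is None:
            continue  # un-reviewed orders are excluded from modeling
        labels.append(1 if score <= LOW_RATING_MAX else 0)
        features.append({k: v for k, v in row.items() if k not in LEAKY_COLUMNS})
    return features, labels


sample_rows = [
    {"delay_days": 10, "payment_value": 50.0, "review_score": 1, "review_comment_length": 120},
    {"delay_days": -2, "payment_value": 80.0, "review_score": 5, "review_comment_length": 0},
    {"delay_days": 0, "payment_value": 20.0, "review_score": None, "review_comment_length": None},
]
features, labels = build_dataset(sample_rows)
print(features)
print(labels)
```

The label is computed from the review score before that column is dropped, so the target survives while no review-derived value reaches the feature set.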

Model Selection

After testing multiple algorithms, a threshold-tuned Random Forest was selected. Its performance:

  • Accuracy: 0.8848
  • Precision: 0.6456
  • Recall: 0.4727
  • F1 score: 0.5457
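"Threshold-tuned" means the cut-off applied to the predicted probability of a low rating is chosen deliberately rather than left at the default of 0.5. The sketch below shows one common way to do this, picking the candidate threshold with the highest F1 score. It is written in plain Python on made-up labels and probabilities, and the article does not say which tuning criterion the project used. In practice the probabilities would come from the trained classifier (for a scikit-learn Random Forest, its predict_proba method), and the threshold would be chosen on a validation set rather than the test set.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_threshold(y_true, probabilities, candidates):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best_t, best_f1 = None, -1.0
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in probabilities]
        f1 = precision_recall_f1(y_true, y_pred)[2]
        if f1 > best_f1:
            best_t, best_f1 = threshold, f1
    return best_t, best_f1


# Made-up validation labels and predicted probabilities of a low rating.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probabilities = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
threshold, f1 = best_threshold(y_true, probabilities, [0.3, 0.4, 0.5, 0.6])
print(threshold, round(f1, 3))

# Consistency check on the reported metrics: F1 = 2PR / (P + R).
# With P = 0.6456 and R = 0.4727 this gives about 0.5458, which matches
# the reported 0.5457 up to rounding of the inputs.
implied_f1 = 2 * 0.6456 * 0.4727 / (0.6456 + 0.4727)
print(round(implied_f1, 4))
```

Lowering the threshold trades precision for recall, which is why accuracy alone (0.8848) says little here: the reported recall of 0.4727 shows that more than half of the low-rating reviews are still missed.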

Feature Importance

Delay days, whether the order was delivered, and delivery days are among the core features.

Limitations

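Because its most important features (delay days, delivery days) are only known once an order has been delivered, this is a post-delivery risk model and cannot give early warning. Building a separate pre-delivery prediction model is recommended.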

Section 06

Application Value and Business Insights

Visualization Application

The Power BI dashboard presents key metrics such as total orders, delay rate, and monthly trends to support management decisions.

Business Insights

  1. Delivery experience is a core competitive advantage;
  2. Data-driven analysis uncovers hidden problems;
  3. Prediction models give operations a reference point;
  4. End-to-end analysis makes the insights credible.

The project offers a complete reference case for learners of e-commerce data analysis and demonstrates the value of combining technology with business.