# Practical E-Commerce Data Analysis: From SQL Cleaning to Machine Learning for Customer Satisfaction Prediction

> A complete Brazilian e-commerce data analysis project covering SQL data cleaning, Power BI visualization, Python exploratory analysis, and machine learning modeling. The core finding is the decisive impact of delivery timeliness on customer satisfaction.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-14T21:56:07.000Z
- Last activity: 2026-05-14T22:00:15.927Z
- Popularity: 145.9
- Keywords: e-commerce data analysis, SQL data cleaning, machine learning, customer satisfaction prediction, Power BI visualization, Python data analysis, random forest, delivery optimization, Olist dataset, end-to-end data analysis
- Page link: https://www.zingnex.cn/en/forum/thread/sql-2bfaa8e7
- Canonical: https://www.zingnex.cn/forum/thread/sql-2bfaa8e7

---

## [Introduction] Practical E-Commerce Data Analysis: Delivery Timeliness Determines Customer Satisfaction

This project uses real transaction data from the Brazilian e-commerce platform Olist and carries the analysis end to end: SQL data cleaning, Power BI visualization, Python exploratory analysis, and machine learning modeling. The core finding is that **delivery performance is the strongest driver of customer satisfaction**. The project covers the complete business chain and gives e-commerce operations a data-driven basis for decisions.

## Project Background and Data Source

The project uses the Olist Brazilian e-commerce dataset, which covers orders, customers, sellers, products, payments, deliveries, and reviews. Key figures:
- Total orders: 99,441
- Paid orders: 99,440
- Successfully delivered orders: 96,478
- Total payment amount: 16,008,872.12 BRL
- Average review score: 4.09

Because the dataset is this complete, it is well suited to analyzing the factors that affect customer satisfaction.

## Analysis Methods and Tech Stack

### Data Cleaning (SQL)
The key cleaning decisions were to aggregate review data, summarize payment data, standardize product categories, merge order item rows, and handle missing values according to their business meaning. The result is a unified view, `analysis_orders_master`.
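
The aggregation step can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not the project's actual SQL: the table names (`orders`, `order_reviews`, `order_payments`), the column selection, and the rule that an order with no payment rows gets a total of 0 are all assumptions. Only the view name `analysis_orders_master` comes from the project. The point is that reviews and payments are reduced to one row per order before the join, so the join cannot multiply rows.

```python
import sqlite3

# Hypothetical table and column names; adjust them to the real schema.
MASTER_VIEW_SQL = """
DROP VIEW IF EXISTS analysis_orders_master;
CREATE VIEW analysis_orders_master AS
SELECT
    o.order_id,
    o.order_status,
    r.review_score,                                  -- stays NULL when there is no review
    COALESCE(p.total_payment, 0.0) AS total_payment  -- assumption: no payment rows means 0
FROM orders AS o
LEFT JOIN (
    SELECT order_id, AVG(review_score) AS review_score
    FROM order_reviews
    GROUP BY order_id
) AS r ON r.order_id = o.order_id
LEFT JOIN (
    SELECT order_id, SUM(payment_value) AS total_payment
    FROM order_payments
    GROUP BY order_id
) AS p ON p.order_id = o.order_id;
"""

# Tiny in-memory demo data: o1 has two reviews and two payments, o3 has neither.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id TEXT PRIMARY KEY, order_status TEXT);
CREATE TABLE order_reviews (order_id TEXT, review_score INTEGER);
CREATE TABLE order_payments (order_id TEXT, payment_value REAL);
INSERT INTO orders VALUES ('o1', 'delivered'), ('o2', 'delivered'), ('o3', 'canceled');
INSERT INTO order_reviews VALUES ('o1', 5), ('o1', 3), ('o2', 1);
INSERT INTO order_payments VALUES ('o1', 50.0), ('o1', 25.5), ('o2', 10.0);
""")
conn.executescript(MASTER_VIEW_SQL)

master_rows = conn.execute(
    "SELECT order_id, order_status, review_score, total_payment "
    "FROM analysis_orders_master ORDER BY order_id"
).fetchall()
for row in master_rows:
    print(row)
```

Keeping `review_score` as NULL for un-reviewed orders, rather than filling it, makes it easy to exclude those orders later when a model target is built.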
### Exploratory Analysis
The exploratory analysis covers the order status distribution, monthly revenue trends, and the identification of high-revenue categories (health & beauty, watches & gifts, and others).
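
These three views can be sketched in pandas as follows. The column names (`order_purchase_timestamp`, `order_status`, `product_category`, `total_payment`) and the four demo rows are assumptions for illustration; the real analysis would read from the cleaned master view.

```python
import pandas as pd

# Hypothetical extract of the master view; values are made up for the demo.
orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-05", "2018-01-20", "2018-02-03", "2018-02-10"]
    ),
    "order_status": ["delivered", "delivered", "canceled", "delivered"],
    "product_category": ["health_beauty", "watches_gifts", "health_beauty", "health_beauty"],
    "total_payment": [100.0, 250.0, 80.0, 120.0],
})

# Share of orders in each status.
status_share = orders["order_status"].value_counts(normalize=True)

# Revenue per calendar month of purchase.
purchase_month = orders["order_purchase_timestamp"].dt.to_period("M")
monthly_revenue = orders.groupby(purchase_month)["total_payment"].sum()

# Categories ranked by revenue, highest first.
category_revenue = (
    orders.groupby("product_category")["total_payment"].sum().sort_values(ascending=False)
)

print(status_share, monthly_revenue, category_revenue, sep="\n\n")
```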
### Tech Stack
- Data storage: SQLite
- Query: SQL
- Processing: Python/Pandas
- Visualization: Matplotlib/Power BI
- Machine learning: Scikit-learn/XGBoost/LightGBM
- Development environment: Jupyter/VS Code

## Core Findings: Delivery Timeliness Dominates Satisfaction

1. **Negative correlation between delivery delay and reviews**: Orders delivered more than 7 days late receive the lowest review scores. The overall delay rate is 8.11%.
2. **Significant regional differences**: Some states have delay rates well above the average, which is likely related to factors such as logistics infrastructure.
3. **Risks in high-revenue categories**: Some high-revenue categories receive lower review scores, which poses a satisfaction risk.
4. **Seller risk assessment**: High-risk sellers are identified by combining revenue, delivery performance, and review scores.
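
The first finding can be reproduced in outline with pandas. This sketch assumes that "delayed" means delivered after the estimated delivery date and that the bucket edges are 0, 3, and 7 days; the project does not spell out either choice, and the five demo rows are invented.

```python
import pandas as pd

# Invented demo rows; the two date columns follow the Olist orders table naming.
df = pd.DataFrame({
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-10", "2018-01-15", "2018-01-17", "2018-01-20", "2018-01-30"]
    ),
    "order_estimated_delivery_date": pd.to_datetime(["2018-01-15"] * 5),
    "review_score": [5, 4, 4, 2, 1],
})

# Positive values mean the order arrived after the estimate.
# .dt.days truncates to whole days, which is exact for date-only values.
df["delay_days"] = (
    df["order_delivered_customer_date"] - df["order_estimated_delivery_date"]
).dt.days
df["is_late"] = df["delay_days"] > 0

# Assumed bucket edges: (-inf, 0], (0, 3], (3, 7], (7, inf).
df["delay_bucket"] = pd.cut(
    df["delay_days"],
    bins=[-float("inf"), 0, 3, 7, float("inf")],
    labels=["on time", "1-3 days late", "4-7 days late", "more than 7 days late"],
)

delay_rate = df["is_late"].mean()
score_by_bucket = df.groupby("delay_bucket", observed=True)["review_score"].mean()

print(f"delay rate: {delay_rate:.2%}")
print(score_by_bucket)
```

The same grouping, applied to a state or seller column instead of `delay_bucket`, would give the regional and seller-level comparisons.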

## Machine Learning Modeling: Predicting Low-Rating Reviews

### Objective
The task is binary classification: low-rating review (1) versus non-low-rating review (0). Orders without a review are excluded.
### Feature Engineering
Columns derived directly from reviews are removed to avoid data leakage. The model is trained on attributes such as delivery and payment.
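
A minimal pandas sketch of the target and the leakage guard is shown below. The cut-off for a "low" rating (score of 2 or less) and the `review_` column prefix are assumptions; the project states only that review-related columns are dropped and un-reviewed orders are excluded.

```python
import pandas as pd

LOW_RATING_MAX = 2  # assumption: scores of 1 or 2 count as low-rating


def make_xy(master: pd.DataFrame):
    """Return (X, y) with un-reviewed orders and review-derived columns removed."""
    reviewed = master.dropna(subset=["review_score"])
    y = (reviewed["review_score"] <= LOW_RATING_MAX).astype(int)
    # Assumption: every review-derived column shares the "review_" prefix.
    leak_columns = [c for c in reviewed.columns if c.startswith("review_")]
    X = reviewed.drop(columns=leak_columns)
    return X, y


# Invented demo rows; the third order has no review.
master = pd.DataFrame({
    "delay_days": [-2, 9, 0, 1],
    "total_payment": [120.0, 45.9, 300.0, 80.0],
    "review_score": [5, 1, None, 3],
    "review_count": [1, 1, 0, 1],
})
X, y = make_xy(master)
print(X.columns.tolist(), y.tolist())
```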
### Model Selection
After testing multiple algorithms, a **threshold-tuned Random Forest** was selected. Its performance:
- Accuracy: 0.8848
- Precision: 0.6456
- Recall: 0.4727
- F1 score: 0.5457
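
Threshold tuning can be sketched with scikit-learn as follows. This is one common way to do it, not necessarily the project's: the synthetic data, the hyperparameters, the threshold grid, and the choice to maximize F1 on a validation split are all assumptions. The idea is to replace the default 0.5 cut-off on `predict_proba` with a threshold chosen on held-out data, then report metrics on a separate test split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: low ratings become more likely as delay grows.
rng = np.random.default_rng(0)
n = 2000
delay_days = rng.normal(-5, 8, n)
total_payment = rng.gamma(2.0, 80.0, n)
p_low = 1 / (1 + np.exp(-(0.25 * delay_days - 1.0)))
y = (rng.random(n) < p_low).astype(int)
X = np.column_stack([delay_days, total_payment])

# 60% train, 20% validation (threshold selection), 20% test (final metrics).
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)
model.fit(X_train, y_train)

# Pick the threshold that maximizes F1 on the validation split.
val_proba = model.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.10, 0.91, 0.05)
val_f1 = [f1_score(y_val, (val_proba >= t).astype(int), zero_division=0) for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(val_f1))])

# Apply the chosen threshold to the untouched test split.
test_pred = (model.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)
metrics = {
    "accuracy": accuracy_score(y_test, test_pred),
    "precision": precision_score(y_test, test_pred, zero_division=0),
    "recall": recall_score(y_test, test_pred, zero_division=0),
    "f1": f1_score(y_test, test_pred, zero_division=0),
}
print(best_threshold, metrics)
```

Because low-rating reviews are the minority class, accuracy alone says little; precision, recall, and F1 on the positive class are the numbers that move when the threshold changes.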
### Feature Importance
The most important features include delay days, whether the order was delivered, and delivery days.
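
A ranking like this can be read from a fitted Random Forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which the target depends only on `delay_days`, so it demonstrates the mechanics and does not reproduce the project's result; the feature names and distributions are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features; delivery_days and total_payment are independent noise here.
rng = np.random.default_rng(42)
n = 1500
features = pd.DataFrame({
    "delay_days": rng.normal(-5, 8, n),
    "delivery_days": rng.normal(12, 5, n),
    "total_payment": rng.gamma(2.0, 80.0, n),
})
# The target is driven by delay_days plus a little noise.
y = ((features["delay_days"] + rng.normal(0, 2, n)) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(features, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importance = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importance)
```

Impurity-based importances can be biased toward features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check.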
### Limitations
This is a post-delivery risk model: its core features, such as delay days, are known only after delivery, so it cannot give early warning. Building a pre-delivery prediction model is the recommended next step.

## Application Value and Business Insights

### Visualization Application
The Power BI dashboard presents key metrics such as total orders, delay rate, and monthly trends to support management decision-making.
### Business Insights
1. Delivery experience is a core competitive advantage.
2. Data analysis surfaces problems that would otherwise stay hidden.
3. Prediction models give operations a reference for action.
4. End-to-end analysis makes the insights more credible.

The project offers a complete reference case for learners of e-commerce data analysis and shows the value of combining technical work with business understanding.
