Zing Forum

Online Retail Customer Churn Prediction: A Complete Solution Based on RFM Feature Engineering and Multi-Model Comparison

The MahletAk/customer-churn-prediction-online-retail project provides a complete online retail customer churn prediction solution. It uses RFM (Recency, Frequency, Monetary) feature engineering methods, combined with multiple machine learning algorithms such as Logistic Regression, Random Forest, XGBoost, and Naive Bayes, to conduct comprehensive model comparison and evaluation on the Online Retail II dataset.

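Tags: customer churn prediction, RFM model, online retail, machine learning, XGBoost, random forest, feature engineering, customer analytics
Published 2026-05-01 07:45 | Recent activity 2026-05-01 09:47 | Estimated read 17 min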

Section 01

Guide to the Complete Online Retail Customer Churn Prediction Solution

This project provides a complete solution for online retail customer churn prediction. It is built on RFM (Recency, Frequency, Monetary) feature engineering and compares several machine learning algorithms, including Logistic Regression, Random Forest, XGBoost, and Naive Bayes, on the Online Retail II dataset. The solution covers the entire process from data preprocessing to model deployment, aiming to help enterprises identify customers at risk of churning, intervene early, and improve retention and profit.

Section 02

Project Background and Business Value

In the highly competitive e-commerce market, customer churn is one of the biggest challenges enterprises face. Widely cited industry estimates put the cost of acquiring a new customer at five or more times the cost of retaining an existing one, and suggest that a 5% increase in customer retention can raise profits by 25% to 95%. Accurately predicting which customers are likely to churn, and intervening before they do, therefore has substantial commercial value for online retailers.

The MahletAk/customer-churn-prediction-online-retail project provides a complete customer churn prediction solution. Based on the classic Online Retail II dataset, it uses RFM feature engineering methods and combines multiple machine learning algorithms to build a reproducible and scalable customer churn prediction system.

Section 03

RFM Feature Engineering Framework and Implementation

RFM Model Principles

The RFM model is one of the classic analysis methods in customer relationship management. It describes customer value along three dimensions:

  • R (Recency): Time since the customer's last purchase. The more recent the purchase, the more active the customer and the lower the churn risk.
  • F (Frequency): Number of purchases by the customer within a specific period. Higher frequency indicates higher customer loyalty.
  • M (Monetary): Total amount spent by the customer within a specific period. A higher amount means greater customer value.

Feature Engineering Implementation

The project extends the three basic RFM variables into a richer feature set:

Basic RFM Features

  • recency_days: Days since last purchase
  • frequency: Total number of historical orders
  • monetary_value: Total historical spend
  • avg_order_value: Average order value
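The basic features above can be sketched with a pandas aggregation. This is a minimal illustration rather than the project's actual code: the column names (`Customer ID`, `Invoice`, `InvoiceDate`, `Quantity`, `Price`) are assumed to follow the Online Retail II layout, and the `basic_rfm` helper and the snapshot date are hypothetical.

```python
import pandas as pd

# Toy line items; column names are assumed to match the Online Retail II layout.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2],
    "Invoice": ["A1", "A1", "A2", "B1", "B2"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10", "2011-02-20"]
    ),
    "Quantity": [2, 1, 3, 1, 4],
    "Price": [5.0, 10.0, 2.0, 20.0, 2.5],
})


def basic_rfm(tx: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """Aggregate line items into one row of basic RFM features per customer."""
    tx = tx.assign(line_total=tx["Quantity"] * tx["Price"])
    rfm = tx.groupby("Customer ID").agg(
        last_purchase=("InvoiceDate", "max"),
        frequency=("Invoice", "nunique"),      # distinct orders, not line items
        monetary_value=("line_total", "sum"),
    )
    rfm["recency_days"] = (snapshot - rfm["last_purchase"]).dt.days
    rfm["avg_order_value"] = rfm["monetary_value"] / rfm["frequency"]
    return rfm.drop(columns="last_purchase")


rfm = basic_rfm(tx, snapshot=pd.Timestamp("2011-03-31"))
print(rfm)
```

Counting distinct invoices rather than rows matters here, because one order usually spans several line items.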

Extended Behavioral Features

  • purchase_velocity: Purchase velocity (number of orders / active days)
  • days_between_purchases: Average interval between purchases
  • unique_products: Number of distinct products purchased
  • return_rate: Return rate
  • peak_activity_hour: Hour of day in which the customer purchases most often
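One possible way to derive these behavioral features is sketched below. Several definitions are assumptions, since the text does not pin them down: "active days" is taken as the span between first and last purchase (inclusive), a return is any invoice whose number starts with "C" (the convention used in Online Retail II), and `behavioral_features` is a hypothetical helper.

```python
import pandas as pd

# Toy data for a single customer; "C102" represents a cancelled/returned invoice.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 1],
    "Invoice": ["100", "101", "C102", "103"],
    "StockCode": ["A", "B", "B", "A"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01 10:00", "2011-01-11 10:30",
        "2011-01-12 15:00", "2011-01-21 10:15",
    ]),
})


def behavioral_features(g: pd.DataFrame) -> pd.Series:
    """Features for one customer; assumes at least one non-return order."""
    is_return = g["Invoice"].astype(str).str.startswith("C")
    purchases = g[~is_return]
    # One calendar date per order, sorted chronologically.
    order_dates = (
        purchases.groupby("Invoice")["InvoiceDate"].min().dt.normalize().sort_values()
    )
    active_days = (order_dates.max() - order_dates.min()).days + 1
    gaps = order_dates.diff().dt.days.dropna()
    return pd.Series({
        "purchase_velocity": len(order_dates) / active_days,
        "days_between_purchases": gaps.mean() if len(gaps) else float("nan"),
        "unique_products": purchases["StockCode"].nunique(),
        "return_rate": g.loc[is_return, "Invoice"].nunique() / g["Invoice"].nunique(),
        "peak_activity_hour": purchases["InvoiceDate"].dt.hour.mode().iloc[0],
    })


features = pd.DataFrame(
    {cid: behavioral_features(g) for cid, g in tx.groupby("Customer ID")}
).T
print(features)
```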

Time Series Features

  • purchase_trend: Recent purchase trend (upward/downward/stable)
  • seasonality_score: Purchase seasonality index
  • weekend_purchase_ratio: Ratio of weekend purchases
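Two of these features can be illustrated with a short sketch. The trend rule used here (compare order counts before and after a split date near the middle of the observation window) is an assumption chosen for simplicity, not the project's documented definition, and `time_features` is a hypothetical helper. The seasonality index is left out because the text does not define it.

```python
import pandas as pd

# Order dates for one customer within a January to June observation window.
order_dates = pd.Series(pd.to_datetime(
    ["2011-01-08", "2011-02-14", "2011-05-07", "2011-06-12", "2011-06-20"]
))


def time_features(dates: pd.Series, split: pd.Timestamp) -> dict:
    """Weekend share of orders plus a coarse trend label around a split date."""
    weekend_ratio = float((dates.dt.dayofweek >= 5).mean())  # Sat=5, Sun=6
    earlier = int((dates < split).sum())
    recent = int((dates >= split).sum())
    if recent > earlier:
        trend = "upward"
    elif recent < earlier:
        trend = "downward"
    else:
        trend = "stable"
    return {"weekend_purchase_ratio": weekend_ratio, "purchase_trend": trend}


result = time_features(order_dates, split=pd.Timestamp("2011-04-01"))
print(result)
```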

Compared with using only the basic RFM features, this multi-dimensional feature set can capture more complex customer behavior patterns, which the project reports improves the model's predictive ability.

Section 04

Dataset Introduction and Preprocessing Process

Data Source and Characteristics

The Online Retail II dataset comes from a UK online retail company. It covers transactions from December 2009 to December 2011 and contains over 1 million records. Its main characteristics are:

  • Real business scenario: Transaction data from an actual e-commerce platform
  • Multi-country customers: Customers from multiple countries and regions
  • Product diversity: Covers a wide range of products
  • Long time span: Two years of complete transaction history

Data Preprocessing Process

The project implements a complete data preprocessing pipeline:

Data Cleaning

  • Missing value handling: Delete records with a missing Customer ID
  • Outlier handling: Identify and process return records and negative-amount transactions
  • Duplicate records: Detect and handle duplicate transactions
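The cleaning steps above might look like the following in pandas. This is a sketch under assumptions: returns are identified by an invoice number starting with "C" or a negative quantity, exact duplicate rows are simply dropped, and zero-price rows are excluded from sales. The project may handle each case differently, and `clean_transactions` is a hypothetical name.

```python
import pandas as pd

raw = pd.DataFrame({
    "Invoice": ["100", "100", "C101", "102", "103"],
    "StockCode": ["A", "A", "A", "B", "C"],
    "Quantity": [2, 2, -1, 1, 3],
    "Price": [5.0, 5.0, 5.0, 0.0, 4.0],
    "Customer ID": [1.0, 1.0, 1.0, 2.0, None],
})


def clean_transactions(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (sales, returns) after basic cleaning."""
    df = raw.dropna(subset=["Customer ID"])   # rows without a customer are unusable
    df = df.drop_duplicates()                 # assumption: exact duplicates are errors
    is_return = df["Invoice"].astype(str).str.startswith("C") | (df["Quantity"] < 0)
    returns = df[is_return]
    sales = df[~is_return & (df["Price"] > 0)]
    return sales, returns


sales, returns = clean_transactions(raw)
print(len(sales), "sales rows,", len(returns), "return rows")
```

Keeping returns in a separate frame, instead of discarding them, leaves them available for features such as return_rate.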

Feature Construction

  • Customer identification: Aggregate transaction records by Customer ID
  • Time window: Define observation and prediction periods
  • Churn label: Define customer churn based on specific rules (e.g., no purchase for 90 days)
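A minimal sketch of the labeling rule, assuming a single cutoff date that separates the observation period from a 90-day prediction period. The `label_churn` helper and the cutoff date are hypothetical; only the 90-day rule comes from the text.

```python
import pandas as pd

tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-05-01", "2011-07-15", "2011-06-10", "2011-04-02", "2011-11-20"]
    ),
})


def label_churn(tx: pd.DataFrame, cutoff: pd.Timestamp, horizon_days: int = 90) -> pd.Series:
    """1 = no purchase within `horizon_days` after the cutoff, 0 = customer returned."""
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)
    observed = tx[tx["InvoiceDate"] < cutoff]
    future = tx[(tx["InvoiceDate"] >= cutoff) & (tx["InvoiceDate"] < horizon_end)]
    # Only customers seen during the observation period receive a label.
    customers = pd.Index(observed["Customer ID"].unique(), name="Customer ID")
    returned = customers.isin(future["Customer ID"])
    return pd.Series((~returned).astype(int), index=customers, name="churned")


labels = label_churn(tx, cutoff=pd.Timestamp("2011-07-01"))
print(labels)
```

Features should be computed only from the observation period, so that nothing from the prediction period leaks into the inputs.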

Data Partitioning

  • Train/test split: Split chronologically to avoid data leakage
  • Class balance: Handle the imbalance between churned (positive) and retained (negative) samples
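Both steps can be sketched briefly. The `snapshot_date` column and the two helpers are hypothetical, and class weighting is only one of several ways to handle imbalance (resampling is another); the text does not say which the project uses. The weight formula, n_samples / (n_classes * class_count), is the same one scikit-learn applies for `class_weight="balanced"`.

```python
import pandas as pd

samples = pd.DataFrame({
    "snapshot_date": pd.to_datetime([
        "2010-12-01", "2011-03-01", "2011-03-01",
        "2011-06-01", "2011-09-01", "2011-09-01",
    ]),
    "recency_days": [12, 40, 95, 20, 33, 120],
    "churned": [0, 0, 1, 0, 0, 1],
})


def temporal_split(df: pd.DataFrame, split_date: pd.Timestamp):
    """Train on earlier snapshots, test on later ones (no shuffling)."""
    train = df[df["snapshot_date"] < split_date]
    test = df[df["snapshot_date"] >= split_date]
    return train, test


def balanced_class_weights(y: pd.Series) -> dict:
    """Weight each class inversely to its frequency."""
    counts = y.value_counts()
    return {int(cls): len(y) / (len(counts) * n) for cls, n in counts.items()}


train, test = temporal_split(samples, pd.Timestamp("2011-07-01"))
weights = balanced_class_weights(train["churned"])
print(weights)
```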

Section 05

Machine Learning Model Comparison and Evaluation System

Model Comparison

The project systematically compares the performance of four mainstream machine learning algorithms in customer churn prediction:

  • Logistic Regression: Baseline model with strong interpretability, high computational efficiency, and probability output
  • Random Forest: Ensemble learning method that provides feature importances, resists overfitting, and captures non-linear relationships
  • XGBoost: Gradient-boosted decision trees with built-in regularization, automatic missing value handling, and parallel computing support
  • Naive Bayes: Based on a conditional independence assumption; trains quickly, uses little memory, and works well with small samples
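A comparison loop of this kind could look like the sketch below. It runs on synthetic data from `make_classification`, not on the real dataset, so the printed scores say nothing about the project's results. Hyperparameters are arbitrary assumptions, a random stratified split is used only because the synthetic data has no time dimension, and XGBoost is imported optionally so the sketch still runs where it is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature matrix (about 30% positives).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Naive Bayes": GaussianNB(),
}
try:  # optional dependency
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(
        n_estimators=200, max_depth=3, learning_rate=0.1, random_state=42
    )
except ImportError:
    pass

auc_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    churn_probability = model.predict_proba(X_test)[:, 1]
    auc_by_model[name] = roc_auc_score(y_test, churn_probability)

for name, auc in sorted(auc_by_model.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} AUC-ROC = {auc:.3f}")
```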

Evaluation System

The project evaluates the models with a broad set of metrics:

Classification Performance Metrics

  • Accuracy: Proportion of correctly predicted samples
  • Precision: Proportion of customers predicted to churn who actually churn
  • Recall: Proportion of actual churners that the model correctly identifies
  • F1-score: Harmonic mean of precision and recall

Ranking Performance Metrics

  • AUC-ROC: Area under the ROC curve, measuring the model's ability to distinguish between positive and negative samples
  • AUC-PR: Area under the precision-recall curve, more informative than AUC-ROC when the classes are imbalanced
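To make the ranking interpretation of AUC-ROC concrete, the sketch below computes it directly as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as one half. The `auc_roc` function is a hypothetical pure-Python helper for illustration; in practice a library routine such as scikit-learn's `roc_auc_score` would be used.

```python
def auc_roc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.7, 0.6, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_roc(scores, labels)
print(round(auc, 4))  # 5 of the 6 pairs are ordered correctly
```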

Comprehensive Performance Metrics

  • MCC (Matthews Correlation Coefficient): Correlation coefficient that takes all four cells of the confusion matrix (TP, TN, FP, FN) into account
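The threshold-based metrics listed above all derive from the four confusion-matrix counts. The sketch below shows the standard formulas on made-up counts; `classification_metrics` is a hypothetical helper and the numbers are not project results.

```python
import math


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, F1, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Illustrative counts: 50 actual churners among 200 customers.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name:9s} {value:.3f}")
```

With these counts accuracy is 0.85 while recall is only 0.60, which shows why accuracy alone can look flattering on imbalanced churn data.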

Business Value Metrics

  • Cost-benefit analysis: Marketing cost and revenue at different thresholds
  • Customer Lifetime Value (CLV): Customer value evaluation combined with churn prediction

Section 06

Experimental Results and Key Insights

Model Performance Comparison

Experimental results show that XGBoost and Random Forest perform best in this task, significantly outperforming Logistic Regression and Naive Bayes:

  • XGBoost: Achieves the best AUC and F1-score
  • Random Forest: Excels in precision, suitable for scenarios sensitive to false positives
  • Logistic Regression: Robust as a baseline model with strong interpretability
  • Naive Bayes: Fastest to train but with relatively low prediction accuracy

Feature Importance Analysis

Through feature importance analysis of Random Forest and XGBoost, the most influential predictors are:

  1. recency_days: Time since the last purchase is the strongest indicator of churn
  2. frequency: Purchase frequency reflects customer loyalty
  3. days_between_purchases: Changes in purchase interval indicate churn risk
  4. monetary_value: Total spend is related to customer value
  5. purchase_trend: Changes in purchase trend are early warning signals
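For readers unfamiliar with how such a ranking is produced, the sketch below reads impurity-based importances from a fitted scikit-learn Random Forest. The data is synthetic and the feature names are deliberately neutral placeholders: the label depends only on the first column by construction, so the ordering printed here is not evidence for the ranking above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["signal", "noise_1", "noise_2"]  # placeholder names
X = rng.normal(size=(500, 3))
# Synthetic label driven only by the first column.
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1 across features.
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name:8s} {importance:.3f}")
```

Impurity-based importances can favor high-cardinality features, so permutation importance is a common cross-check.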

Threshold Optimization

The project explores the impact of different classification thresholds on business results:

  • High threshold strategy: Prioritize precision (fewer false positives), suitable for scenarios with a limited marketing budget
  • Low threshold strategy: Maximize coverage of potential churners (higher recall), suitable for scenarios with a sufficient retention budget
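The trade-off between the two strategies can be made concrete with a simple expected-value calculation. Everything in this sketch is an illustrative assumption: the `net_benefit` helper, the contact cost, the value of a retained customer, and the share of contacted churners who are actually saved.

```python
def net_benefit(scores, churned, threshold, contact_cost, retained_value, save_rate):
    """Expected campaign profit when contacting every customer scored >= threshold."""
    targeted = [label for score, label in zip(scores, churned) if score >= threshold]
    expected_saves = sum(targeted) * save_rate  # only true churners can be "saved"
    return expected_saves * retained_value - len(targeted) * contact_cost


scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]   # predicted churn probabilities
churned = [1, 1, 0, 1, 0, 0]                    # actual outcomes

benefit_by_threshold = {
    t: net_benefit(scores, churned, t,
                   contact_cost=10.0, retained_value=100.0, save_rate=0.3)
    for t in (0.8, 0.5, 0.3)
}
best_threshold = max(benefit_by_threshold, key=benefit_by_threshold.get)

for t, benefit in benefit_by_threshold.items():
    print(f"threshold {t:.1f}: expected net benefit {benefit:.1f}")
print("best threshold:", best_threshold)
```

In this toy example the middle threshold wins: the high threshold misses a churner worth contacting, while the low one spends budget on a customer who was not going to leave.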

Section 07

Practical Application Deployment and Intervention Strategy Recommendations

Model Deployment Architecture

The project provides a reference architecture for model deployment:

Batch Prediction Mode

  • Regular operation: Update customer churn risk scores in daily or weekly batches
  • Data warehouse integration: Obtain the latest transaction data from the data warehouse
  • Result storage: Write prediction results into the customer relationship management system

Real-time Prediction Mode

  • API service: Encapsulate the model as a RESTful API
  • Stream processing: Update risk scores based on real-time transaction events
  • Trigger mechanism: Initiate the intervention process when the risk score exceeds a threshold

Intervention Strategy Recommendations

Based on the predicted churn probability, interventions can be tiered by risk level:

High-risk customers (churn probability >80%)

  • Dedicated customer service: Manual outbound calls to understand churn reasons
  • Customized offers: Provide personalized coupons or discounts
  • Product recommendations: Recommend related products based on historical purchases

Medium-risk customers (churn probability 50%-80%)

  • Email marketing: Send personalized marketing emails
  • Point incentives: Offer double-points promotions
  • Usage guidance: Send product usage tips and highlight product value

Low-risk customers (churn probability <50%)

  • Regular maintenance: Maintain normal customer relationship management
  • Value enhancement: Recommend high-value products or service upgrades
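The tiering above maps directly to a small lookup. In this sketch the boundary handling (exactly 80% and exactly 50% both fall into the medium tier) follows the ranges as written, and the `risk_tier` function and `INTERVENTIONS` table are hypothetical names.

```python
INTERVENTIONS = {
    "high": ["dedicated customer service", "customized offers", "product recommendations"],
    "medium": ["email marketing", "point incentives", "usage guidance"],
    "low": ["regular maintenance", "value enhancement"],
}


def risk_tier(churn_probability: float) -> str:
    """Map a predicted churn probability to the risk tier described above."""
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    if churn_probability > 0.8:
        return "high"
    if churn_probability >= 0.5:
        return "medium"
    return "low"


for p in (0.92, 0.65, 0.10):
    tier = risk_tier(p)
    print(f"{p:.2f} -> {tier}: {', '.join(INTERVENTIONS[tier])}")
```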

Continuous Optimization Mechanism

  • Regular retraining: Retrain the model with the latest data monthly or quarterly
  • A/B testing: Compare the effects of different intervention strategies
  • Feedback loop: Collect intervention results to optimize the prediction model

Section 08

Project Technical Highlights and Summary Insights

Project Technical Highlights

  • Code organization: Clear structure (data/features/models/notebooks/utils)
  • Reproducibility guarantee: Fixed random seeds, dependency management (requirements.txt), configuration separation
  • Comprehensive documentation: README.md, data dictionary, experiment reports

Summary Insights

The project provides a complete customer churn prediction solution from data preprocessing to model deployment. Its core contributions include:

  1. Deep application of RFM feature engineering: Combines a classic framework with modern machine learning
  2. Systematic multi-model comparison: Provides an empirical basis for model selection
  3. Comprehensive evaluation system: Evaluates along two dimensions, technical metrics and business value
  4. Practical deployment recommendations: Ties the technical solution closely to business scenarios

For data scientists and business analysts, this project is an excellent learning resource and practical reference, demonstrating the complete thinking process from business problems to data science solutions.