Online Retail Customer Churn Prediction: A Complete Solution Based on RFM Feature Engineering and Multi-Model Comparison

The MahletAk/customer-churn-prediction-online-retail project provides a complete online retail customer churn prediction solution. It uses RFM (Recency, Frequency, Monetary) feature engineering methods, combined with multiple machine learning algorithms such as Logistic Regression, Random Forest, XGBoost, and Naive Bayes, to conduct comprehensive model comparison and evaluation on the Online Retail II dataset.

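Tags: customer churn prediction, RFM model, online retail, machine learning, XGBoost, random forest, feature engineering, customer analytics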

Published 2026-05-01 07:45 | Recent activity 2026-05-01 09:47 | Estimated read 17 min

Section 01

Guide to the Complete Online Retail Customer Churn Prediction Solution

This project provides a complete solution for online retail customer churn prediction. It is built on RFM (Recency, Frequency, Monetary) feature engineering and compares several machine learning algorithms (Logistic Regression, Random Forest, XGBoost, and Naive Bayes) on the Online Retail II dataset. The solution covers the entire process from data preprocessing to model deployment, and aims to help enterprises predict customer churn risk accurately, intervene early, and improve customer retention and profits.

Section 02

Project Background and Business Value

In the highly competitive e-commerce market, customer churn is one of the biggest challenges enterprises face. Widely cited industry estimates put the cost of acquiring a new customer at more than five times that of retaining an existing one, and suggest that a 5% increase in customer retention can raise profits by 25% to 95%. Accurately predicting which customers are likely to churn, and intervening before they do, therefore has considerable commercial value for online retailers.

The MahletAk/customer-churn-prediction-online-retail project provides a complete customer churn prediction solution. Based on the classic Online Retail II dataset, it uses RFM feature engineering methods and combines multiple machine learning algorithms to build a reproducible and scalable customer churn prediction system.

Section 03

RFM Feature Engineering Framework and Implementation

RFM Model Principles

The RFM model is one of the most established analysis methods in customer relationship management. It describes customer value along three dimensions:

R (Recency): Time since the customer's last purchase. The shorter the time, the more active the customer and the lower the churn risk.
F (Frequency): Number of purchases by the customer within a specific period. Higher frequency indicates higher customer loyalty.
M (Monetary): Total amount spent by the customer within a specific period. A higher amount means greater customer value.

Feature Engineering Implementation

The project extends the basic RFM features into a richer feature set:

Basic RFM Features

recency_days: Days since last purchase
frequency: Total number of historical orders
monetary_value: Total historical spend
avg_order_value: Average order value
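As a rough illustration of how the four basic features could be derived, the sketch below aggregates line-level transactions into one record per customer. The row layout, the `basic_rfm` helper, and the snapshot date are assumptions made for this example, not the project's actual code; in practice the same aggregation is usually written as a pandas group-by.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows: (customer_id, invoice_id, invoice_date, line_amount)
transactions = [
    ("C1", "INV1", date(2011, 1, 10), 20.0),
    ("C1", "INV1", date(2011, 1, 10), 30.0),
    ("C1", "INV2", date(2011, 3, 1), 50.0),
    ("C2", "INV3", date(2011, 2, 15), 80.0),
]


def basic_rfm(rows, snapshot_date):
    """Aggregate line-level rows into one RFM record per customer."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for customer, invoice, day, amount in rows:
        last_purchase[customer] = max(last_purchase.get(customer, day), day)
        invoices[customer].add(invoice)  # one invoice counts as one order
        spend[customer] += amount

    features = {}
    for customer, last_day in last_purchase.items():
        frequency = len(invoices[customer])
        monetary = spend[customer]
        features[customer] = {
            "recency_days": (snapshot_date - last_day).days,
            "frequency": frequency,
            "monetary_value": monetary,
            "avg_order_value": monetary / frequency,
        }
    return features


rfm = basic_rfm(transactions, snapshot_date=date(2011, 3, 31))
print(rfm["C1"])
```

Recency is measured against a fixed snapshot date rather than "today", so the features stay reproducible when the script is rerun later.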

Extended Behavioral Features

purchase_velocity: Purchase velocity (number of orders / active days)
days_between_purchases: Average interval between purchases
unique_products: Number of distinct products purchased
return_rate: Return rate
peak_activity_hour: Most active purchase time slot

Time Series Features

purchase_trend: Recent purchase trend (upward/downward/stable)
seasonality_score: Purchase seasonality index
weekend_purchase_ratio: Ratio of weekend purchases
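Several of the extended and time-based features reduce to simple arithmetic on a customer's order dates. The sketch below shows one plausible way to compute three of them; the `behavior_features` helper is hypothetical, and the definition of "active days" as the inclusive span between first and last order is an assumption, since the source does not spell it out.

```python
from datetime import date


def behavior_features(order_dates):
    """Derive interval and weekend features from one customer's order dates.

    Expects at least one order date.
    """
    days = sorted(order_dates)
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    # Assumption: "active days" is the span from first to last order, inclusive.
    active_days = (days[-1] - days[0]).days + 1
    weekend_orders = sum(1 for d in days if d.weekday() >= 5)  # 5 = Sat, 6 = Sun
    return {
        "purchase_velocity": len(days) / active_days,
        # Undefined for single-order customers, so None is returned in that case.
        "days_between_purchases": sum(gaps) / len(gaps) if gaps else None,
        "weekend_purchase_ratio": weekend_orders / len(days),
    }


orders = [date(2011, 1, 1), date(2011, 1, 11), date(2011, 1, 31)]
features = behavior_features(orders)
print(features)
```

Single-order customers have no purchase interval, so a real pipeline would need an explicit rule for them (for example imputation or a separate flag) before the feature reaches a model.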

Compared with using only the basic RFM features, this multi-dimensional feature set can capture more complex customer behavior patterns, which according to the project improves the model's predictive ability.

Section 04

Dataset Introduction and Preprocessing Process

Data Source and Characteristics

The Online Retail II dataset comes from a UK-based online retailer. It spans December 2009 to December 2011 and contains over 1 million transaction records. Its main characteristics are:

Real business scenario: Transaction data from an actual e-commerce platform
Multi-country customers: Customers from multiple countries and regions
Product diversity: Covers a wide range of products
Long time span: Two years of transaction history

Data Preprocessing Process

The project implements a complete data preprocessing pipeline:

Data Cleaning

Missing value handling: Delete records with missing Customer ID
Outlier handling: Identify and process return records and negative-amount transactions
Duplicate records: Detect and handle duplicate transactions
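The three cleaning steps above can be sketched in a single pass over the rows. The dictionary keys and the `clean_transactions` helper are hypothetical names chosen for this example. The return rule (negative quantity, or an invoice number starting with "C", which the dataset uses to mark cancellations) is one reasonable reading, not necessarily the rule the project applies.

```python
def clean_transactions(rows):
    """Drop rows without a customer ID, remove exact duplicates, split out returns."""
    seen = set()
    sales, returns = [], []
    for row in rows:
        if not row.get("customer_id"):
            continue  # cannot be attributed to any customer
        key = (row["invoice"], row["stock_code"], row["quantity"],
               row["price"], row["customer_id"])
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        # Assumption: negative quantity or a "C"-prefixed invoice marks a return.
        is_return = row["quantity"] < 0 or str(row["invoice"]).startswith("C")
        (returns if is_return else sales).append(row)
    return sales, returns


raw_rows = [
    {"invoice": "489434", "stock_code": "85048", "quantity": 12, "price": 6.95, "customer_id": "13085"},
    {"invoice": "489434", "stock_code": "85048", "quantity": 12, "price": 6.95, "customer_id": "13085"},
    {"invoice": "C489449", "stock_code": "22087", "quantity": -12, "price": 2.95, "customer_id": "16321"},
    {"invoice": "489500", "stock_code": "21232", "quantity": 3, "price": 1.25, "customer_id": None},
]
sales, returns = clean_transactions(raw_rows)
print(len(sales), len(returns))
```

Keeping returns in a separate list rather than discarding them leaves the data available for features such as return_rate.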

Feature Construction

Customer identification: Aggregate transaction records by Customer ID
Time window: Define observation and prediction periods
Churn label: Define churn by an explicit rule (e.g., no purchase for 90 days)
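To make the observation/prediction split concrete, the sketch below labels a customer as churned when no purchase falls in the 90 days after a cutoff date. The `label_churn` helper, the cutoff date, and the choice to skip customers with no history before the cutoff are assumptions for illustration; the 90-day rule is the example given above.

```python
from datetime import date, timedelta


def label_churn(purchase_dates, cutoff, horizon_days=90):
    """Return 1 (churned), 0 (retained), or None (no history before the cutoff)."""
    window_end = cutoff + timedelta(days=horizon_days)
    observed = [d for d in purchase_dates if d <= cutoff]
    if not observed:
        return None  # not a customer during the observation period
    future = [d for d in purchase_dates if cutoff < d <= window_end]
    return 0 if future else 1


cutoff = date(2011, 6, 30)
labels = {
    "returns_in_window": label_churn([date(2011, 5, 1), date(2011, 8, 15)], cutoff),
    "returns_too_late": label_churn([date(2011, 5, 1), date(2011, 10, 15)], cutoff),
    "new_after_cutoff": label_churn([date(2011, 8, 1)], cutoff),
}
print(labels)
```

Features should be computed only from purchases on or before the cutoff, so that nothing from the prediction window leaks into the inputs.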

Data Partitioning

Train/test split: Split chronologically to avoid data leakage
Class balance: Handle the imbalance between churned (positive) and retained (negative) samples
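A minimal sketch of both partitioning steps follows. The sample layout and helper names are hypothetical. The source does not say how the project handles imbalance, so the "balanced" class-weight heuristic shown here, n_samples / (n_classes * class_count), is only one common option alongside resampling.

```python
from collections import Counter


def chronological_split(samples, train_fraction=0.75):
    """Sort samples by cutoff date and hold out the most recent slice for testing."""
    ordered = sorted(samples, key=lambda s: s["cutoff"])
    split_at = int(len(ordered) * train_fraction)
    return ordered[:split_at], ordered[split_at:]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Hypothetical samples; ISO date strings sort chronologically.
samples = [
    {"customer": "C3", "cutoff": "2011-03-31", "churned": 0},
    {"customer": "C1", "cutoff": "2010-06-30", "churned": 0},
    {"customer": "C4", "cutoff": "2011-06-30", "churned": 1},
    {"customer": "C2", "cutoff": "2010-12-31", "churned": 0},
]
train, test = chronological_split(samples)
weights = balanced_class_weights([s["churned"] for s in samples])
print([s["customer"] for s in train], [s["customer"] for s in test], weights)
```

Because the test slice is strictly later than the training slice, the evaluation mimics how the model would be used in production: trained on the past, scored on the future.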

Section 05

Machine Learning Model Comparison and Evaluation System

Model Comparison

The project systematically compares the performance of four mainstream machine learning algorithms in customer churn prediction:

Logistic Regression: Baseline model with strong interpretability, high computational efficiency, and probability output
Random Forest: Ensemble learning method that can calculate feature importance, resist overfitting, and capture non-linear relationships
XGBoost: Gradient boosting decision tree with built-in regularization, automatic missing value handling, and parallel computing support
Naive Bayes: Based on the conditional independence assumption; fast to train, low in memory usage, and well suited to small samples
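A comparison of this kind can be sketched with scikit-learn as below. This is not the project's experiment: the data is a synthetic stand-in generated with `make_classification`, the hyperparameters are arbitrary, a random stratified split is used for brevity where the project splits chronologically, and XGBoost is treated as an optional dependency that is skipped when it is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the customer feature table (not the project's data).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Naive Bayes": GaussianNB(),
}
try:  # XGBoost is optional in this sketch
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(n_estimators=200, random_state=42)
except ImportError:
    pass

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    churn_probability = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "auc_roc": roc_auc_score(y_test, churn_probability),
        "f1": f1_score(y_test, model.predict(X_test)),
    }

# Rank the models by AUC-ROC, best first.
for name, scores in sorted(results.items(), key=lambda kv: -kv[1]["auc_roc"]):
    print(f"{name:20s} AUC-ROC={scores['auc_roc']:.3f}  F1={scores['f1']:.3f}")
```

Fitting every model on the same split with the same metrics is what makes the comparison fair; the ranking on synthetic data says nothing about the ranking on Online Retail II.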

Evaluation System

The project evaluates the models with several groups of metrics:

Classification Performance Metrics

Accuracy: Proportion of correctly predicted samples
Precision: Proportion of actual churn customers among those predicted to churn
Recall: Proportion of actual churn customers that are correctly predicted
F1-score: Harmonic mean of precision and recall

Ranking Performance Metrics

AUC-ROC: Area under the ROC curve, measuring the model's ability to distinguish between positive and negative samples
AUC-PR: Area under the precision-recall curve; more informative than AUC-ROC when the classes are imbalanced
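AUC-ROC has a useful ranking interpretation: it equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it directly from that definition on made-up scores. The `auc_roc` helper is illustrative and quadratic in the number of samples, so library implementations are preferable on real data.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = churned) and model scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2]
auc = auc_roc(labels, scores)
print(round(auc, 4))
```

Here five of the six positive/negative pairs are ordered correctly, which gives 5/6. Because the metric depends only on ranking, it is unaffected by the choice of classification threshold.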

Comprehensive Performance Metrics

MCC (Matthews Correlation Coefficient): Correlation between predicted and actual labels, computed from all four cells of the confusion matrix
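The threshold-based metrics listed in this section all follow from the four confusion-matrix counts. The sketch below applies the standard formulas to made-up counts; the `classification_metrics` helper and the numbers are illustrative only, and the zero-denominator fallbacks are a convention chosen for this example.

```python
import math


def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from confusion-matrix counts (churn is the positive class)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Made-up counts: 60 actual churners, 140 retained customers.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.85 while recall is only about 0.67, which shows why accuracy alone can look reassuring on imbalanced churn data.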

Business Value Metrics

Cost-benefit analysis: Marketing cost and revenue at different thresholds
Customer Lifetime Value (CLV): Customer value evaluation combined with churn prediction

Section 06

Experimental Results and Key Insights

Model Performance Comparison

Experimental results show that XGBoost and Random Forest perform best in this task, significantly outperforming Logistic Regression and Naive Bayes:

XGBoost: Achieves the best AUC and F1-score
Random Forest: Excels in precision, suitable for scenarios sensitive to false positives
Logistic Regression: Robust as a baseline model with strong interpretability
Naive Bayes: Fastest to train, but with relatively low prediction accuracy

Feature Importance Analysis

Feature importance analysis of the Random Forest and XGBoost models identifies the most influential predictors:

recency_days: Time since the last purchase is the strongest indicator of churn
frequency: Purchase frequency reflects customer loyalty
days_between_purchases: Changes in purchase interval indicate churn risk
monetary_value: Total spend is related to customer value
purchase_trend: Changes in purchase trend are early warning signals

Threshold Optimization

The project explores the impact of different classification thresholds on business results:

High-threshold strategy: Prioritizes precision, so fewer non-churners are contacted; suited to a limited marketing budget
Low-threshold strategy: Prioritizes recall, covering as many potential churners as possible; suited to a generous retention budget
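The trade-off between the two strategies can be made explicit by attaching costs and benefits to each decision and sweeping the threshold. In the sketch below every number (contact cost, value of a retained customer, intervention success rate) is invented for illustration, and `expected_profit` and `best_threshold` are hypothetical helpers rather than part of the project.

```python
def expected_profit(labels, scores, threshold, contact_cost, retained_value, success_rate):
    """Net value of contacting every customer whose score meets the threshold."""
    contacted = [y for y, s in zip(labels, scores) if s >= threshold]
    true_churners = sum(contacted)  # labels are 1 for churners, 0 otherwise
    benefit = true_churners * success_rate * retained_value
    return benefit - len(contacted) * contact_cost


def best_threshold(labels, scores, thresholds, **economics):
    """Pick the candidate threshold with the highest expected profit."""
    return max(thresholds,
               key=lambda t: expected_profit(labels, scores, t, **economics))


# Made-up validation labels (1 = churned) and model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
economics = {"contact_cost": 5.0, "retained_value": 50.0, "success_rate": 0.3}

chosen = best_threshold(labels, scores, [0.3, 0.5, 0.75], **economics)
profit = expected_profit(labels, scores, chosen, **economics)
print(chosen, round(profit, 2))
```

In this toy case the middle threshold wins: the strict threshold misses a churner worth saving, while the loose one spends contact budget on a customer who was never at risk.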

Section 07

Practical Application Deployment and Intervention Strategy Recommendations

Model Deployment Architecture

The project provides a reference architecture for model deployment:

Batch Prediction Mode

Regular operation: Update customer churn risk scores in batches daily/weekly
Data warehouse integration: Obtain the latest transaction data from the data warehouse
Result storage: Write prediction results into the customer relationship management system

Real-time Prediction Mode

API service: Encapsulate the model as a RESTful API
Stream processing: Update risk scores based on real-time transaction events
Trigger mechanism: Initiate the intervention process when the risk score exceeds a threshold

Intervention Strategy Recommendations

Based on churn prediction results, adopt stratified intervention strategies:

High-risk customers (churn probability >80%)

Dedicated customer service: Manual outbound calls to understand churn reasons
Customized offers: Provide personalized coupons or discounts
Product recommendations: Recommend related products based on historical purchases

Medium-risk customers (churn probability 50%-80%)

Email marketing: Send personalized marketing emails
Point incentives: Offer double-points promotions
Usage guidance: Send product usage tips that highlight product value

Low-risk customers (churn probability <50%)

Regular maintenance: Maintain normal customer relationship management
Value enhancement: Recommend high-value products or service upgrades
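The three tiers translate directly into a small lookup. The sketch below uses the probability bands and actions listed above; the function and dictionary names are hypothetical, and placing a probability of exactly 80% in the medium tier follows from reading the high tier as strictly greater than 80%.

```python
INTERVENTIONS = {
    "high": ["dedicated customer service call", "customized offer", "product recommendations"],
    "medium": ["personalized email", "double-points promotion", "usage guidance"],
    "low": ["regular maintenance", "value enhancement recommendations"],
}


def risk_tier(churn_probability):
    """Map a churn probability to the tier used for intervention planning."""
    if churn_probability > 0.8:
        return "high"
    if churn_probability >= 0.5:
        return "medium"
    return "low"


def plan_for(churn_probability):
    """Return the tier and the actions suggested for it."""
    tier = risk_tier(churn_probability)
    return tier, INTERVENTIONS[tier]


for probability in (0.92, 0.65, 0.10):
    print(probability, plan_for(probability))
```

Keeping the band boundaries in one function makes it easy to adjust them later, for example after A/B tests show that a different cut-off is more cost-effective.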

Continuous Optimization Mechanism

Regular retraining: Retrain the model with the latest data monthly/quarterly
A/B testing: Compare the effects of different intervention strategies
Feedback loop: Collect intervention results to optimize the prediction model

Section 08

Project Technical Highlights and Summary Insights

Project Technical Highlights

Code organization: Clear structure (data/features/models/notebooks/utils)
Reproducibility guarantee: Fixed random seeds, dependency management (requirements.txt), configuration separation
Comprehensive documentation: README.md, data dictionary, experiment reports

Summary Insights

The project provides a complete customer churn prediction solution from data preprocessing to model deployment. Its core contributions include:

In-depth application of RFM feature engineering: Combines a classic framework with modern machine learning
Systematic multi-model comparison: Provides an empirical basis for model selection
Comprehensive evaluation system: Evaluates models on both technical metrics and business value
Practical deployment recommendations: Ties the technical solution closely to business scenarios

For data scientists and business analysts, this project is an excellent learning resource and practical reference, demonstrating the complete thinking process from business problems to data science solutions.