Zing Forum

Machine Learning Practice for Credit Card Fraud Detection: From Data Preprocessing to XGBoost Model Deployment

This article provides an in-depth analysis of machine learning-based credit card fraud detection systems, covering the complete implementation process of data preprocessing, class imbalance handling (SMOTE), and XGBoost models.

Tags: Credit Card Fraud Detection · Machine Learning · XGBoost · SMOTE · Class Imbalance · Feature Engineering · Financial Risk Control · Model Interpretation · SHAP · Production Deployment
Published 2026-05-01 08:45 · Recent activity 2026-05-01 09:55 · Estimated read 7 min

Section 01

Machine Learning Practice for Credit Card Fraud Detection: Guide to Core Processes and Key Technologies

This article focuses on machine learning-based credit card fraud detection systems, covering the complete process including data preprocessing, class imbalance handling (SMOTE), XGBoost model training and tuning, model interpretation (SHAP), and production deployment. It aims to provide practical guidance for building efficient anti-fraud systems.

Section 02

Problem Background: Severe Challenges and Unique Difficulties of Financial Fraud

Credit card fraud is a serious problem for the financial industry, with global annual losses reaching tens of billions of US dollars. Traditional rule-based systems struggle to keep up with increasingly sophisticated fraud methods, making machine learning a powerful tool for fraud prevention. However, it faces four major challenges: extreme class imbalance (the ratio of normal to fraudulent transactions can reach 1000:1), rapid evolution of fraud patterns, real-time requirements (millisecond-level decisions), and high false-positive costs (which hurt customer experience and business efficiency).

Section 03

Data Preprocessing and Feature Engineering: Building High-Quality Training Sets

Data preprocessing includes missing-value handling (median for numerical features, mode or "unknown" for categorical features) and outlier triage (distinguishing genuine fraud signals from data errors). Feature engineering mines fraud signals from four groups: time features (transaction hour/day of week, interval since the last transaction, frequency by time period), amount features (the amount itself, ratio to the historical average or credit limit), behavioral features (historical frequency of merchant categories, geographic anomalies, channel changes), and aggregated features (sliding-window statistics on transaction count, amount sum/mean/std, and merchant-category distribution).
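The feature families above can be sketched with pandas on a toy transaction log. Column names such as `card_id`, `ts`, and `amount` are illustrative assumptions, not a real schema:

```python
import pandas as pd

# Toy transaction log; card_id/ts/amount are hypothetical column names.
df = pd.DataFrame({
    "card_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime([
        "2026-04-01 09:15", "2026-04-01 21:40", "2026-04-02 02:05",
        "2026-04-01 10:00", "2026-04-03 23:30",
    ]),
    "amount": [25.0, 310.0, 980.0, 40.0, 1500.0],
})
df = df.sort_values(["card_id", "ts"])

# Time features: hour of day, day of week, seconds since the card's last transaction
df["hour"] = df["ts"].dt.hour
df["dow"] = df["ts"].dt.dayofweek
df["gap_s"] = df.groupby("card_id")["ts"].diff().dt.total_seconds()

# Amount feature: ratio to the card's running historical mean (prior rows only,
# via shift(1), to avoid leaking the current transaction into its own feature)
hist_mean = (df.groupby("card_id")["amount"]
               .transform(lambda s: s.expanding().mean().shift(1)))
df["amt_ratio"] = df["amount"] / hist_mean

# Aggregated feature: count of the card's transactions in the trailing 24h window
df["cnt_24h"] = (df.set_index("ts").groupby("card_id")["amount"]
                   .transform(lambda s: s.rolling("24h").count()).values)
```

In production these aggregates would be precomputed and served from a feature store rather than recomputed per request.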

Section 04

Class Imbalance Handling: SMOTE Algorithm and Its Variants

Fraudulent transactions account for only 0.1%-1% of total transactions, and traditional remedies all have limitations: undersampling discards information, naive oversampling easily overfits, and threshold adjustment alone does not change what the model learns. SMOTE instead synthesizes minority-class samples in feature space: for each minority sample, find its k nearest neighbors, randomly select one neighbor, and generate a new sample along the line between them (new sample = original sample + rand(0,1) * (neighbor - original sample)). Variants include Borderline-SMOTE (oversamples near the class boundary), ADASYN (adaptive sampling density), and SMOTEENN/SMOTETomek (combined with data cleaning).
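The interpolation formula can be illustrated with a minimal NumPy implementation. This is a sketch for intuition only; in practice imbalanced-learn's `SMOTE` handles edge cases and integrates with sklearn pipelines:

```python
import numpy as np

def smote(X, n_new, k=3, seed=None):
    """Generate n_new synthetic minority samples via the SMOTE formula
    new = x + rand(0,1) * (neighbor - x). Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    # Pairwise distances among minority samples
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self from neighbors
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbors per sample
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))                # pick a random minority sample
        j = nn[i, rng.integers(k)]              # pick one of its k neighbors
        lam = rng.random()                      # rand(0, 1)
        out.append(X[i] + lam * (X[j] - X[i]))  # interpolate along the segment
    return np.array(out)

# Four minority samples at the corners of the unit square
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_syn = smote(X_min, n_new=6, k=2, seed=42)
```

Because each synthetic point lies on a segment between two real minority points, SMOTE never leaves the convex hull of the minority class, which is exactly why boundary-focused variants like Borderline-SMOTE exist.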

Section 05

XGBoost Model: Reasons for Selection and Tuning Strategies

Why XGBoost: engineering strengths (fast parallel training, distributed support, memory optimization), algorithmic features (built-in regularization against overfitting, automatic missing-value handling, cross-validation and early stopping), and interpretability (feature importance, SHAP values). Tuning strategies: set the scale_pos_weight parameter (number of negative samples / number of positive samples), use a custom F-beta evaluation metric to emphasize recall, and optimize the decision threshold to balance precision and recall.
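Two of these tuning steps can be sketched without training a model: computing `scale_pos_weight` and sweeping decision thresholds for F2. The score distribution below is a made-up stand-in for real XGBoost outputs:

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Synthetic imbalanced labels and model scores (illustrative stand-ins)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.005).astype(int)   # ~0.5% positives (fraud)
scores = np.clip(0.7 * y + 0.15 * rng.standard_normal(10_000) + 0.1, 0, 1)

# scale_pos_weight = negatives / positives; passed to XGBClassifier(...)
spw = (y == 0).sum() / (y == 1).sum()

# Threshold optimization: sweep candidate thresholds, maximize F2
# (beta=2 weights recall twice as heavily as precision)
thresholds = np.linspace(0.05, 0.95, 19)
f2 = [fbeta_score(y, scores >= t, beta=2) for t in thresholds]
best_t = thresholds[int(np.argmax(f2))]
```

The chosen threshold is a business decision as much as a statistical one; F-beta simply makes the recall/precision trade-off explicit.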

Section 06

Complete Pipeline Implementation and Model Evaluation

Data flow architecture: raw data → cleaning → feature engineering → splitting → SMOTE → XGBoost training → evaluation → deployment. Key code includes data preprocessing (standardization, time conversion, splitting), SMOTE processing, XGBoost training (parameter setting, early stopping), and evaluation (classification report, ROC-AUC, confusion matrix). Model interpretation uses SHAP values: global feature importance (e.g., transaction amount, time features) and individual prediction explanations.
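The data flow can be sketched end to end with scikit-learn alone. To keep the example dependency-light, `GradientBoostingClassifier` stands in for `XGBClassifier` and simple duplication of minority rows stands in for SMOTE:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced dataset standing in for engineered transaction features
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.98],
                           random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

# Standardize: fit on training data only, then apply to the test split
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Resampling step (SMOTE in the article's pipeline; plain duplication here).
# Crucially, this happens AFTER the split, so no synthetic leakage into test.
idx = np.where(y_tr == 1)[0]
extra = np.random.default_rng(0).choice(idx, size=5 * len(idx))
X_tr = np.vstack([X_tr, X_tr[extra]])
y_tr = np.concatenate([y_tr, y_tr[extra]])

# Train (GradientBoostingClassifier as a stand-in for XGBClassifier)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Evaluate: classification report, ROC-AUC, confusion matrix
proba = clf.predict_proba(X_te)[:, 1]
print(classification_report(y_te, proba >= 0.5, digits=3))
print("ROC-AUC:", round(roc_auc_score(y_te, proba), 3))
print(confusion_matrix(y_te, proba >= 0.5))
```

Swapping in the real pieces means replacing the duplication block with imbalanced-learn's `SMOTE().fit_resample(X_tr, y_tr)` and the classifier with `xgboost.XGBClassifier` plus early stopping on a validation split.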

Section 07

Production Deployment and Monitoring Maintenance

Real-time inference architecture: model serialization (save_model/load_model), ONNX conversion, Triton inference server; feature storage (Redis in-memory database, precomputed aggregated features, version management); A/B testing (shadow testing, gradual rollout, rollback mechanism). Monitoring covers model performance (KS statistic, AUC, prediction drift), feature monitoring (PSI, the Population Stability Index; correlation changes; data quality), and business metrics (fraud capture rate, false positive rate, customer complaint rate, manual review volume).
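PSI, the core feature-drift metric mentioned above, is straightforward to compute directly. A minimal sketch follows; the 10-bin quantile scheme and the 0.1/0.25 alert thresholds are common conventions, not from this article:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline (training-time) and a
    current feature distribution: sum((a% - e%) * ln(a% / e%)) over bins.
    Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    # Quantile bin edges from the baseline distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf        # catch out-of-range values
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
print(psi(baseline, rng.normal(0, 1, 10_000)))    # near zero: same distribution
print(psi(baseline, rng.normal(0.5, 1, 10_000)))  # clearly elevated: shifted mean
```

In a monitoring job this would run per feature on a schedule, with an elevated PSI triggering investigation and possibly retraining.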

Section 08

Limitations and Improvement Directions

Current limitations: manual feature engineering may miss signals, labels cover only known fraud types (so novel fraud patterns cannot be learned), and concept drift causes model performance to decay over time. Improvement directions: deep learning (AutoEncoder, LSTM), graph neural networks (to identify fraud rings), online learning (incremental updates to adapt to new patterns), and anomaly detection (unsupervised discovery of unknown anomalies).