Zing Forum

Practical Credit Card Fraud Detection: Machine Learning Solutions for Imbalanced Datasets and Comparison Between XGBoost/LightGBM Models

This article walks through an end-to-end machine learning project for credit card fraud detection. It covers feature engineering, SMOTE oversampling to handle class imbalance, and a comparison of two gradient boosting models, XGBoost and LightGBM. The best reported AUPRC is 0.8815.

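Tags: credit card fraud detection, machine learning, imbalanced datasets, SMOTE, XGBoost, LightGBM, AUPRC, feature engineering, gradient boosting
Published 2026-05-20 10:45 | Recent activity 2026-05-20 10:49 | Estimated read 4 min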

Section 01

Introduction to the Practical Credit Card Fraud Detection Project

This project builds a credit card fraud detector end to end: feature engineering, model training, and evaluation. SMOTE oversampling addresses the class imbalance, XGBoost and LightGBM are compared side by side, and the best reported AUPRC is 0.8815 (XGBoost). The workflow can serve as a reference for similar imbalanced classification problems.

Section 02

Project Background and Core Challenges

The core challenge in credit card fraud detection is extreme class imbalance: fraudulent transactions make up a very small share of all transactions. A model can favor the majority class, for example by labeling every transaction as legitimate, and still report high accuracy while catching no fraud, so accuracy has little practical value here. The project therefore uses AUPRC (area under the precision-recall curve) as its main evaluation metric, which is better suited to imbalanced data.
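The accuracy problem can be made concrete with a small sketch. The class counts below (20 fraud cases in 10,000 transactions) are hypothetical and chosen only for illustration; the article does not state the dataset's actual fraud rate.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true fraud cases (label 1) that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical data: 20 fraudulent and 9,980 legitimate transactions.
y_true = [1] * 20 + [0] * 9980

# A "model" that always predicts the majority class (legitimate).
y_pred = [0] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)

print(f"accuracy: {baseline_accuracy:.3f}")  # high, despite being useless
print(f"recall:   {baseline_recall:.3f}")    # no fraud is caught
```

The baseline scores well on accuracy while detecting nothing, which is why a metric built on precision and recall is the more informative choice.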

Section 03

Technical Methods: Feature Engineering and SMOTE Sampling

For feature engineering, the transaction time is encoded as cyclic features (sine and cosine components) so that the model can capture periodic patterns. SMOTE then synthesizes new minority-class samples by interpolating between existing fraud samples and their nearest neighbors, rather than simply duplicating them, which preserves local structure. SMOTE is applied only to the training set, so the test set keeps the real class distribution and the evaluation stays realistic.
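Both ideas can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's code: `encode_cyclic` and `smote_like` are hypothetical helper names, the 24-hour period is an assumed choice, and `smote_like` is a simplified stand-in for a library implementation of SMOTE.

```python
import math
import random


def encode_cyclic(value, period):
    """Map a periodic value (e.g. hour of day) to (sin, cos) components."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# 23:30 and 00:30 are one hour apart; the encoding keeps them close,
# whereas the raw values 23.5 and 0.5 look far apart.
late = encode_cyclic(23.5, 24)
early = encode_cyclic(0.5, 24)

# Hypothetical fraud samples that belong to the TRAINING split only.
# The test split is never passed to smote_like, so it stays untouched.
train_minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
synthetic = smote_like(train_minority, n_new=5, k=2)
```

In a real project one would more likely split the data first and then call `SMOTE().fit_resample(X_train, y_train)` from the imbalanced-learn package on the training split alone; the sketch above only shows what that step does conceptually.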

Section 04

Model Comparison: XGBoost vs LightGBM

The project compares two gradient boosting models. XGBoost reaches an AUPRC of 0.8815 with 86% recall. LightGBM trains faster and reaches 93% precision, meaning fewer false positives. Hyperparameters are tuned with GridSearchCV.
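To show how two models can be compared on AUPRC, the sketch below computes average precision, a common estimate of the area under the precision-recall curve, from ranked scores. The labels and scores are made up, and `model_a` and `model_b` are hypothetical placeholders rather than the article's XGBoost and LightGBM results.

```python
def average_precision(y_true, scores):
    """Average precision: the mean of the precision values measured at the
    rank of each true positive, with samples sorted by descending score.
    Assumes no tied scores."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Hypothetical ground truth: 2 fraud cases among 8 transactions.
y_true = [1, 0, 1, 0, 0, 0, 0, 0]

# Hypothetical fraud scores from two models on the same transactions.
scores_model_a = [0.90, 0.80, 0.70, 0.30, 0.20, 0.10, 0.05, 0.01]
scores_model_b = [0.90, 0.20, 0.30, 0.80, 0.70, 0.10, 0.05, 0.01]

ap_model_a = average_precision(y_true, scores_model_a)
ap_model_b = average_precision(y_true, scores_model_b)

print(f"model_a AP: {ap_model_a:.4f}")
print(f"model_b AP: {ap_model_b:.4f}")
```

Model A ranks both fraud cases near the top and so scores higher. With scikit-learn available, `sklearn.metrics.average_precision_score` computes the same quantity, and passing `scoring="average_precision"` to GridSearchCV would tune hyperparameters against it; whether the project configured its search that way is not stated in the article.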

Section 05

Model Evaluation and Business Interpretation

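AUPRC is the main metric because it focuses on how well the minority (fraud) class is identified. In business terms, the 86% recall reported for XGBoost means most fraudulent transactions are caught, which reduces fraud losses, while the 93% precision reported for LightGBM means few legitimate transactions are flagged, which reduces the customer friction caused by false alarms. A real deployment has to trade recall against precision, typically through the choice of decision threshold.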
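The recall and precision trade-off can be illustrated by moving the decision threshold over a fixed set of scores. The data and the two thresholds below are hypothetical, and `precision_recall_at` is an illustrative helper rather than part of the project.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged as fraud."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
    # Convention used here: precision is 1.0 when nothing is flagged.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels (1 = fraud) and model scores for 10 transactions.
y_true = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10, 0.05, 0.01]

strict = precision_recall_at(y_true, scores, threshold=0.8)
lenient = precision_recall_at(y_true, scores, threshold=0.3)

print("strict  (0.8): precision=%.2f recall=%.2f" % strict)
print("lenient (0.3): precision=%.2f recall=%.2f" % lenient)
```

The strict threshold raises no false alarms but misses half of the fraud, while the lenient one catches every fraud case at the cost of flagging legitimate transactions. Which point to operate at depends on the relative cost of a missed fraud and a false alarm.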

Section 06

Project Insights and Follow-up Recommendations

The project demonstrates a complete data science workflow, and its code organization (notebook, scripts, and dependency management) is worth studying, which makes it a good introductory reference. A natural next step is to explore model interpretability to support business decisions.