End-to-End Credit Card Fraud Detection on Databricks: A Hands-On Guide to the Medallion Architecture

This project demonstrates how to build a complete credit card fraud detection pipeline on the Databricks platform using PySpark, SparkSQL, and Spark MLlib. It uses the Medallion architecture to process over 280,000 real transaction records, achieving high recall for fraud identification through class imbalance handling, feature engineering, and a Random Forest model.

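Tags: fraud detection, PySpark, Databricks, machine learning, class imbalance, random forest, feature engineering, data engineering, SparkSQL, financial risk control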
Published 2026-05-26 18:15 · Recent activity 2026-05-26 18:23 · Estimated read 6 min

Section 01

Introduction to the End-to-End Credit Card Fraud Detection Project on Databricks

This project builds a complete credit card fraud detection pipeline on Databricks with PySpark, SparkSQL, and Spark MLlib. It follows the Medallion architecture to process over 280,000 real transaction records, and it reaches high recall on fraud through class imbalance handling, feature engineering, and a Random Forest model. The project comes from a GitHub repository maintained by amanthakur-dev, where its original title is 'Credit Card Fraud Detection Pipeline'.

Section 02

Project Background and Core Challenges

Credit card fraud detection is a classic fintech problem. This project uses real transactions made by European cardholders in September 2013: 284,807 transactions, of which only 492 (0.17%) are fraudulent. The core challenges are extreme class imbalance, which makes plain accuracy uninformative; real-time requirements (millisecond-level inference); interpretability, since financial institutions need a basis for each decision; and data privacy, meaning secure handling of sensitive information.
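
To make the accuracy problem concrete, here is a small plain-Python sketch using only the counts quoted above. The "always legitimate" baseline is a hypothetical classifier introduced for illustration, not something from the project.

```python
# Counts reported for the dataset in the text above.
TOTAL = 284_807
FRAUD = 492
LEGIT = TOTAL - FRAUD


def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, recall, and F1 for the fraud (positive) class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical baseline: label every transaction as legitimate.
always_legit = evaluate(tp=0, fp=0, fn=FRAUD, tn=LEGIT)

print(f"fraud share: {FRAUD / TOTAL:.4%}")
print(f"accuracy:    {always_legit['accuracy']:.4%}")
print(f"recall:      {always_legit['recall']:.4%}")
```

The baseline is right on roughly 99.83% of transactions while catching no fraud at all, which is why recall and F1 on the fraud class are more useful targets here than accuracy.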

Section 03

Detailed Design of the Medallion Architecture

The Databricks Medallion architecture is used with three processing layers:

  • Bronze Layer: Raw data ingestion (CSV loading, schema validation, adding ingestion timestamps);
  • Silver Layer: Feature engineering and cleaning (time feature extraction, logarithmic transformation of amount, class weight calculation, partitioning by Class);
  • Gold Layer: Business metrics and model predictions (7 KPI tables for fraud patterns, model prediction results, feature importance analysis, visualization export).
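
The Bronze step in the list above could look roughly like the following PySpark sketch. The table name, the use of Delta as the storage format, and the choice of `FAILFAST` for schema validation are assumptions; the column layout (Time, V1 to V28, Amount, Class) is that of the public dataset.

```python
# Column layout of the public credit card dataset: Time, V1..V28, Amount, Class.
BRONZE_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]

# DDL schema string: Class is read as an integer label, everything else as DOUBLE.
BRONZE_SCHEMA_DDL = ", ".join(
    f"{name} {'INT' if name == 'Class' else 'DOUBLE'}" for name in BRONZE_COLUMNS
)


def ingest_bronze(spark, csv_path, table_name="bronze_transactions"):
    """Load the raw CSV with an explicit schema and stamp rows with ingestion time.

    `spark` is an existing SparkSession; `table_name` is a hypothetical name.
    """
    raw = (
        spark.read
        .option("header", True)
        .option("mode", "FAILFAST")  # assumption: abort on rows that break the schema
        .schema(BRONZE_SCHEMA_DDL)
        .csv(csv_path)
    )
    # Add the ingestion timestamp without touching the raw columns.
    bronze = raw.selectExpr("*", "current_timestamp() AS ingested_at")
    # Assumption: Bronze is persisted as a Delta table.
    bronze.write.format("delta").mode("overwrite").saveAsTable(table_name)
    return bronze
```

In a Databricks notebook, where `spark` is already defined, this would be called as `ingest_bronze(spark, csv_path)` with whatever location holds the CSV. Declaring the schema up front, rather than inferring it, keeps a malformed file from silently changing column types downstream.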

Section 04

Feature Engineering and Fraud Behavior Insights

Seven key features are designed:

  • hour_of_day: hour of the transaction, for aggregation by time period;
  • day_number: day index, for cross-date patterns;
  • is_night: flags early-morning transactions, which carry higher risk;
  • amount_log: log-transformed amount, to dampen outliers;
  • is_small_amount: flags small amounts typical of card testing;
  • is_large_amount: flags high-value transactions;
  • class_weight: weight that compensates for the minority class.

Key insights: the fraud rate is higher at night, small transactions (under $10) are often card tests, and fraud counts are higher on the following day.
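
One way the seven features could be expressed as Spark SQL expressions is sketched below. Only the $10 cut-off comes from the text; the night window, the large-amount threshold, and the balanced weighting formula are assumptions chosen for illustration. `Time` in the dataset is seconds elapsed since the first transaction, so `hour_of_day` only matches clock time if recording started near midnight.

```python
# Thresholds are assumptions, except the $10 small-amount cut-off from the text.
NIGHT_END_HOUR = 6          # assumption: "night" covers hours 0 through 5
SMALL_AMOUNT_MAX = 10.0     # from the text: card tests are often under $10
LARGE_AMOUNT_MIN = 1000.0   # assumption: illustrative high-value cut-off

# Time is seconds since the first transaction in the dataset.
HOUR_SQL = "(floor(Time / 3600) % 24)"


def balanced_class_weights(counts):
    """Weight each class by total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}


def silver_feature_exprs(weights):
    """Return Spark SQL expressions for the seven engineered columns."""
    return [
        f"CAST({HOUR_SQL} AS INT) AS hour_of_day",
        "CAST(floor(Time / 86400) AS INT) AS day_number",
        f"CAST({HOUR_SQL} < {NIGHT_END_HOUR} AS INT) AS is_night",
        "log1p(Amount) AS amount_log",
        f"CAST(Amount < {SMALL_AMOUNT_MAX} AS INT) AS is_small_amount",
        f"CAST(Amount > {LARGE_AMOUNT_MIN} AS INT) AS is_large_amount",
        f"CAST(CASE WHEN Class = 1 THEN {weights[1]} ELSE {weights[0]} END "
        "AS DOUBLE) AS class_weight",
    ]


def add_silver_features(bronze_df, weights):
    """Append the engineered columns to a Bronze DataFrame."""
    return bronze_df.selectExpr("*", *silver_feature_exprs(weights))


# Class counts quoted earlier: 492 frauds out of 284,807 transactions.
weights = balanced_class_weights({0: 284_807 - 492, 1: 492})
print(weights)
```

With these counts each fraud row weighs a few hundred times more than a legitimate row, so both classes contribute about equally to a weighted loss. Other weighting schemes are possible; the source does not state which one the project uses.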

Section 05

Model Training and Performance Evaluation

Training process: Silver layer data → VectorAssembler → StandardScaler → 80/20 split → Logistic Regression (baseline) and Random Forest (100 trees, maximum depth 10). Performance comparison: Random Forest scores higher on metrics such as recall (0.9995) and F1 (0.9995). Because a missed fraud is costly, Random Forest is the model chosen. Feature importance: V14, V17, and V12 are the top three original features; the engineered features amount_log and hour_of_day rank among the top 15.
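
The training flow above might be wired together as in this Spark MLlib sketch. The feature selection, the intermediate column names, and the use of `class_weight` as `weightCol` are assumptions. One deliberate difference from the order listed: the scaler sits inside the `Pipeline`, so it is fitted on the training split only, which avoids leaking test-set statistics.

```python
# Assumed input columns: the 28 PCA components plus six engineered features.
FEATURE_COLS = [f"V{i}" for i in range(1, 29)] + [
    "amount_log", "hour_of_day", "day_number",
    "is_night", "is_small_amount", "is_large_amount",
]
SEED = 42  # fixed seed, as in the project


def build_rf_pipeline():
    """Assemble -> scale -> Random Forest (100 trees, max depth 10)."""
    # Imported here so the constants above can be inspected without Spark.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="raw_features")
    scaler = StandardScaler(
        inputCol="raw_features", outputCol="features", withMean=True, withStd=True
    )
    forest = RandomForestClassifier(
        featuresCol="features",
        labelCol="Class",
        weightCol="class_weight",  # assumption: weights computed in the Silver layer
        numTrees=100,
        maxDepth=10,
        seed=SEED,
    )
    return Pipeline(stages=[assembler, scaler, forest])


def train_and_score(silver_df):
    """Split 80/20, fit the pipeline, and return (model, recall on the fraud class)."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = silver_df.randomSplit([0.8, 0.2], seed=SEED)
    model = build_rf_pipeline().fit(train)
    predictions = model.transform(test)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="Class",
        predictionCol="prediction",
        metricName="recallByLabel",
        metricLabel=1.0,  # recall for the fraud label only
    )
    return model, evaluator.evaluate(predictions)
```

The evaluator here reports recall for the fraud label specifically; that is not necessarily how the figures quoted above were computed, so the numbers may not be directly comparable. Swapping `RandomForestClassifier` for `LogisticRegression` in the last stage would give the baseline.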

Section 06

Engineering Optimization and Practical Application Value

Optimization practices: partition by Class to avoid full table scans, adjust class weights to handle imbalance, cache temporary views that are queried repeatedly, and defer materialization (Spark's lazy evaluation) to reduce transformation costs. Reproducibility: a fixed seed of 42, notebooks integrated with Git, and data lineage tracking. Application value: the Medallion architecture carries over to other risk control scenarios; the weighting strategy can be reused for other imbalanced classification problems; the feature engineering ideas are general; and the Spark techniques apply widely.
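
A rough sketch of the partitioning and caching practices follows. The table name, view name, Delta format, and the example KPI query are hypothetical; `hour_of_day` and `Class` are the columns described earlier.

```python
def write_partitioned_silver(silver_df, table_name="silver_transactions"):
    """Persist the Silver data partitioned by Class, so label filters prune files."""
    (
        silver_df.write
        .format("delta")          # assumption: Delta table storage
        .mode("overwrite")
        .partitionBy("Class")
        .saveAsTable(table_name)
    )


def cache_view(spark, df, view_name="silver_tx"):
    """Register a temp view and cache it for repeated SQL queries."""
    df.createOrReplaceTempView(view_name)
    spark.catalog.cacheTable(view_name)
    # Run one action so the cache is populated before the repeated queries start.
    spark.table(view_name).count()


def fraud_rate_sql(view_name="silver_tx"):
    """Build an example KPI query: fraud rate per hour of day."""
    return (
        "SELECT hour_of_day, COUNT(*) AS tx_count, AVG(Class) AS fraud_rate "
        f"FROM {view_name} GROUP BY hour_of_day ORDER BY hour_of_day"
    )
```

After `cache_view(spark, silver_df)`, a call such as `spark.sql(fraud_rate_sql())` reads from the cached view instead of recomputing the Silver transformations. With only two values of Class, partitioning yields two partitions, one of them very small, so the benefit is mostly for queries that filter on fraud rows.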

Section 07

Project Limitations and Improvement Directions

Current limitations: data timeliness (the data dates from 2013), feature anonymization (V1-V28 are PCA-transformed components, which limits interpretability), and a single-model strategy. Improvement directions: real-time stream processing (Spark Streaming/Delta Live Tables), model interpretability (SHAP values), anomaly detection (Isolation Forest), and an A/B testing framework.