Zing Forum

Hands-On Project for Customer Churn Prediction Based on Databricks Lakehouse Architecture

A complete telecom customer analysis and churn prediction project using the Bronze/Silver/Gold three-tier Lakehouse architecture, combining PySpark, Delta Lake, and machine learning to implement the full workflow from data ingestion to business insights

Tags: Databricks, Lakehouse, Churn Prediction, PySpark, Delta Lake, Machine Learning, Customer Analytics, Data Engineering
Published 2026-06-11 21:46 | Recent activity 2026-06-11 21:51 | Estimated read 8 min

Section 01

Introduction to the Hands-On Project for Customer Churn Prediction Based on Databricks Lakehouse Architecture

This is a complete hands-on project for telecom customer analysis and churn prediction, maintained by Andre-Lutes. The source code is available on GitHub: https://github.com/Andre-Lutes/databricks-customer-analytics-churn. The project uses the Bronze/Silver/Gold three-tier Lakehouse architecture and combines PySpark, Delta Lake, and machine learning to implement the full workflow from data ingestion to business insights. It aims to help telecom operators identify customers at risk of churning and to support retention decisions.

Section 02

Project Background and Objectives

In the telecom industry, losing a customer costs far more than retaining one: the project cites the cost of acquiring a new customer as more than 5 times that of retaining an existing one. The project has four objectives: build a complete data pipeline on the Lakehouse architecture, analyze customer behavior patterns to identify the key factors behind churn, train a prediction model that grades customers by churn risk, and simulate a real enterprise scenario that shows how a decision support system is built from raw data.

Section 03

Technology Stack and Architecture Design

Technology Stack: Databricks (unified analytics platform), PySpark (large-scale data processing), Spark SQL (structured query), Delta Lake (reliable storage layer), Python/Pandas (data exploration), Scikit-learn (ML training), Logistic Regression (baseline model).

Three-tier Lakehouse Architecture: Raw data → Bronze layer (ingestion and raw storage) → Silver layer (cleaning, transformation, standardization) → Gold layer (analysis tables and business metrics) → Machine learning (churn prediction) → SQL analysis (business insights). Each layer has a single clear responsibility, data lineage stays traceable from layer to layer, and the design supports incremental processing.

Section 04

Details of the Three Data Processing Layers

Bronze Layer: Ingest the raw telecom customer data (7043 records, 21 fields) and keep it in its original state. Validate the total record count, field completeness, and schema, and add metadata such as an ingestion timestamp.
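The Bronze checks described above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the function name `ingest_bronze`, the `_ingested_at` metadata column, and the sample rows are assumptions, and in the project these steps would run in PySpark and write to a Delta table.

```python
from datetime import datetime, timezone

def ingest_bronze(rows, expected_fields):
    """Validate raw rows without altering their values, then tag them with metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    report = {"total_records": len(rows), "schema_errors": 0, "missing_values": {}}
    bronze = []
    for row in rows:
        # Schema check: every row must carry exactly the expected field names.
        if set(row) != set(expected_fields):
            report["schema_errors"] += 1
        # Completeness check: count empty or absent values per field.
        for field in expected_fields:
            if row.get(field) in (None, ""):
                report["missing_values"][field] = report["missing_values"].get(field, 0) + 1
        # Keep the raw values untouched; only add the (assumed) metadata column.
        bronze.append({**row, "_ingested_at": ingested_at})
    return bronze, report

# Hypothetical sample rows; the real dataset has 21 fields.
raw = [
    {"customer_id": "A1", "tenure": "1", "total_charges": "29.85"},
    {"customer_id": "B2", "tenure": "0", "total_charges": ""},
]
bronze_rows, quality_report = ingest_bronze(raw, ["customer_id", "tenure", "total_charges"])
print(quality_report)
```

The empty `total_charges` value is reported but not repaired here, which matches the idea that Bronze preserves the source data and leaves cleaning to the Silver layer.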

Silver Layer: Standardize field names, convert data types (including handling empty strings in total_charges), and engineer features: a binary churn_flag, a tenure_group that groups customers by tenure, and a monthly_charges_group that bins monthly charges. The churn rate after cleaning is 26.54%.
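A minimal sketch of the Silver transformations follows, written in plain Python so it runs without a Spark cluster. The post does not give the bin edges, the raw column names, or the raw churn labels, so the tenure bands (other than the 0-12 month band mentioned later), the monthly charge thresholds, the `"Yes"`/`"No"` labels, and the function name `to_silver` are all assumptions for illustration.

```python
def to_silver(row):
    """Clean and enrich one Bronze record (bin edges are illustrative assumptions)."""
    raw_total = (row.get("total_charges") or "").strip()
    # Empty (or whitespace-only) strings cannot be cast to a number,
    # so they become None instead of raising an error.
    total_charges = float(raw_total) if raw_total else None
    tenure = int(row["tenure"])
    monthly = float(row["monthly_charges"])

    # Assumed tenure bands in months.
    if tenure <= 12:
        tenure_group = "0-12"
    elif tenure <= 24:
        tenure_group = "13-24"
    elif tenure <= 48:
        tenure_group = "25-48"
    else:
        tenure_group = "49+"

    # Assumed price bands for monthly charges.
    if monthly < 35:
        charges_group = "low"
    elif monthly < 70:
        charges_group = "medium"
    else:
        charges_group = "high"

    return {
        "customer_id": row["customer_id"],
        "tenure": tenure,
        "monthly_charges": monthly,
        "total_charges": total_charges,
        # Binary target: 1 for churned customers, 0 otherwise.
        "churn_flag": 1 if row["churn"].strip().lower() == "yes" else 0,
        "tenure_group": tenure_group,
        "monthly_charges_group": charges_group,
    }

silver = to_silver({"customer_id": "B2", "tenure": "0", "monthly_charges": "52.55",
                    "total_charges": "", "churn": "No"})
print(silver)
```

In PySpark the same logic would typically be expressed with `withColumn` and `when`/`otherwise` expressions rather than a per-row function.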

Gold Layer: Create business-oriented analysis tables, such as gold_customer_analytics (comprehensive analysis), gold_churn_kpis (churn KPIs), gold_churn_by_contract (churn rate by contract), and other subject tables.
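As an illustration of what a table such as gold_churn_by_contract might hold, the sketch below aggregates Silver rows into one row per contract type. The column names (`contract`, `customers`, `churned`, `churn_rate_pct`) and the sample values are assumptions; the project itself would build this table with Spark SQL or PySpark.

```python
from collections import defaultdict

def churn_by_contract(silver_rows):
    """Aggregate Silver rows into one row per contract type with its churn rate."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in silver_rows:
        totals[row["contract"]] += 1
        churned[row["contract"]] += row["churn_flag"]
    return [
        {
            "contract": contract,
            "customers": totals[contract],
            "churned": churned[contract],
            "churn_rate_pct": round(100.0 * churned[contract] / totals[contract], 2),
        }
        for contract in sorted(totals)
    ]

# Hypothetical Silver rows, reduced to the two columns the aggregation needs.
sample = [
    {"contract": "Month-to-month", "churn_flag": 1},
    {"contract": "Month-to-month", "churn_flag": 0},
    {"contract": "Two year", "churn_flag": 0},
    {"contract": "Two year", "churn_flag": 0},
]
gold_churn_by_contract = churn_by_contract(sample)
for line in gold_churn_by_contract:
    print(line)
```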

Section 05

Key Business Insights

Key factors affecting churn identified through multi-dimensional analysis:

  1. Contract Type: Monthly contract churn rate is 42.71%, more than 15 times that of two-year contracts (2.83%);
  2. Customer Tenure: New customers with 0-12 months tenure have a churn rate of 47.44%, and loyalty increases with tenure;
  3. Payment Method: Electronic check has the highest churn rate at 45.29%, while automatic payment methods have better retention rates;
  4. Network Service: Fiber optic users have the highest churn rate at 41.89%, possibly related to competition or high expectations.
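The "more than 15 times" comparison in the first item can be checked directly from the two reported rates. The variable names are only for illustration.

```python
# Churn rates in percent, as reported in the contract-type insight.
monthly_rate = 42.71
two_year_rate = 2.83

ratio = monthly_rate / two_year_rate
print(f"{ratio:.1f}x")  # prints "15.1x"
```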

Section 06

Machine Learning Model and Performance

Model Construction: Logistic Regression serves as the baseline model. The process covers data splitting, feature encoding (OneHotEncoder), feature scaling (StandardScaler), training, and risk grading by predicted churn probability (high >70%, medium 40-70%, low <40%).
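The risk grading step can be sketched as a small function over the predicted churn probability. This covers only the grading, not model training. The function name `risk_tier` is hypothetical, and treating exactly 40% and 70% as medium is a reading of the stated ranges, not something the post spells out.

```python
def risk_tier(churn_probability):
    """Map a predicted churn probability (0.0 to 1.0) to a risk tier."""
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    if churn_probability > 0.70:
        return "high"
    if churn_probability >= 0.40:
        # Assumption: the 40% and 70% boundaries both fall in the medium tier.
        return "medium"
    return "low"

tiers = [risk_tier(p) for p in (0.85, 0.70, 0.40, 0.12)]
print(tiers)  # ['high', 'medium', 'medium', 'low']
```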

Performance Evaluation: Accuracy 80.55%, Precision 65.72%, Recall 55.88%, F1 60.40%, ROC AUC 84.21% (good discriminative ability). Risk group validation shows that predicted probabilities are highly consistent with actual churn rates (actual churn rate of the high-risk group is 74.19%).
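As a consistency check, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The helper name `harmonic_f1` is only for illustration.

```python
def harmonic_f1(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision 65.72% and recall 55.88%, as reported above.
f1 = harmonic_f1(0.6572, 0.5588)
print(f"{f1:.4f}")  # prints "0.6040", matching the reported 60.40%
```

The gap between accuracy (80.55%) and recall (55.88%) is worth keeping in mind: with a churn rate of 26.54%, a model can score well on accuracy while still missing a large share of actual churners.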

Section 07

Business Application Value

The project provides a complete churn prediction solution for enterprises:

  • Priority Sorting: Generate risk rankings via the gold_churn_predictions table to prioritize high-risk customers;
  • Precision Marketing: Design differentiated retention strategies for different risk levels and churn factors (e.g., offer long-term contract discounts to monthly contract customers);
  • Real-time Monitoring: Connect tools like Power BI to build a churn monitoring dashboard;
  • Resource Optimization: Concentrate resources on high-risk customers to improve intervention efficiency and ROI.
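The priority sorting idea can be sketched as a simple ranking over prediction rows. The post names the gold_churn_predictions table but not its columns, so `customer_id`, `churn_probability`, the sample values, and the function `top_at_risk` are assumptions.

```python
def top_at_risk(predictions, limit):
    """Return the `limit` customers with the highest predicted churn probability."""
    ranked = sorted(predictions, key=lambda row: row["churn_probability"], reverse=True)
    return ranked[:limit]

# Hypothetical rows standing in for gold_churn_predictions.
predictions = [
    {"customer_id": "A1", "churn_probability": 0.35},
    {"customer_id": "B2", "churn_probability": 0.82},
    {"customer_id": "C3", "churn_probability": 0.58},
]
priority_list = top_at_risk(predictions, limit=2)
print([row["customer_id"] for row in priority_list])  # ['B2', 'C3']
```

A retention team with a fixed outreach budget could use `limit` to cap the list at the number of customers it can contact.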

Section 08

Summary and Insights

The project demonstrates the Lakehouse architecture applied to a real business scenario, with a clear design principle for each stage from data ingestion to insight generation. Key takeaways: the layered architecture keeps the pipeline clear and maintainable; data quality is the foundation of credible analysis; business insights matter more than model complexity; and a model only generates value when it is tied to a business scenario. The project offers a reference code structure and implementation ideas, and suits data engineers and analysts who want to learn modern data architecture practices.