Zing Forum

Hadoop Ecosystem-Based Big Data Analysis of Global Life Expectancy: From Data Cleaning to Machine Learning Prediction

This article introduces a complete big data project built on the WHO global life expectancy dataset. Using a tech stack that includes Hadoop, Spark SQL, MLlib, and Cassandra, it analyzes life expectancy trends across countries from 2000 to 2019 and builds machine learning prediction models.

Tags: Big Data, Hadoop, Spark SQL, Machine Learning, Life Expectancy, WHO, Data Cleaning, Cassandra, Healthcare, Data Analysis
Published 2026-06-15 16:16 | Recent activity 2026-06-15 16:21 | Estimated read 7 min

Section 01

Introduction: Overview of the Hadoop Ecosystem-Based Big Data Analysis Project on Global Life Expectancy

This article presents a complete big data project that uses the WHO global life expectancy dataset (2000-2019) and a tech stack including Hadoop, Spark SQL, MLlib, and Cassandra to perform end-to-end analysis from data cleaning to machine learning prediction. It aims to reveal global life expectancy trends, gender differences, and inter-country gaps, and provide actionable healthcare recommendations.

Section 02

Project Background and Objectives

Project Background

In the global public health field, life expectancy is a core indicator of residents' health status and the effectiveness of healthcare systems. The WHO's public life expectancy dataset covers data from multiple countries worldwide from 2000 to 2019, providing a foundation for analyzing health trends.

Project Objectives

Developed by a Master's student in Data Science at the National University of Malaysia, this project aims to build an end-to-end big data analysis workflow. Core objectives include: analyzing global life expectancy time trends, comparing gender differences, identifying countries with the highest/lowest life expectancy, applying advanced Spark SQL analysis, building machine learning prediction models, and generating healthcare recommendations.

Section 03

Data Architecture and Tech Stack

The project adopts a complete big data ecosystem with core technical components including:

  • Hadoop HDFS: Distributed storage layer that hosts the original WHO dataset
  • Apache Pig: ETL processing for initial data cleaning and formatting
  • Apache Spark: Core computing engine supporting batch processing and interactive queries
  • Spark SQL: Structured data querying and aggregate analysis
  • Spark MLlib: Machine learning algorithm library for training and evaluating prediction models
  • Apache Cassandra: NoSQL database for storing analysis results
  • Zeppelin Notebook: Interactive analysis environment for data exploration and visualization

The multi-layered architecture ensures the complete flow of data from source to insight, with components collaborating to build a scalable, high-performance analysis platform.

Section 04

Data Preprocessing and Cleaning Process

The original WHO dataset requires strict quality control. The key steps are:

  1. Remove null-value records to ensure field completeness
  2. Deduplicate to eliminate interference from duplicate records
  3. Filter relevant features and remove redundant fields
  4. Validate and convert data types
  5. Simplify column names and optimize data structure

The cleaned dataset contains 12,936 valid records covering multiple countries worldwide over a 20-year span, laying the foundation for subsequent analysis.
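
The five cleaning steps above can be illustrated with a minimal plain-Python sketch. This is not the project's code (the article attributes the cleaning to Apache Pig and Spark); the column names, the COLUMNS mapping, and the sample rows are all hypothetical, chosen only to show each step on a few records. The sketch filters features before deduplicating, so rows that differ only in a dropped field count as duplicates.

```python
from typing import Dict, List, Optional

# Hypothetical raw rows: column names and values are illustrative only,
# not taken from the actual WHO file.
RAW_ROWS: List[Dict[str, Optional[str]]] = [
    {"Country Name": "Country A", "Year": "2000", "Sex": "Female", "Life Expectancy": "70.1", "Notes": "n/a"},
    {"Country Name": "Country A", "Year": "2000", "Sex": "Female", "Life Expectancy": "70.1", "Notes": "dup"},
    {"Country Name": "Country A", "Year": "2000", "Sex": "Male", "Life Expectancy": "", "Notes": ""},
    {"Country Name": "Country B", "Year": "20x0", "Sex": "Male", "Life Expectancy": "64.3", "Notes": ""},
    {"Country Name": "Country B", "Year": "2019", "Sex": "Male", "Life Expectancy": "66.0", "Notes": ""},
]

# Steps 3 and 5: the fields to keep, mapped to simplified names (assumed names).
COLUMNS = {
    "Country Name": "country",
    "Year": "year",
    "Sex": "sex",
    "Life Expectancy": "life_exp",
}


def clean(rows: List[Dict[str, Optional[str]]]) -> List[Dict[str, object]]:
    """Apply the five cleaning steps to a list of raw string records."""
    cleaned: List[Dict[str, object]] = []
    seen = set()
    for row in rows:
        # Steps 3 and 5: keep only the relevant fields under short names.
        slim = {new: row.get(old) for old, new in COLUMNS.items()}
        # Step 1: drop records with a missing or empty field.
        if any(v is None or str(v).strip() == "" for v in slim.values()):
            continue
        # Step 4: validate and convert types; unparseable rows are dropped.
        try:
            record = (
                str(slim["country"]).strip(),
                int(str(slim["year"])),
                str(slim["sex"]).strip(),
                float(str(slim["life_exp"])),
            )
        except ValueError:
            continue
        # Step 2: skip records already seen.
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(dict(zip(("country", "year", "sex", "life_exp"), record)))
    return cleaned


for cleaned_row in clean(RAW_ROWS):
    print(cleaned_row)
```

On these five sample rows, the duplicate, the row with an empty value, and the row with an unparseable year are dropped, leaving two records.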

Section 05

Key Findings from Exploratory Data Analysis

Significant Gender Differences

The average life expectancy for women is 72.60 years, and for men it is 67.76 years. Women therefore live about 4.8 years longer than men on average, which aligns with global demographic studies.
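
As a rough illustration of how such per-group averages and the resulting gap can be computed, here is a small plain-Python sketch. The helper mean_by_group and the SAMPLE records are hypothetical; only the two reported averages (72.60 and 67.76) come from the article, and the project itself would have computed them with Spark SQL aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def mean_by_group(
    records: Iterable[Mapping[str, object]], group_key: str, value_key: str
) -> Dict[str, float]:
    """Average value_key per distinct value of group_key."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for rec in records:
        group = str(rec[group_key])
        sums[group] += float(rec[value_key])
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Hypothetical sample records, not real WHO values.
SAMPLE = [
    {"sex": "Female", "life_exp": 74.0},
    {"sex": "Female", "life_exp": 71.0},
    {"sex": "Male", "life_exp": 69.0},
    {"sex": "Male", "life_exp": 66.0},
]
sample_means = mean_by_group(SAMPLE, "sex", "life_exp")
print(sample_means)

# Averages reported in the article.
REPORTED_FEMALE_AVG = 72.60
REPORTED_MALE_AVG = 67.76
reported_gap = round(REPORTED_FEMALE_AVG - REPORTED_MALE_AVG, 2)
print(reported_gap)  # gap in years between the two reported averages
```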

Positive Time Trend

Global average life expectancy increased from 66.98 years in 2000 to 72.65 years in 2019, reflecting progress in healthcare, public health, and other fields.
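
The size of this change can be worked out directly from the two reported endpoints. The sketch below is only arithmetic on the article's figures; the function name summarize_trend is hypothetical, and the per-year figure assumes a steady change between 2000 and 2019, which the article does not claim.

```python
from typing import Dict

# Endpoints reported in the article.
START_YEAR, START_AVG = 2000, 66.98
END_YEAR, END_AVG = 2019, 72.65


def summarize_trend(
    start_year: int, start_value: float, end_year: int, end_value: float
) -> Dict[str, float]:
    """Total, average per-year, and relative change between two endpoints."""
    if end_year <= start_year:
        raise ValueError("end_year must be later than start_year")
    total_gain = end_value - start_value
    years = end_year - start_year
    return {
        "total_gain": total_gain,
        # Simple average over the interval; assumes nothing about the path.
        "gain_per_year": total_gain / years,
        "relative_gain_pct": 100.0 * total_gain / start_value,
    }


trend = summarize_trend(START_YEAR, START_AVG, END_YEAR, END_AVG)
print({name: round(value, 2) for name, value in trend.items()})
```

By this calculation the global average rose by 5.67 years over the 19-year interval, or roughly 0.3 years per calendar year.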

Significant Inter-Country Gaps

Countries with high life expectancy tend to have well-developed healthcare systems and high incomes, while countries with low life expectancy face issues such as insufficient healthcare resources and poverty. For the latter group, healthcare investment is key.

Section 06

Machine Learning Model Construction and Evaluation Results

Three algorithms were implemented using Spark MLlib:

  • Decision Tree: 44.49% accuracy, high interpretability
  • Random Forest: 44.79% accuracy (highest), good at capturing non-linear patterns
  • Logistic Regression: 39.01% accuracy, high computational efficiency

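Evaluation shows that the tree-based algorithms outperformed Logistic Regression by roughly 5.5 to 5.8 percentage points on this dataset. Random Forest scored highest, although its margin over Decision Tree is only 0.30 percentage points, and it is the model the project selects.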
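
For readers unfamiliar with the metric, the following plain-Python sketch shows how classification accuracy is defined and how the three reported scores compare. It is not the project's Spark MLlib code. The article does not say how life expectancy was turned into class labels, so the accuracy function is shown on its own, and only the REPORTED_ACCURACY values come from the article.

```python
from typing import Dict, Sequence


def accuracy(y_true: Sequence[object], y_pred: Sequence[object]) -> float:
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Accuracy values (in percent) as reported in the article.
REPORTED_ACCURACY: Dict[str, float] = {
    "Decision Tree": 44.49,
    "Random Forest": 44.79,
    "Logistic Regression": 39.01,
}

# Pick the model with the highest reported accuracy.
best_model = max(REPORTED_ACCURACY, key=lambda name: REPORTED_ACCURACY[name])

# Gap, in percentage points, between the best model and each model.
margins = {
    name: round(REPORTED_ACCURACY[best_model] - score, 2)
    for name, score in REPORTED_ACCURACY.items()
}

print(best_model)
print(margins)
```

Comparing margins rather than only the ranking makes it visible that the two tree-based models are nearly tied, while both are clearly ahead of Logistic Regression.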

Section 07

Core Insights and Policy Recommendations

Key Findings

  1. Global life expectancy increased significantly from 2000 to 2019
  2. Women's life expectancy is consistently higher than men's, so gender health gaps need attention
  3. Inter-country life expectancy differences are closely related to economy and healthcare resources
  4. Healthcare system development has a positive impact on life expectancy
  5. ML models, especially tree-based ones, can capture patterns in healthcare data, although the best reported accuracy was 44.79%

Policy Recommendations

  1. Increase investment in preventive healthcare programs
  2. Improve healthcare accessibility in low-life-expectancy countries
  3. Expand public health education outreach
  4. Formulate policies by learning from the experiences of high-life-expectancy countries