Zing Forum

Classification of Air Quality Index in Philippine Cities: A Comparative Study of Four Machine Learning Models

A study on Air Quality Index (AQI) classification in Philippine urban environments. It compares four models (SVM, LightGBM, CatBoost, and an MLP neural network) on pollutant concentration data and finds that LightGBM performs best.

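Tags: Air Quality Index, Machine Learning, LightGBM, CatBoost, SVM, Neural Network, Philippines, Environment, Data Science, Classification Algorithms, Gradient Boosting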
Published 2026-05-19 22:15 | Recent activity 2026-05-19 22:18 | Estimated read 8 min

Section 01

[Introduction] Classification of AQI in Philippine Cities: A Comparative Study of Four Machine Learning Models

This study addresses Air Quality Index (AQI) classification in Philippine cities and compares four models: Support Vector Machine (SVM), LightGBM, CatBoost, and a Multi-Layer Perceptron (MLP) neural network. The key finding is that LightGBM performs best on this structured environmental sensor data (accuracy 0.9969, macro-average F1 0.9704). The results offer a technical reference for environmental monitoring departments and a reusable analytical framework for similar cities in other developing countries.

Section 02

Research Background and Motivation

Air pollution has become a major environmental and public health challenge in urban areas of the Philippines. Rapid urbanization has worsened air quality through industrial emissions, traffic exhaust, and other sources, with consequences for residents' health. Traditional monitoring provides real-time data but lacks analysis and prediction capabilities. This project applies machine learning to AQI classification in major Philippine cities and compares the performance of several algorithms, aiming to give environmental monitoring departments a technical reference and to offer an analytical framework for similar cities in other developing countries.

Section 03

Dataset and Model Methods

Dataset

The dataset is "PH Philippine Cities Air Quality Index Data 2025" from Kaggle. It contains 2025 monitoring records from multiple cities, preprocessed and merged into a single dataset. The input variables are pollutant concentrations (CO, NO, NO2, O3, SO2, PM2.5, PM10, NH3), time features (month, day of the week, hour), and a geographic feature (city name). The target variable is the AQI class (levels 1-5), which makes this a multi-class classification problem.
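
As a rough illustration of how the variables listed above might be assembled into one feature row, here is a minimal sketch using only the Python standard library. The field names (`timestamp`, `pm2_5`, and so on) and the sample record are hypothetical; the actual column names in the Kaggle dataset may differ.

```python
from datetime import datetime

# Hypothetical column names for the eight pollutant concentrations.
POLLUTANTS = ["co", "no", "no2", "o3", "so2", "pm2_5", "pm10", "nh3"]

def build_features(record):
    """Turn one raw monitoring record into a flat feature dict."""
    ts = datetime.fromisoformat(record["timestamp"])
    features = {name: float(record[name]) for name in POLLUTANTS}
    features["month"] = ts.month
    features["day_of_week"] = ts.weekday()  # Monday = 0, Sunday = 6
    features["hour"] = ts.hour
    features["city"] = record["city"]       # categorical; encode before training
    return features

# Made-up record for illustration only, not taken from the dataset.
sample = {
    "timestamp": "2025-03-14T08:00:00", "city": "Manila",
    "co": 320.4, "no": 0.1, "no2": 12.3, "o3": 45.0,
    "so2": 3.2, "pm2_5": 18.7, "pm10": 25.1, "nh3": 1.4,
    "aqi": 2,  # target label (1-5), kept separate from the features
}
row = build_features(sample)
print(row["month"], row["day_of_week"], row["hour"], row["city"])
```

The city name is left as a string here. CatBoost can consume categorical columns directly, while the other three models would need it encoded numerically first.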

Model Architecture

Four representative models were selected:

  1. SVM: A baseline model with a solid theoretical foundation, but slow to train on large datasets;
  2. LightGBM: A gradient boosting framework by Microsoft, optimized for training speed and memory efficiency using histogram algorithms;
  3. CatBoost: An open-source library by Yandex, optimized for handling categorical features;
  4. MLP Neural Network: A fully connected structure using Adam optimizer and cross-entropy loss.
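
The "histogram algorithms" mentioned for LightGBM can be illustrated in simplified form: each continuous feature is bucketed into a small number of bins, so candidate splits are evaluated at bin boundaries rather than at every distinct value. The sketch below uses equal-width bins and made-up PM2.5 readings purely for brevity. It is not LightGBM's actual binning code, and the function names are hypothetical.

```python
import bisect
from collections import Counter

def fit_bin_edges(values, n_bins=4):
    """Equal-width interior bin edges (a simplification for illustration)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]

def to_bin(value, edges):
    """Index of the bin that a value falls into."""
    return bisect.bisect_right(edges, value)

# Made-up PM2.5 readings, not taken from the dataset.
pm25 = [3.0, 8.5, 12.0, 18.7, 25.0, 40.0, 55.5, 83.0]
edges = fit_bin_edges(pm25)
bins = [to_bin(v, edges) for v in pm25]
histogram = Counter(bins)

print(edges)      # 3 candidate split points instead of 7
print(bins)
print(histogram)
```

With eight distinct values, an exact split search would consider seven thresholds. After binning, only the three bin edges remain, which is where the speed and memory savings come from.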

Experimental Workflow

The experiment follows a standard workflow: data acquisition → integration → cleaning → feature engineering → stratified split into training, validation, and test sets (70/15/15) → model training → evaluation (metrics such as accuracy and macro-average F1 score).
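
The stratified 70/15/15 step can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function name, the seed value 42, and the toy label distribution are all assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Toy imbalanced labels for illustration only.
labels = [1] * 100 + [2] * 60 + [3] * 20 + [4] * 20
train, val, test = stratified_split(labels)

for name, split in (("train", train), ("val", val), ("test", test)):
    print(name, len(split), dict(Counter(labels[i] for i in split)))
```

Splitting each class separately keeps rare AQI levels represented in all three sets, which matters when macro-average F1 is one of the reported metrics.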

Section 04

Experimental Results and Analysis

Performance comparison of each model:

Model              | Accuracy | Macro-average F1
-------------------|----------|-----------------
LightGBM           | 0.9969   | 0.9704
CatBoost           | 0.9926   | 0.9338
MLP Neural Network | 0.9493   | 0.8381
SVM                | 0.8461   | 0.6010

Analysis: The gradient boosting models (LightGBM, CatBoost) clearly outperformed the others, with LightGBM best on both metrics. The MLP performed moderately. SVM performed worst and showed limitations on this imbalanced multi-class problem: its macro-average F1 (0.6010) is far below its accuracy (0.8461). Feature importance analysis showed that pollutant concentrations (PM2.5, PM10, etc.) contributed significantly more than time and geographic features.
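
A gap between accuracy and macro-average F1 is what one would expect when a model does well on frequent classes and poorly on rare ones. The toy example below shows the effect with made-up labels; it is not the study's data, and the helper functions are written from the standard metric definitions.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced case: 90 samples of level 1, 10 samples of level 5.
y_true = [1] * 90 + [5] * 10
y_pred = [1] * 100  # a classifier that always predicts the majority class

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")
```

Here accuracy looks respectable while macro-average F1 is pulled down by the class that is never predicted, which is why reporting both metrics is informative for imbalanced AQI levels.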

Section 05

Technical Implementation and Reproducibility

The project provides a complete reproducible solution:

  • Dependency management: requirements.txt defines Python dependencies;
  • Interactive Notebook: includes the full workflow from data download to evaluation;
  • Document output: PDF reports and research papers;
  • Result archiving: models, metrics, and visualizations are stored in the outputs directory.

A fixed random seed is used to ensure result reproducibility, and stratified sampling ensures consistent class ratios in the training/validation/test sets.

Section 06

Research Limitations and Future Directions

Limitations

  1. AQI labels are derived from the OpenWeather API, so the models learn to reproduce its labeling rule rather than forming an independent physical model of air quality;
  2. The dataset covers only 2025, so no cross-year analysis is possible.
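
To make the first limitation concrete, the sketch below shows what a threshold-based labeling rule could look like. Everything in it is an assumption for illustration: the breakpoints are placeholder values, not OpenWeather's actual table, and the "worst pollutant decides the level" logic is assumed rather than confirmed by the source.

```python
import bisect

# PLACEHOLDER breakpoints (assumed units: µg/m³), chosen only for illustration.
# They are NOT OpenWeather's real thresholds; consult the API documentation for those.
BREAKPOINTS = {
    "pm2_5": [10, 20, 30, 40],
    "pm10": [25, 50, 75, 100],
}

def aqi_level(concentrations):
    """Assumed rule: each pollutant maps to a band 1-5, and the worst band wins."""
    levels = [
        bisect.bisect_right(BREAKPOINTS[name], concentrations[name]) + 1
        for name in BREAKPOINTS
    ]
    return max(levels)

print(aqi_level({"pm2_5": 5.0, "pm10": 10.0}))
print(aqi_level({"pm2_5": 22.0, "pm10": 30.0}))
print(aqi_level({"pm2_5": 5.0, "pm10": 120.0}))
```

If the labels are produced by a rule of this kind, they are a deterministic function of the pollutant inputs. That may help explain why tree-based models, which split on feature thresholds, reach very high accuracy, and why the result says little about predicting air quality from independent signals.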

Future Directions

  1. Introduce time-series models (LSTM, Transformer) to capture dynamic changes;
  2. Integrate meteorological data (temperature, humidity, wind speed);
  3. Expand the research scope to other developing countries in Southeast Asia.

Section 07

Conclusion and Implications

This study provides benchmark results for AQI classification using machine learning. The core conclusion is that gradient boosting models, especially LightGBM, offer the best balance of accuracy and efficiency for classifying structured environmental data, which is useful guidance for environmental monitoring departments with limited resources. The open-source implementation gives researchers a reproducible reference and demonstrates the potential of data science in environmental sustainability.