Classification of Air Quality Index in Philippine Cities: A Comparative Study of Four Machine Learning Models

A study on classifying the Air Quality Index (AQI) in Philippine urban environments. It compares four models (SVM, LightGBM, CatBoost, and an MLP neural network) on pollutant concentration data, with LightGBM performing best.

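Tags: Air Quality Index, Machine Learning, LightGBM, CatBoost, SVM, Neural Networks, Philippines, Environment, Data Science, Classification Algorithms, Gradient Boosting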

Published 2026-05-19 22:15 | Recent activity 2026-05-19 22:18 | Estimated read 8 min

Section 01

[Introduction] Classification of AQI in Philippine Cities: A Comparative Study of Four Machine Learning Models

This study focuses on the classification of Air Quality Index (AQI) in Philippine cities, comparing the performance of four models: Support Vector Machine (SVM), LightGBM, CatBoost, and Multi-Layer Perceptron (MLP) neural network. The key finding is that LightGBM performs best in the classification task of structured environmental sensor data, providing technical references for environmental monitoring departments and a reusable analytical framework for similar cities in other developing countries.

Section 02

Research Background and Motivation

Air pollution has become a major environmental and public health challenge in urban areas of the Philippines. Accelerated urbanization, together with industrial emissions, traffic exhaust, and other factors, has worsened air quality and affects residents' health. Traditional monitoring can provide real-time data but lacks intelligent analysis and prediction capabilities. This project focuses on AQI classification in major Philippine cities and compares the performance of different machine learning algorithms, aiming to provide technical references for environmental monitoring departments and an analytical framework for similar cities in other developing countries.

Section 03

Dataset and Model Methods

Dataset

The dataset is "PH Philippine Cities Air Quality Index Data 2025" from Kaggle. It contains monitoring records from multiple cities in 2025, preprocessed and merged into a unified dataset. Key variables include pollutant concentrations (CO, NO, NO2, O3, SO2, PM2.5, PM10, NH3), time features (month, day of the week, hour), and a geographic feature (city name). The target variable is the AQI class (levels 1-5), which makes this a multi-class problem.
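As a rough illustration of how the time and geographic features described above might be derived from a raw record, here is a minimal sketch. The field names (`timestamp`, `city`, `pm2_5`), the city list, and the one-hot encoding are assumptions made for the example; the source does not state the dataset's column names or the encoding used.

```python
from datetime import datetime


def time_features(timestamp: str) -> dict:
    """Derive month, day of the week (Monday=0), and hour from an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(timestamp)
    return {"month": ts.month, "day_of_week": ts.weekday(), "hour": ts.hour}


def one_hot_city(city: str, known_cities: list) -> dict:
    """One-hot encode the city name against a fixed list of known cities."""
    return {f"city_{name}": int(name == city) for name in known_cities}


# Hypothetical record; field names and values are illustrative only.
record = {"timestamp": "2025-03-15T14:00:00", "city": "Manila", "pm2_5": 18.4}
known_cities = ["Manila", "Cebu", "Davao"]  # assumed subset, for illustration

features = {
    "pm2_5": record["pm2_5"],
    **time_features(record["timestamp"]),
    **one_hot_city(record["city"], known_cities),
}
print(features)
```

Tree-based libraries such as CatBoost can also consume the city name directly as a categorical column, in which case the one-hot step would be unnecessary.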

Model Architecture

Four representative models were selected:

SVM: A baseline model with a solid theoretical foundation but low training efficiency on large-scale data;
LightGBM: A gradient boosting framework by Microsoft, optimized for training speed and memory efficiency using histogram algorithms;
CatBoost: An open-source library by Yandex, optimized for handling categorical features;
MLP Neural Network: A fully connected structure using Adam optimizer and cross-entropy loss.
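The four models listed above could be set up roughly as follows. This is a sketch, not the project's actual configuration: the seed value, the kernel, and the hidden layer sizes are assumptions, and the classes are imported lazily so that only the libraries actually needed (scikit-learn, lightgbm, catboost) have to be installed. scikit-learn's `MLPClassifier` minimizes log-loss, which is the cross-entropy loss mentioned in the text.

```python
import importlib

SEED = 42  # assumed value; the source only says a fixed seed is used

# name -> (module, class, constructor arguments); hyperparameters are illustrative
MODEL_SPECS = {
    "SVM": ("sklearn.svm", "SVC", {"kernel": "rbf", "random_state": SEED}),
    "LightGBM": ("lightgbm", "LGBMClassifier", {"random_state": SEED}),
    "CatBoost": ("catboost", "CatBoostClassifier",
                 {"random_seed": SEED, "verbose": 0}),
    "MLP": ("sklearn.neural_network", "MLPClassifier",
            {"solver": "adam", "hidden_layer_sizes": (64, 32),
             "random_state": SEED}),
}


def build_model(name: str):
    """Import the model class on demand and instantiate it with its arguments."""
    module_name, class_name, params = MODEL_SPECS[name]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(**params)


# Usage (requires the corresponding library to be installed):
# model = build_model("LightGBM")
# model.fit(X_train, y_train)
```

All four classes expose the same `fit` / `predict` interface, so a single training and evaluation loop can iterate over `MODEL_SPECS`.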

Experimental Workflow

The experiment follows a standard workflow: data acquisition → integration → cleaning → feature engineering → stratified sampling (70/15/15 train/validation/test split) → model training → evaluation (metrics such as accuracy and macro-averaged F1 score).
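The stratified 70/15/15 step can be sketched in plain Python as below. This is an illustration of the idea, not the project's code (which may well use a library helper): each class is shuffled and split separately so that the class ratios stay the same in all three subsets. The seed value and the toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy imbalanced label vector: 100 / 60 / 40 samples for classes 1, 2, 3
labels = [1] * 100 + [2] * 60 + [3] * 40
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 140 30 30
```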

Section 04

Experimental Results and Analysis

Performance comparison of each model:

Model	Accuracy	Macro-average F1
LightGBM	0.9969	0.9704
CatBoost	0.9926	0.9338
MLP Neural Network	0.9493	0.8381
SVM	0.8461	0.6010

Analysis: The gradient boosting models (LightGBM, CatBoost) clearly outperformed the others, with LightGBM best on both metrics; MLP performed moderately; SVM showed limitations on this imbalanced multi-class problem, and its macro-averaged F1 (0.6010) falls far below its accuracy (0.8461), which suggests weak performance on the less frequent classes. Feature importance analysis showed that pollutant concentrations (PM2.5, PM10, etc.) contributed significantly more than time and geographic features.
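The gap between accuracy and macro-averaged F1 in the table is typical of imbalanced data. The sketch below computes both metrics from scratch on a toy example (the labels and predictions are invented for illustration and are not the study's data): a classifier that mostly predicts the majority class keeps a high accuracy while its macro F1 drops, because macro averaging weights every class equally.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 90 samples of class 1, 10 of class 2; 8 minority samples are missed
y_true = [1] * 90 + [2] * 10
y_pred = [1] * 90 + [1] * 8 + [2] * 2

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")  # accuracy=0.9200 macro_f1=0.6454
```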

Section 05

Technical Implementation and Reproducibility

The project provides a complete reproducible solution:

Dependency management: requirements.txt defines Python dependencies;
Interactive Notebook: includes the full workflow from data download to evaluation;
Document output: PDF reports and research papers;
Result archiving: models, metrics, and visualizations are stored in the outputs directory.

A fixed random seed is used to ensure result reproducibility, and stratified sampling ensures consistent class ratios in the training/validation/test sets.

Section 06

Research Limitations and Future Directions

Limitations

AQI labels are derived from the OpenWeather API; the models therefore learn to reproduce that labeling rule rather than acting as independent physical models;
The dataset only covers 2025, lacking cross-year analysis.
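To make the first limitation concrete: if the AQI level is a deterministic function of pollutant concentrations, a classifier trained on those same concentrations is essentially recovering threshold rules. The sketch below shows what such a rule could look like. The breakpoints are illustrative placeholders and the "worst pollutant wins" aggregation is an assumption; neither is taken from the source or presented as the provider's actual table.

```python
from bisect import bisect_right

# Illustrative breakpoints only (concentration units assumed to be ug/m3).
# These are NOT claimed to be the provider's actual thresholds.
BREAKPOINTS = {
    "pm2_5": [10, 25, 50, 75],
    "pm10": [20, 50, 100, 200],
}


def aqi_level(concentrations: dict) -> int:
    """Map concentrations to a 1-5 level; the worst pollutant decides (assumed rule)."""
    return max(
        bisect_right(BREAKPOINTS[name], value) + 1
        for name, value in concentrations.items()
        if name in BREAKPOINTS
    )


print(aqi_level({"pm2_5": 30, "pm10": 15}))  # 3
```

Axis-aligned thresholds like these are exactly what decision-tree ensembles represent well, which is consistent with the strong results of the gradient boosting models and with pollutant concentrations dominating feature importance.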

Future Directions

Introduce time-series models (LSTM, Transformer) to capture dynamic changes;
Integrate meteorological data (temperature, humidity, wind speed);
Expand the research scope to other developing countries in Southeast Asia.

Section 07

Conclusion and Implications

This study provides benchmark results for AQI classification using machine learning. The core conclusion is that gradient boosting models, especially LightGBM, offer the best balance of accuracy and efficiency for classifying structured environmental data, which makes them a practical choice for environmental monitoring departments with limited resources. The open-source implementation gives researchers a reproducible reference and demonstrates the potential of data science in environmental sustainability.
