# Classification of Air Quality Index in Philippine Cities: A Comparative Study of Four Machine Learning Models

> A study on classifying Air Quality Index (AQI) levels in Philippine urban environments from pollutant concentration data. It compares four models (SVM, LightGBM, CatBoost, and an MLP neural network), with LightGBM showing the best performance.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-19T14:15:15.000Z
- Last activity: 2026-05-19T14:18:31.478Z
- Popularity: 154.9
- Keywords: Air Quality Index, machine learning, LightGBM, CatBoost, SVM, neural network, Philippines, environmental data science, classification algorithms, gradient boosting
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-averagecoder-byte-ph-aqi-classification-ml
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-averagecoder-byte-ph-aqi-classification-ml
- Markdown source: floors_fallback

---

## Introduction

This study classifies Air Quality Index (AQI) levels in Philippine cities and compares four models: Support Vector Machine (SVM), LightGBM, CatBoost, and a Multi-Layer Perceptron (MLP) neural network. The key finding is that LightGBM performs best on this structured environmental sensor data. The results offer a technical reference for environmental monitoring departments and a reusable analytical framework for similar cities in other developing countries.

## Research Background and Motivation

Air pollution has become a major environmental and public health challenge in urban areas of the Philippines. Rapid urbanization has worsened air quality through industrial emissions, traffic exhaust, and other sources, with direct effects on residents' health. Traditional monitoring provides real-time measurements but offers little automated analysis or prediction. This project therefore models AQI classification for major Philippine cities with machine learning and compares how different algorithms perform on the task.

## Dataset and Model Methods

### Dataset
The dataset is "PH Philippine Cities Air Quality Index Data 2025" from Kaggle. It contains 2025 monitoring records for multiple cities, which were preprocessed and merged into a single dataset. The input variables are pollutant concentrations (CO, NO, NO2, O3, SO2, PM2.5, PM10, NH3), time features (month, day of the week, hour), and a geographic feature (city name). The target variable is the AQI class (levels 1-5), which makes this a multi-class classification problem.
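The feature set described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's actual preprocessing: the column names (`datetime`, `city`, `pm2_5`, and so on), the ISO timestamp format, and the integer city encoding are all assumptions made for the example.

```python
from datetime import datetime

# Assumed column names for the eight pollutant concentrations.
POLLUTANTS = ["co", "no", "no2", "o3", "so2", "pm2_5", "pm10", "nh3"]


def build_features(record, city_index):
    """Turn one raw record (a dict) into a flat feature dict.

    `city_index` maps city names to integer ids and is extended in place
    whenever a new city is seen (a simple, hypothetical encoding).
    """
    ts = datetime.fromisoformat(record["datetime"])
    features = {name: float(record[name]) for name in POLLUTANTS}
    features["month"] = ts.month            # 1-12
    features["day_of_week"] = ts.weekday()  # 0 = Monday ... 6 = Sunday
    features["hour"] = ts.hour              # 0-23
    features["city_id"] = city_index.setdefault(record["city"], len(city_index))
    return features
```

Gradient boosting libraries can often consume the city as a categorical feature directly, so the integer id here is only a stand-in for whatever encoding the notebook uses.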

### Model Architecture
Four representative models were selected:
1. **SVM**: A baseline model that is well founded theoretically but slow to train on large datasets;
2. **LightGBM**: A gradient boosting framework from Microsoft, optimized for training speed and memory efficiency through histogram-based algorithms;
3. **CatBoost**: An open-source gradient boosting library from Yandex, optimized for handling categorical features;
4. **MLP Neural Network**: A fully connected network trained with the Adam optimizer and cross-entropy loss.

### Experimental Workflow
The experiment follows a standard workflow: data acquisition → integration → cleaning → feature engineering → stratified splitting into training, validation, and test sets (70/15/15) → model training → evaluation with metrics such as accuracy and macro-averaged F1 score.
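The stratified 70/15/15 split can be illustrated with a small standard-library sketch. It is not the project's code: the function name `stratified_split`, the seed value 42, and the rounding rule are assumptions. In practice a library helper such as scikit-learn's `train_test_split` with the `stratify` argument is the usual choice.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)  # fixed seed, so the split is identical on every run
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Splitting each class separately matters most for the rare AQI levels: a plain random split could leave the validation or test set with very few samples of a minority class, which makes macro-averaged metrics unstable.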

## Experimental Results and Analysis

Performance comparison of the four models:

| Model | Accuracy | Macro-averaged F1 |
|-------|----------|-------------------|
| LightGBM | 0.9969 | 0.9704 |
| CatBoost | 0.9926 | 0.9338 |
| MLP Neural Network | 0.9493 | 0.8381 |
| SVM | 0.8461 | 0.6010 |

Gradient boosting models (LightGBM, CatBoost) clearly led, with LightGBM the best overall. The MLP performed moderately. SVM struggled with the imbalanced multi-class problem: its macro-averaged F1 (0.6010) falls well below its accuracy (0.8461). Feature importance analysis showed that pollutant concentrations (PM2.5, PM10, etc.) contributed significantly more than the time and geographic features.
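The gap between accuracy and macro-averaged F1 in the table is easier to read with both metrics written out. The sketch below uses only the standard library, and its toy labels are invented for illustration; they are not taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: 90 samples of class 1, 10 of class 5; most class-5 samples are missed.
y_true = [1] * 90 + [5] * 10
y_pred = [1] * 98 + [5] * 2
toy_accuracy = accuracy(y_true, y_pred)
toy_macro_f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy stays high because the majority class dominates, while macro-averaged F1 drops sharply because the minority class is mostly misclassified. A similar pattern is one plausible reading of a model whose accuracy is far above its macro-averaged F1.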

## Technical Implementation and Reproducibility

The project provides a complete reproducible solution:
- Dependency management: `requirements.txt` defines the Python dependencies;
- Interactive notebook: covers the full workflow from data download to evaluation;
- Document output: PDF reports and research papers;
- Result archiving: models, metrics, and visualizations are stored in the `outputs` directory.

A fixed random seed is used to ensure result reproducibility, and stratified sampling ensures consistent class ratios in the training/validation/test sets.

## Research Limitations and Future Directions

### Limitations
1. AQI labels are derived from the OpenWeather API, so the models learn to reproduce that labelling rule rather than acting as an independent physical model of air quality;
2. The dataset only covers 2025, lacking cross-year analysis.
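
To make the first limitation concrete, here is a hedged sketch of what a rule-derived label can look like. It assumes the label is the worst per-pollutant level under a threshold lookup. Both that rule and the breakpoint values are placeholders for illustration, not the provider's documented table.

```python
from bisect import bisect_right

# HYPOTHETICAL breakpoints in micrograms per cubic metre (illustrative only).
# Each list holds the lower bounds of levels 2, 3, 4 and 5.
BREAKPOINTS = {
    "pm2_5": [10, 25, 50, 75],
    "pm10": [20, 50, 100, 200],
}


def sub_index(pollutant, value):
    """Level 1-5 for a single pollutant, via a threshold lookup."""
    return bisect_right(BREAKPOINTS[pollutant], value) + 1


def aqi_level(sample):
    """Overall level = worst (highest) level among the known pollutants."""
    return max(
        sub_index(name, value)
        for name, value in sample.items()
        if name in BREAKPOINTS
    )
```

If the real labels follow a rule of this shape, a classifier trained on the same pollutant columns is mostly rediscovering the thresholds. Under that assumption, very high scores say more about how well a model can recover a known rule than about its ability to predict air quality from independent signals.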

### Future Directions
1. Introduce time-series models (LSTM, Transformer) to capture dynamic changes;
2. Integrate meteorological data (temperature, humidity, wind speed);
3. Expand the research scope to other developing countries in Southeast Asia.

## Conclusion and Implications

This study provides benchmark results for AQI classification with machine learning. Its core conclusion is that gradient boosting models, especially LightGBM, offer the best balance of accuracy and efficiency for classifying structured environmental data, which makes them a practical choice for environmental monitoring departments with limited resources. The open-source implementation gives researchers a reproducible reference and illustrates how data science can support environmental sustainability.
