# Machine Learning Framework for Flood Prediction in Malaysia Based on NASA Satellite Data

> This project uses NASA POWER MERRA-2 satellite reanalysis data to build a flood and flash flood prediction system for 8 major cities in Malaysia, comparing the performance of three machine learning models: Decision Tree, Random Forest, and XGBoost.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-09T12:56:45.000Z
- Last activity: 2026-05-09T13:01:59.785Z
- Popularity: 163.9
- Keywords: flood prediction, machine learning, XGBoost, NASA satellite data, Malaysia, disaster warning, random forest, decision tree, meteorological data, class imbalance
- Page link: https://www.zingnex.cn/en/forum/thread/nasa
- Canonical: https://www.zingnex.cn/forum/thread/nasa
- Markdown source: floors_fallback

---

## Guide to the Machine Learning Framework for Flood Prediction in Malaysia Based on NASA Satellite Data

This study uses NASA POWER MERRA-2 satellite reanalysis data to build a flood and flash flood prediction system covering 8 major cities in Malaysia, and compares three models: Decision Tree, Random Forest, and XGBoost. Core findings: XGBoost performs best on both tasks (AUC-ROC of 0.9824 for floods and 0.9651 for flash floods); Johor Bahru has the highest flood risk of the 8 cities; flood risk during the Northeast Monsoon is 20 times higher than in other months; and 3-day rolling average rainfall is the key predictive feature. The results can support disaster warning, insurance pricing, and urban planning.

## Research Background and Significance

Malaysia's tropical monsoon climate brings frequent floods to the east coast from November to January each year. The severe floods of 2014 and 2021 caused casualties and billions of ringgit in losses. Traditional warnings rely on ground station data, which is sparse in remote areas. This study proposes a framework that combines satellite data with machine learning to fill gaps in ground observations and provide reliable warnings for resource-limited regions.

## Data Foundation and Feature Engineering

**Data Source**: NASA POWER MERRA-2 dataset (January 2010 to March 2026, about 16 years, 47,367 daily records), covering 8 cities (East Coast: Kota Bharu, Kuantan; South: Johor Bahru, Melaka; Central: Kuala Lumpur, Shah Alam; East Malaysia: Kuching, Kota Kinabalu).

**Feature Design**: Basic meteorology (temperature, humidity, wind speed), rolling statistics (3/7/14-day average rainfall, 7-day cumulative rainfall), and time features (month, monsoon indicator).

**Label Definition**: Flood (daily rainfall ≥50 mm, 1.43% of records) and flash flood (daily rainfall ≥80 mm, 0.48% of records). Both classes are extremely imbalanced. Note: same-day rainfall is used only to define the labels, never as an input feature, to avoid data leakage.
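The rolling features and labels described above can be sketched in plain Python. This is a minimal illustration, not the study's pipeline: the function names, the dictionary keys, and the choice to end every window on the previous day (so the labelled day's rainfall never enters a feature) are assumptions. Only the 50 mm and 80 mm thresholds and the 3/7/14-day window lengths come from the text.

```python
from statistics import fmean

FLOOD_MM = 50.0        # flood threshold stated in the article
FLASH_FLOOD_MM = 80.0  # flash flood threshold stated in the article
LONGEST_WINDOW = 14    # longest rolling window stated in the article


def trailing_window(values, day, size):
    """Return the `size` values strictly before index `day` (assumed windowing)."""
    return values[max(0, day - size):day]


def build_rows(rain_mm):
    """Build one feature/label row per day for a single city's rainfall series."""
    rows = []
    for day, rain_today in enumerate(rain_mm):
        if day < LONGEST_WINDOW:
            continue  # skip days without a full 14-day history
        last_7 = trailing_window(rain_mm, day, 7)
        rows.append({
            # Features: built from previous days only.
            "rain_avg_3d": fmean(trailing_window(rain_mm, day, 3)),
            "rain_avg_7d": fmean(last_7),
            "rain_avg_14d": fmean(trailing_window(rain_mm, day, 14)),
            "rain_sum_7d": sum(last_7),
            # Labels: the only place same-day rainfall is used.
            "flood": int(rain_today >= FLOOD_MM),
            "flash_flood": int(rain_today >= FLASH_FLOOD_MM),
        })
    return rows


# Hypothetical 16-day series (mm/day), made up for illustration.
example_rows = build_rows([0.0] * 11 + [30.0, 60.0, 90.0, 85.0, 10.0])
```

Here `example_rows` holds two rows, one for each day that has a full 14-day history. Whether the original study shifts its windows by one day in the same way is not stated in the source.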

## Model Methods and Experimental Design

**Compared Models**: Decision Tree (baseline; interpretable but prone to overfitting), Random Forest (an ensemble that reduces variance), and XGBoost (gradient boosting with regularization to control complexity).

**Evaluation Metrics**: Accuracy, precision, recall, F1-score, and AUC-ROC. Recall is prioritized in disaster warning because a missed flood costs more than a false alarm.
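To see why accuracy alone is a poor guide at this level of class imbalance, the threshold-based metrics can be computed by hand. The sketch below is illustrative only: the function name and the toy data (1,000 days with 14 flood days, chosen to sit close to the 1.43% flood rate reported above) are assumptions, not the study's evaluation code.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = flood)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # floods correctly alerted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed floods
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # quiet days left alone
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data: 14 flood days in 1,000, and a "model" that never raises an alert.
y_true = [1] * 14 + [0] * 986
never_alert = [0] * 1000
baseline = classification_metrics(y_true, never_alert)
```

The never-alert baseline reaches 98.6% accuracy while missing every flood (recall 0), which is why recall and F1 carry more weight than accuracy in this setting.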

## Experimental Results and Key Findings

**Flood Prediction (≥50 mm)**: XGBoost is optimal (F1=0.2979, AUC=0.9824); Decision Tree has high recall (80.49%) but low precision; Random Forest has high precision but low recall.

**Flash Flood Prediction (≥80 mm)**: XGBoost leads in AUC (0.9651).

**Key Findings**:

1. Johor Bahru has the highest flood risk (2.85%), 8 times that of Melaka.
2. Risk during the Northeast Monsoon is 20 times higher than in other months.
3. 3-day rolling average rainfall is the most important feature.
4. All models have AUC ≥0.88, indicating that they learned real patterns.
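A high AUC next to a modest F1 is less surprising once AUC is read as a ranking measure: it is the probability that a randomly chosen flood day receives a higher score than a randomly chosen non-flood day, independent of any alert threshold. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are made up for illustration and are not results from the study.

```python
def auc_roc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This O(n_pos * n_neg) loop is fine
    for small examples but slow on large datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores: 8 non-flood days followed by 2 flood days (illustrative only).
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.30, 0.20, 0.45]

toy_auc = auc_roc(toy_labels, toy_scores)
alerts_at_half = sum(1 for s in toy_scores if s >= 0.5)
```

In this toy case 15 of the 16 positive/negative pairs are ordered correctly, so `toy_auc` is 0.9375, yet no score reaches 0.5 and a default threshold would raise no alerts at all. Good ranking therefore does not guarantee good precision or recall at a particular cut-off.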

## Practical Application Value

1. **Disaster Warning**: Integrate the model into the NADMA system and use real-time satellite data to predict high-risk areas in advance, supporting evacuation and resource allocation.
2. **Insurance and Finance**: Use the predictions for risk assessment and premium pricing, combined with GIS to draw detailed risk maps.
3. **Urban Planning**: Guide flood control infrastructure (e.g., strengthening drainage systems in Johor Bahru) and focus maintenance on the period before the monsoon.
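For the warning use case, a model's scores still have to be turned into alerts. One possible approach, which the source does not describe, is to choose the alert threshold on held-out data so that a target recall is met. The sketch below assumes that alerts fire when a score is at or above the threshold; the function name, the target value, and the toy data are all hypothetical.

```python
import math


def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall on this data meets the target.

    Alerts are assumed to fire when score >= threshold.
    """
    pos_scores = sorted(
        (s for t, s in zip(y_true, scores) if t == 1), reverse=True
    )
    if not pos_scores:
        raise ValueError("need at least one positive example")
    # Number of flood days that must be caught to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(pos_scores)))
    return pos_scores[needed - 1]


# Toy validation data: 5 flood days and 3 non-flood days (illustrative only).
val_labels = [1, 1, 1, 1, 1, 0, 0, 0]
val_scores = [0.9, 0.7, 0.6, 0.4, 0.2, 0.5, 0.3, 0.1]

alert_threshold = threshold_for_recall(val_labels, val_scores, target_recall=0.8)
alerts = [int(s >= alert_threshold) for s in val_scores]
```

With these toy values the threshold is 0.4, which catches 4 of the 5 flood days at the cost of one false alarm. A threshold chosen this way should be checked on separate data before it is used operationally.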

## Limitations and Improvement Directions

**Limitations**: MERRA-2 has a spatial resolution of about 50 km, which is too coarse for small-scale events; there are only 226 flash flood samples; terrain and other geographic factors are not considered.

**Improvements**: Integrate GPM high-resolution data; try LSTM/Transformer time-series models; fuse multi-source data (radar, water level, social media); develop real-time API services.

## Research Conclusions

This study demonstrates the potential of combining satellite data with machine learning for flood prediction. The XGBoost model performs best and is suitable as the core algorithm for warning systems. The results provide a replicable approach for Malaysia and for countries with similar climates, and are expected to help reduce flood losses.
