Zing Forum


Estimating Global Surface Water Fraction from Satellite Microwave Data: An Analysis of an End-to-End Machine Learning Framework

An end-to-end machine learning framework for estimating global Surface Water Fraction (SWF) from passive microwave radiometer data. It covers the full workflow, including data preprocessing, exploratory analysis, model selection, hyperparameter optimization, and SHAP interpretability analysis.

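Tags: Surface Water Fraction, Passive Microwave, Machine Learning, Remote Sensing, WindSat, CIMR, SHAP, Hyperparameter Optimization, Earth Science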
Published 2026-05-27 12:45 | Recent activity 2026-05-27 12:50 | Estimated read 7 min

Section 01

[Introduction] Analysis of an End-to-End Machine Learning Framework for Estimating Global Surface Water Fraction from Satellite Microwave Data

This article introduces an end-to-end machine learning framework for estimating global Surface Water Fraction (SWF) from passive microwave radiometer data. The framework covers the full workflow, including data preprocessing, exploratory analysis, model selection, hyperparameter optimization, and SHAP interpretability analysis. It uses WindSat radiometer brightness temperature (TB) data as a proxy for the future Copernicus Imaging Microwave Radiometer (CIMR) mission, so it can serve as a reference for CIMR data processing. The source code was released on May 27, 2026 and is available on GitHub: https://github.com/marcvem2AED/ML-framework-for-SWF-retrieval


Section 02

Research Background and Significance

Surface Water Fraction (SWF) is a key geophysical variable for flood monitoring, hydrological research, and climate studies. Traditional SWF estimation methods rely on physical models and usually process each observation in isolation, as a single point in space and time. As the CIMR mission advances, considerably more passive microwave radiometer data will become available. This framework uses WindSat radiometer brightness temperature data as a proxy to develop a complete machine learning solution that can serve as a reference for CIMR data processing.


Section 03

Data Sources and Preprocessing

Core Datasets: 1. WindSat daily brightness temperature (TB) maps (provided by Remote Sensing Systems, 18.7 GHz and 37 GHz channels, 0.25° resolution); 2. LPDR v3.1 (provided by NSIDC/NTSG, containing global daily SWF plus auxiliary data such as soil moisture and vegetation optical depth).

Preprocessing Workflow: reprojection (LPDR is resampled from EASE-Grid v1 onto WindSat's 0.25° geographic grid), data fusion (the two sources are merged and stored in Parquet format), and feature engineering (surface emissivity, atmospheric correction factors, and physical-model SWF estimates that serve as benchmarks).
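The data fusion step can be sketched as a keyed join once both sources share the same grid. This is a minimal illustration, not the repository's code: the column names (`date`, `lat`, `lon`, `tb18v`, `tb18h`, `tb37v`, `tb37h`, `swf`) are hypothetical, and the polarization ratio is used here only as a simple stand-in feature, since the article does not give the formulas for its emissivity or atmospheric correction features.

```python
import pandas as pd

# Hypothetical join keys; both tables are assumed to be on the same 0.25° grid.
KEYS = ["date", "lat", "lon"]


def fuse_and_add_features(tb: pd.DataFrame, lpdr: pd.DataFrame) -> pd.DataFrame:
    """Join TB and LPDR rows per grid cell and day, then add a simple feature."""
    # validate= raises if a (date, lat, lon) key is duplicated on either side.
    merged = tb.merge(lpdr, on=KEYS, how="inner", validate="one_to_one")
    for band in ("18", "37"):
        v, h = merged[f"tb{band}v"], merged[f"tb{band}h"]
        # Polarization ratio (illustrative feature, not taken from the article).
        merged[f"pr{band}"] = (v - h) / (v + h)
    return merged


# Tiny made-up tables, only to show the shapes involved.
tb = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01"]),
    "lat": [10.125, 10.375],
    "lon": [20.125, 20.125],
    "tb18v": [280.0, 260.0], "tb18h": [270.0, 220.0],
    "tb37v": [282.0, 265.0], "tb37h": [275.0, 235.0],
})
lpdr = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-01"]),
    "lat": [10.125, 10.375],
    "lon": [20.125, 20.125],
    "swf": [0.02, 0.35],
})

fused = fuse_and_add_features(tb, lpdr)
# fused.to_parquet("fused.parquet")  # requires pyarrow or fastparquet
```

In practice, joining on integer grid row and column indices is usually safer than joining on floating-point coordinates, which can differ in the last digits between products.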


Section 04

Model Development Workflow

The framework adopts a structured sequential modeling strategy:

  1. Physical Benchmark Model: Evaluate the performance of the Difference Ratio (DR) formula on the 2018 test set;
  2. Data Scaling Study: Compare six schemes, including zero-value removal, Box-Cox transformation, and feature standardization;
  3. Model Selection: Benchmark gradient boosting trees (XGBoost, LightGBM, CatBoost) against linear models (Ridge, ElasticNet), and shortlist candidates with a lightweight hyperparameter search;
  4. Feature Engineering: Evaluate 18 candidate feature sets and prune them using forward selection, RFECV, and SHAP analysis;
  5. Hyperparameter Optimization: Bayesian optimization with Optuna (200 trials, 5-fold cross-validation);
  6. Interpretability Analysis: Global SHAP feature importance, beeswarm plots, dependence plots, and local attribution;
  7. Error Analysis: Residual diagnosis, spatial error heatmaps, time-series errors, and stratified analysis.

Section 05

Spatio-Temporal Context Extension Experiments

The framework also explores whether spatial neighborhood and temporal history information can improve prediction accuracy. A neural network processes a context window around the target pixel, and experiments evaluate how much value this spatio-temporal modeling adds.
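
A context window of this kind can be sketched with NumPy. The window sizes (`half_width`, `history`), the `(time, lat, lon)` array layout, and the NaN padding are assumptions made for illustration; the article does not state the window dimensions or the network architecture.

```python
import numpy as np


def extract_context(cube, t, i, j, half_width=1, history=2):
    """Return a (history + 1, k, k) patch with k = 2 * half_width + 1.

    The patch ends at time index t and is centred on pixel (i, j) of a
    (time, lat, lon) array. Cells outside the grid, or before the first
    time step, are filled with NaN. Longitude wrap-around is not handled.
    """
    k = 2 * half_width + 1
    patch = np.full((history + 1, k, k), np.nan, dtype=float)
    _, n_i, n_j = cube.shape

    # Clip the spatial window to the grid and find where it lands in the patch.
    i0, i1 = max(i - half_width, 0), min(i + half_width + 1, n_i)
    j0, j1 = max(j - half_width, 0), min(j + half_width + 1, n_j)
    pi0, pj0 = i0 - (i - half_width), j0 - (j - half_width)

    for step in range(history + 1):
        tt = t - history + step
        if tt < 0:
            continue  # no observation this far back; leave NaN
        patch[step, pi0:pi0 + (i1 - i0), pj0:pj0 + (j1 - j0)] = cube[tt, i0:i1, j0:j1]
    return patch


cube = np.arange(4 * 5 * 6, dtype=float).reshape(4, 5, 6)
interior = extract_context(cube, t=3, i=2, j=3)
corner = extract_context(cube, t=0, i=0, j=0)
```

Each patch can then be flattened or passed as a multi-channel input to a small network, with the NaN cells masked or imputed first.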


Section 06

Technical Implementation Details

Experimental Environment: CPU (Intel Core i5-14600KF), Memory (32 GB DDR5), GPU (NVIDIA GeForce RTX 5060 Ti), Storage (1 TB NVMe SSD), OS (Windows 11 Home).

Dependency Environment: The project uses two independent conda environments. The main environment includes numpy, pandas, xarray, scikit-learn, matplotlib, xgboost, optuna, shap, and related packages; a dedicated GDAL environment includes gdal and rasterio (installed via conda-forge).

Usage: Run the notebooks in order: 1-Data Preprocessing → 2-Exploratory Analysis → 3-Model Training → 4-Spatio-Temporal Context Experiments. Time split: 2017 for training, 2018 for testing.
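
The year-based split can be sketched in a few lines of pandas. The `date` column name and the helper `split_by_year` are hypothetical; only the 2017/2018 assignment comes from the article.

```python
import pandas as pd


def split_by_year(df, train_year=2017, test_year=2018, date_col="date"):
    """Split rows into train and test sets by calendar year of date_col."""
    years = pd.to_datetime(df[date_col]).dt.year
    return df[years == train_year], df[years == test_year]


# Made-up rows, only to demonstrate the split.
samples = pd.DataFrame({
    "date": ["2017-03-01", "2017-11-15", "2018-01-02", "2018-07-30", "2019-01-01"],
    "swf": [0.10, 0.12, 0.08, 0.30, 0.05],
})
train, test = split_by_year(samples)
```

Splitting by year rather than at random keeps the test period entirely unseen during training, which is closer to how a deployed model would be applied to new observations.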


Section 07

Summary and Insights

Core values of the framework: 1. Full workflow coverage (from raw data to model deployment); 2. Interpretability first (SHAP analysis throughout); 3. Systematic verification (each design decision is checked experimentally); 4. Practical orientation (the time split and error analysis reflect deployment scenarios). The framework is a useful reference for developers working on remote sensing data analysis and machine learning applications in the earth sciences.