# Estimating Global Surface Water Fraction from Satellite Microwave Data: An Analysis of an End-to-End Machine Learning Framework

> An end-to-end machine learning framework for estimating global Surface Water Fraction (SWF) from passive microwave radiometer data. It covers data preprocessing, exploratory analysis, model selection, hyperparameter optimization, and SHAP interpretability analysis.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-27T04:45:28.000Z
- Last activity: 2026-05-27T04:50:15.036Z
- Popularity: 152.9
- Keywords: surface water fraction, passive microwave, machine learning, remote sensing, WindSat, CIMR, SHAP, hyperparameter optimization, earth science
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-marcvem2aed-ml-framework-for-swf-retrieval
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-marcvem2aed-ml-framework-for-swf-retrieval
- Markdown source: floors_fallback

---

## Introduction

This article introduces an end-to-end machine learning framework for estimating global Surface Water Fraction (SWF) from passive microwave radiometer data. The framework covers data preprocessing, exploratory analysis, model selection, hyperparameter optimization, and SHAP interpretability analysis. It uses WindSat radiometer brightness temperature (TB) data as a proxy for the future Copernicus Imaging Microwave Radiometer (CIMR) mission and is intended as a reference for CIMR data processing. The source code is available on GitHub (https://github.com/marcvem2AED/ML-framework-for-SWF-retrieval) and was released on May 27, 2026.

## Research Background and Significance

Surface Water Fraction (SWF) is a key geophysical variable in flood monitoring, hydrological research, and climate studies. Traditional SWF estimation methods rely on physical models and usually treat each observation in isolation, one location and one time at a time. As the CIMR mission advances, more passive microwave data will become available. This framework uses WindSat brightness temperature data as a proxy to develop a complete machine learning solution that can serve as a reference for CIMR data processing.

## Data Sources and Preprocessing

**Core Datasets**:

1. WindSat Daily TB Maps: provided by Remote Sensing Systems, 18.7 GHz and 37 GHz channels, 0.25° resolution.
2. LPDR v3.1: provided by NSIDC/NTSG, including global daily SWF and auxiliary data such as soil moisture and vegetation optical depth.

**Preprocessing Workflow**:

1. Projection conversion: reproject LPDR from EASE-Grid v1 to WindSat's 0.25° geographic grid.
2. Data fusion: merge the datasets into Parquet format.
3. Feature engineering: calculate surface emissivity, atmospheric correction factors, and physical-model SWF estimates that serve as benchmarks.
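To make the feature-engineering step more concrete, here is a minimal sketch of how an emissivity feature and a simple physical SWF estimate might be computed. It assumes emissivity can be approximated as brightness temperature divided by surface temperature (atmospheric effects ignored) and that a pixel's emissivity is a linear mix of land and water end members. The function names and the end-member values `E_LAND` and `E_WATER` are illustrative assumptions, and this is not the framework's Difference Ratio formula.

```python
import numpy as np

# Hypothetical end-member emissivities, chosen for illustration only.
E_LAND = 0.95
E_WATER = 0.50


def emissivity(tb_kelvin, surface_temp_kelvin):
    """Approximate emissivity as TB / Ts, ignoring atmospheric contributions."""
    tb = np.asarray(tb_kelvin, dtype=float)
    ts = np.asarray(surface_temp_kelvin, dtype=float)
    return tb / ts


def swf_linear_mixing(e_obs, e_land=E_LAND, e_water=E_WATER):
    """Solve e_obs = f * e_water + (1 - f) * e_land for the water fraction f."""
    f = (e_land - np.asarray(e_obs, dtype=float)) / (e_land - e_water)
    # Clip to the physically meaningful range [0, 1].
    return np.clip(f, 0.0, 1.0)


if __name__ == "__main__":
    e = emissivity([285.0, 217.5, 150.0], 300.0)
    print(swf_linear_mixing(e))
```

A physical estimate of this kind can be stored alongside the raw channels, both as an input feature and as a baseline to compare learned models against.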

## Model Development Workflow

The framework adopts a structured sequential modeling strategy:
1. **Physical Benchmark Model**: Evaluate the performance of the Difference Ratio (DR) formula on the 2018 test set;
2. **Data Scaling Study**: Compare 6 schemes including zero-value removal, Box-Cox transformation, and feature standardization;
3. **Model Selection**: Benchmark gradient boosting trees (XGBoost, LightGBM, CatBoost) and linear models (Ridge, ElasticNet), and screen via lightweight hyperparameter optimization;
4. **Feature Engineering**: Evaluate 18 candidate feature sets, prune using forward selection, RFECV, and SHAP analysis;
5. **Hyperparameter Optimization**: Optuna Bayesian optimization (200 trials, 5-fold cross-validation);
6. **Interpretability Analysis**: SHAP global feature importance, beeswarm plots, dependence plots, and local attribution;
7. **Error Analysis**: Residual diagnosis, spatial error heatmaps, time-series errors, and stratified analysis.
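Several of the steps above depend on cross-validated comparisons. As an illustration of the forward-selection idea in step 4, the following sketch greedily adds the feature that most reduces 5-fold cross-validated RMSE. To stay dependent on NumPy alone, it swaps in closed-form ridge regression for the gradient-boosted models the framework benchmarks. All function names and the `min_gain` stopping threshold are assumptions made for this example.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    A = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(A, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ w + y_mean


def cv_rmse(X, y, n_folds=5, seed=0):
    """Mean RMSE over shuffled k-fold splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pred = ridge_fit_predict(X[train], y[train], X[test])
        scores.append(np.sqrt(np.mean((pred - y[test]) ** 2)))
    return float(np.mean(scores))


def forward_select(X, y, min_gain=1e-4):
    """Greedily add the feature that lowers CV RMSE most; stop on small gains."""
    selected, remaining = [], list(range(X.shape[1]))
    # Baseline score: always predict the mean of y.
    best = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
    while remaining:
        scores = {j: cv_rmse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best
```

The same `cv_rmse` scoring function could in principle serve as the objective for a hyperparameter search, with the model's settings as the search variables instead of the feature subset.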

## Spatio-Temporal Context Extension Experiments

The framework also explores whether spatial neighborhood and temporal history information improve prediction accuracy. A neural network architecture processes a context window around each target pixel, and experiments evaluate how much value this spatio-temporal modeling adds.
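A context window of this kind can be sketched as a slice of a (time, lat, lon) data cube. The example below is a minimal illustration and not the framework's implementation: the function name, the `radius` and `history` parameters, and the choice to fill out-of-range cells with NaN are all assumptions.

```python
import numpy as np


def context_window(cube, t, row, col, radius=1, history=2):
    """Return a (history + 1, 2*radius + 1, 2*radius + 1) window ending at time t.

    cube is a (time, lat, lon) array. Cells outside the grid, or before the
    first time step, are filled with NaN.
    """
    size = 2 * radius + 1
    # Pad earlier time steps and all four spatial edges with NaN.
    padded = np.pad(
        cube.astype(float),
        ((history, 0), (radius, radius), (radius, radius)),
        constant_values=np.nan,
    )
    # After padding, original index (t, row, col) sits at
    # (t + history, row + radius, col + radius).
    return padded[t : t + history + 1, row : row + size, col : col + size]
```

Padding the whole cube on every call is wasteful for large grids; a real pipeline would pad once and slice repeatedly. On a global grid, wrapping the longitude axis may also be more appropriate than NaN padding.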

## Technical Implementation Details

**Experimental Environment**: CPU (Intel Core i5-14600KF), Memory (32 GB DDR5), GPU (NVIDIA GeForce RTX 5060 Ti), Storage (1 TB NVMe SSD), OS (Windows 11 Home).

**Dependency Environment**: The framework uses two independent conda environments. The main environment includes numpy, pandas, xarray, scikit-learn, matplotlib, xgboost, optuna, shap, and others; the dedicated GDAL environment includes gdal and rasterio (installed via conda-forge).

**Usage**: Run the notebooks in order: 1-Data Preprocessing → 2-Exploratory Analysis → 3-Model Training → 4-Spatio-Temporal Context Experiments. Time split: 2017 for training, 2018 for testing.
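The year-based split can be expressed as a pair of boolean masks over the observation dates. The sketch below assumes dates are available as ISO strings or `datetime64` values; the function name `split_by_year` is hypothetical and not taken from the notebooks.

```python
import numpy as np


def split_by_year(dates, train_year=2017, test_year=2018):
    """Return boolean masks selecting training and test rows by calendar year."""
    days = np.asarray(dates, dtype="datetime64[D]")
    # datetime64[Y] counts years since 1970, so add the epoch back.
    years = days.astype("datetime64[Y]").astype(int) + 1970
    return years == train_year, years == test_year


if __name__ == "__main__":
    dates = ["2016-12-31", "2017-01-01", "2017-12-31", "2018-06-15"]
    train_mask, test_mask = split_by_year(dates)
    print(train_mask, test_mask)
```

Splitting by year instead of at random keeps the test period entirely after the training period, which is closer to how a retrieval model would be applied to new observations.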

## Summary and Insights

The framework's core strengths are: 1. Full workflow coverage (from raw data to model deployment); 2. Interpretability as a priority (SHAP analysis throughout); 3. Systematic verification (each design decision is checked experimentally); 4. Practical orientation (the time-based split and the error analysis reflect deployment scenarios). It is a useful reference for developers working on remote sensing data analysis and machine learning applications in the earth sciences.
