Zing Forum


When Satellites Meet Rivers: Predicting Urban River Water Quality Using Machine Learning and Sentinel-2 Data

This article introduces a study combining Sentinel-2 Earth observation data with machine learning to predict water quality parameters of the Roding River in London by analyzing watershed-scale spectral and land cover features. It demonstrates the application potential and limitations of remote sensing technology in urban water environment monitoring.

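Tags: Sentinel-2, machine learning, water quality monitoring, remote sensing, random forest, SHAP, interpretability, Earth observation, environmental monitoring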
Published 2026-05-24 20:15 | Recent activity 2026-05-24 20:21 | Estimated read 9 min

Section 01

[Introduction] When Satellites Meet Rivers: Core Research on Predicting Urban River Water Quality with Sentinel-2 and Machine Learning

Core Point: A team from University College London (UCL) combined Sentinel-2 Earth observation data with machine learning (Random Forest and Ridge Regression) to indirectly predict water quality parameters (e.g., conductivity, sodium concentration, pH) of the Roding River in London from watershed-scale spectral and land cover features. The study uses the SHAP method to explain the models, identifies both the approach's potential (low cost, wide coverage) and its limitations (signal attenuation in narrow channels, model failure under tidal influence), and stresses the importance of knowing where the model's scientific validity ends.

Original Author Info: James Ge (UCL Department of Earth Sciences). Project Source: GitHub (Sentinel2-Roding-Water-Quality-ML). Publication Date: May 24, 2026.

Section 02

Research Background: Why Monitor Urban Rivers from Space?

Research Background

Water is the lifeline of civilization, but urbanization alters the hydrochemistry of urban rivers. Traditional monitoring relies on on-site sampling, which is accurate but hard to scale to wide areas or to high-frequency monitoring. The Sentinel-2 mission (10 m resolution in its visible and near-infrared bands, 5-day revisit cycle) has greatly expanded what environmental monitoring can cover, but urban rivers are often only 10-30 m wide, that is, one to three pixels across, so a clean spectral signal from the channel itself is difficult to obtain.

Research Idea: Infer water quality indirectly by analyzing spectral features of the watershed surrounding the river, combining remote sensing with machine learning.

Section 03

Study Area: Urbanization Gradient and Sampling Design of the Roding River in London

Study Area and Sampling Design

The Roding River flows from Loughton in Essex to Barking Creek, where it joins the River Thames. Along the way it passes through an urbanization gradient: semi-natural woodland in the upper reaches, suburban residential areas in the middle reaches, and industrialized urban areas in the lower reaches.

Sampling: Samples were collected at 38 points during the summer dry season (Aug-Oct 2025) and the winter wet season (Dec 2025-Jan 2026), with 15 points undergoing ICP-OES elemental analysis (sodium, calcium, etc.).

Special Treatment: Estuarine sites (influenced by Thames tides, conductivity >1800 µS/cm) were excluded from the training set and used instead for out-of-domain evaluation of the model's boundaries.
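The special treatment of estuarine sites can be sketched as a simple partition of the sample list. This is a minimal illustration, not the project's code: the site IDs and conductivity readings are invented, and using the 1800 µS/cm figure as the sole splitting rule is an assumption (the study may identify estuarine sites by location as well).

```python
# Hypothetical threshold taken from the article's description of estuarine sites.
ESTUARINE_THRESHOLD_US_CM = 1800.0

# Invented example records; real data would come from the field campaign.
samples = [
    {"site": "R01", "conductivity_us_cm": 520.0},
    {"site": "R02", "conductivity_us_cm": 760.0},
    {"site": "R03", "conductivity_us_cm": 1100.0},
    {"site": "R04", "conductivity_us_cm": 5400.0},  # tidally influenced
]


def split_by_domain(records, threshold=ESTUARINE_THRESHOLD_US_CM):
    """Separate freshwater training sites from out-of-domain estuarine sites."""
    train = [r for r in records if r["conductivity_us_cm"] <= threshold]
    out_of_domain = [r for r in records if r["conductivity_us_cm"] > threshold]
    return train, out_of_domain


train, out_of_domain = split_by_domain(samples)
print("training sites:", [r["site"] for r in train])
print("out-of-domain sites:", [r["site"] for r in out_of_domain])
```

Keeping the excluded sites in a separate list, rather than discarding them, is what makes the later out-of-domain test possible.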

Section 04

Methodology: From Sentinel-2 Data to Machine Learning Models

Methodology

  1. Data Preprocessing: Use Sentinel-2 Level-2A surface reflectance data, cropped to the Roding River watershed.
  2. Spectral Indices: Select three indices, NDVI (vegetation density), NDWI (water body identification), and NDBI (built-up, impervious surfaces), and combine them with season (summer/winter) and along-river position variables to form 7 features.
  3. Models and Validation: Compare Random Forest (200 trees) and Ridge Regression using leave-one-out cross-validation, chosen because the sample is too small to set aside a representative test set.
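The three spectral indices in step 2 are all normalized differences of two bands. The sketch below uses the common definitions with Sentinel-2 bands B3 (green), B4 (red), B8 (near-infrared), and B11 (shortwave infrared). The article does not say which NDWI variant or which bands the project uses, so the green/NIR form of NDWI and the band choices here are assumptions, and the reflectance values are invented.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None


def spectral_indices(green, red, nir, swir):
    """Compute NDVI, NDWI, and NDBI from surface reflectance values.

    Assumed band mapping: green=B3, red=B4, nir=B8, swir=B11.
    """
    return {
        "ndvi": normalized_difference(nir, red),    # high over dense vegetation
        "ndwi": normalized_difference(green, nir),  # high over open water
        "ndbi": normalized_difference(swir, nir),   # high over built-up surfaces
    }


# Invented reflectances for a vegetated pixel.
vegetated = spectral_indices(green=0.08, red=0.05, nir=0.40, swir=0.20)
print(vegetated)
```

For a vegetated pixel like this one, NDVI comes out positive while NDWI and NDBI come out negative, which matches how the three indices are read in the list above.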

Section 05

SHAP Interpretability: Opening the Black Box of Machine Learning

SHAP Interpretability

In environmental science, interpreting a model can be more critical than its accuracy. SHAP is built on Shapley values from cooperative game theory: it attributes each prediction to the input features according to their average marginal contributions.

Research Hypothesis: NDBI dominates the prediction of conductivity and sodium concentration (impervious surfaces increase ionic runoff), while pH prediction has no dominant feature (pH is controlled by geological buffering).

Significance: Explainable AI is used to test hypotheses about physical mechanisms, which strengthens the scientific credibility of the model.
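To make the idea of marginal contributions concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every feature subset. It is an illustration of the underlying definition, not the SHAP library and not the study's model: the toy conductivity function, its coefficients, and the feature values are all hypothetical, and "absent" features are simply set to one baseline value rather than averaged over a background dataset.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction, by enumerating feature subsets."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance's values; the rest stay at baseline.
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {name}) - value(set(subset))
                total += weight * gain
        phi[name] = total
    return phi


def toy_conductivity_model(x):
    # Hypothetical model (not fitted to any data), with one interaction term.
    return (600.0 + 900.0 * x["ndbi"] - 150.0 * x["ndvi"]
            + 20.0 * x["position_km"] + 40.0 * x["ndbi"] * x["position_km"])


instance = {"ndbi": 0.20, "ndvi": 0.30, "position_km": 12.0}
baseline = {"ndbi": 0.00, "ndvi": 0.50, "position_km": 5.0}

phi = shapley_values(toy_conductivity_model, instance, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.2f}")
```

The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which is the property that lets SHAP values be read as a complete breakdown of one prediction. Exact enumeration grows exponentially with the number of features, so it is only practical for small feature sets like the one in this study.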

Section 06

Research Results: Prediction Performance and Model Boundaries

Research Results

  1. Prediction Performance: Conductivity is predicted best (Ridge Regression slightly outperforms Random Forest, suggesting an approximately linear relationship); sodium concentration is predicted only weakly (small sample size plus hydrological mixing effects); pH is almost unpredictable (dominated by geological buffering).
  2. Feature Ablation: Spatial position along the river explains more of the conductivity variation than the Sentinel-2 features do, because watershed spectral signals attenuate in narrow river systems.
  3. Seasonality: Prediction performance is better in summer than in winter (higher ionic concentrations in the dry season produce stronger signals).
  4. Out-of-Domain Evaluation: The model trained on freshwater sites fails at the estuarine sites, indicating that it applies only to land-use-driven freshwater hydrochemistry and cannot resolve tidal mixing processes.
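The feature ablation in item 2 amounts to refitting the model on different feature subsets and comparing cross-validated error. The sketch below shows the mechanics with leave-one-out cross-validation and a single-feature ridge regression in closed form. It is a deliberately reduced version of the comparison: the study uses multi-feature Random Forest and Ridge models, whereas here each "subset" is one feature, the penalty `alpha=1.0` is arbitrary, and all numbers are synthetic values constructed so that position is informative and the spectral index is not.

```python
from math import sqrt
from statistics import mean, pstdev


def fit_ridge_1d(xs, ys, alpha):
    """Fit ridge regression on one standardized feature; return a predict function."""
    x_mean, x_std = mean(xs), pstdev(xs)
    y_mean = mean(ys)
    zs = [(x - x_mean) / x_std for x in xs]
    # Closed-form ridge slope for a single centred, standardized feature.
    slope = sum(z * (y - y_mean) for z, y in zip(zs, ys)) / (sum(z * z for z in zs) + alpha)
    return lambda x: y_mean + slope * (x - x_mean) / x_std


def loocv_rmse(xs, ys, alpha=1.0):
    """Leave-one-out RMSE: each sample is predicted by a model fit on all the others."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        predict = fit_ridge_1d(train_x, train_y, alpha)
        squared_errors.append((predict(xs[i]) - ys[i]) ** 2)
    return sqrt(mean(squared_errors))


# Synthetic data: conductivity rises steadily downstream, NDBI is unrelated scatter.
position_km = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ndbi = [0.10, -0.20, 0.30, 0.00, -0.10, 0.20, -0.30, 0.05]
conductivity = [400.0, 450.0, 500.0, 550.0, 600.0, 650.0, 700.0, 750.0]

rmse_position = loocv_rmse(position_km, conductivity)
rmse_ndbi = loocv_rmse(ndbi, conductivity)
print(f"position only: {rmse_position:.1f} µS/cm")
print(f"NDBI only:     {rmse_ndbi:.1f} µS/cm")
```

On this synthetic data the position-only model has the lower leave-one-out error, mirroring the direction of the reported ablation result. The scaler and the model are refit inside every fold, so the held-out sample never influences its own prediction.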

Section 07

Environmental Significance and Technical Insights: Potential, Limitations, and Future Directions

Environmental Significance and Technical Insights

Application Prospects: The approach offers a low-cost, wide-coverage supplement to watershed water quality monitoring, and is especially valuable for developing countries that lack ground monitoring networks.

Limitations: Signals attenuate because of the narrow channel geometry; hydrological mixing processes such as tides are difficult to capture; prediction performance varies with season.

Technical Insights: Integrate physical constraints; emphasize out-of-domain evaluation to define model boundaries; treat explainable AI as a standard component. Future work can explore multi-source data fusion (hyperspectral imagery, commercial satellites, hydrological models).