SAGAI: An Intelligent Streetscape Assessment and Automatic Mapping System Based on Vision-Language Models

SAGAI is an open-source streetscape analysis workflow that integrates OpenStreetMap, Google Street View, vision-language models, and geospatial analysis to enable zero-shot, fully automated urban environment assessment and interactive mapping.

Tags: vision-language model, urban computing, geospatial AI, OpenStreetMap, Google Street View, zero-shot learning, computer vision, urban planning, generative AI, VLM
Published 2026-05-26 16:41 | Recent activity 2026-05-26 16:49 | Estimated read 6 min

Section 01

[Introduction] SAGAI: A Generative AI-Based System for Streetscape Assessment and Mapping

SAGAI (Streetscape Analysis with Generative AI) is an open-source, end-to-end workflow developed by Joan Perez and G. Fusco and published in the journal Geomatica. It integrates OpenStreetMap (OSM) street networks, Google Street View (GSV) images, and vision-language models (VLMs) to perform zero-shot, fully automated urban streetscape assessment and interactive mapping. Users define a study area and describe the assessment criteria in natural language; SAGAI then produces a scored thematic map, giving urban planning and related fields a flexible and efficient analysis tool.

Section 02

Background and Limitations of Traditional Methods

Traditional urban environment assessment relies on field surveys or manually annotated image datasets, both of which are slow, labor-intensive, and costly. SAGAI was developed to address these pain points: by combining open geospatial data with generative AI, it completes streetscape analysis without pre-training or manual annotation, which lowers the barrier to entry for research.

Section 03

Technical Architecture and Core Components

SAGAI v2.1 adopts a modular design (packaged in Colab notebooks) and consists of three layers:

  1. Geospatial Data Layer: an OSM point sampling generator (extracts street networks and generates sampling points) and a GSV image downloader (captures multi-directional images; requires a Google API key);
  2. Vision-Language Analysis Layer: UVLM (Universal VLM Loader), which supports 11 model checkpoints and offers 4-bit quantization, multi-task parallelism, consensus validation, and chain-of-thought reasoning; task configuration, which defines assessment criteria through natural language prompts; and analysis execution, with batch processing and the ability to resume interrupted runs;
  3. Visualization Output Layer: aggregation and mapping, which uses GeoPandas and Folium to generate interactive HTML maps and supports multiple aggregation methods.
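
To make the data flow through these layers concrete, here is a minimal sketch in plain Python. It is not SAGAI's code: the function names (`sample_points`, `headings`, `aggregate`), the 25 m spacing, the four views per point, and the scores are all illustrative assumptions, and coordinates are assumed to be planar (for example meters in a projected CRS) rather than longitude/latitude.

```python
import math
from statistics import mean, median

def sample_points(line, spacing):
    """Points every `spacing` units along a polyline of (x, y) vertices.

    Assumes projected (planar) coordinates, e.g. meters; not lon/lat degrees.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    points = []
    next_at = 0.0  # distance along the line where the next sample falls
    walked = 0.0   # length of the segments already traversed
    for (x1, y1), (x2, y2) in zip(line, line[1:]):
        seg_len = math.hypot(x2 - x1, y2 - y1)
        while seg_len > 0 and next_at <= walked + seg_len:
            t = (next_at - walked) / seg_len  # fraction of this segment
            points.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
            next_at += spacing
        walked += seg_len
    return points

def headings(n_views):
    """Evenly spaced compass headings in degrees, starting at 0."""
    return [i * 360 / n_views for i in range(n_views)]

# Hypothetical set of aggregation methods; the real tool's options may differ.
AGGREGATORS = {"mean": mean, "median": median, "min": min, "max": max}

def aggregate(scores, point_to_street, method="mean"):
    """Collapse per-point scores into one value per street."""
    grouped = {}
    for point_id, score in scores.items():
        grouped.setdefault(point_to_street[point_id], []).append(score)
    return {street: AGGREGATORS[method](vals) for street, vals in grouped.items()}

street = [(0, 0), (30, 0), (30, 40)]     # an L-shaped street, 70 m long
pts = sample_points(street, spacing=25)  # samples at 0 m, 25 m and 50 m
views = headings(4)                      # four image directions per point
scores = {0: 4, 1: 2, 2: 3}              # made-up per-point VLM scores
by_street = aggregate(scores, {0: "A", 1: "A", 2: "B"}, method="mean")
```

In a real run the per-point scores would come from the vision-language layer, and the per-street values would be joined back to street geometries before mapping.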

Section 04

Application Cases and Empirical Studies

SAGAI v1.0 includes two pilot studies:

  • Paillon Valley, Nice, France: captures differences in environmental quality between sections of the valley, verifying applicability in real cities;
  • Penzing-Wolfersberg, Vienna, Austria: handles mixed suburban landscapes (residential, industrial, green spaces), demonstrating the ability to analyze heterogeneous areas.

The case data (except the GSV images) has been released with the GitHub repository, providing reproducible benchmarks.

Section 05

Technical Limitations and Considerations

SAGAI has the following limitations:

  1. Street View Timeliness: GSV images may lag behind real-world changes;
  2. VLM Bias: Models may inherit geographic biases from their training data and therefore interpret non-Western cities less reliably; because they work from images alone, they also cannot capture non-visual qualities such as smells or sounds;
  3. API Dependence: Google Maps API availability and cost limit large-scale applications;
  4. Privacy Considerations: High-resolution point-by-point analysis needs to comply with local privacy regulations.
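
The API cost constraint can be estimated before a study is run, since the number of image requests is simply sampling points times views per point. The sketch below is a back-of-the-envelope helper, not part of SAGAI; the function names are hypothetical, and the price is a placeholder parameter that should be replaced with the rate currently published by Google.

```python
def estimate_requests(n_points, views_per_point):
    """Total image requests: one per sampling point and viewing direction."""
    return n_points * views_per_point

def estimate_cost(n_points, views_per_point, price_per_1000_requests):
    """Cost in the currency of `price_per_1000_requests` (a placeholder rate)."""
    requests = estimate_requests(n_points, views_per_point)
    return requests * price_per_1000_requests / 1000

# Example with made-up numbers: 2,500 points, 4 views each, 7.0 per 1,000 requests.
n_requests = estimate_requests(2500, 4)
cost = estimate_cost(2500, 4, price_per_1000_requests=7.0)
```

Because the request count grows linearly with both sampling density and the number of views, halving either one halves the estimated bill.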

Section 06

Summary and Paradigm Significance

SAGAI represents a paradigm shift in urban analysis from data-driven to prompt-driven. The same infrastructure supports any assessment dimension (e.g., walkability, architectural aesthetics) without retraining the model; natural language prompts make assessment criteria interpretable and easy to compare; and open data and free resources (Colab) reduce costs. As multimodal models evolve, geospatial AI tools will become more capable, and urban science may move toward 'prompt as analysis'.
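
As an illustration of what "prompt-driven" could look like in practice, the following sketch defines two assessment tasks that differ only in their wording and runs them through the same function. Everything here is an assumption for illustration: the `Task` class, `parse_score`, `assess`, and the stand-in `fake_model` are hypothetical and do not reflect SAGAI's or UVLM's actual interfaces.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    """An assessment criterion expressed as a natural language question."""
    name: str
    question: str
    low: int = 1
    high: int = 5

    def prompt(self) -> str:
        return (f"{self.question} Answer with a single integer "
                f"from {self.low} to {self.high}.")

def parse_score(reply, task):
    """First integer in the reply that lies within the task's range, else None."""
    for token in re.findall(r"-?\d+", reply):
        value = int(token)
        if task.low <= value <= task.high:
            return value
    return None

def assess(image, task, model):
    """Run one task on one image; `model` is any callable (image, prompt) -> str."""
    return parse_score(model(image, task.prompt()), task)

def fake_model(image, prompt):
    """Stand-in for a real vision-language model, used only for this demo."""
    return "I would rate this scene 4 out of 5."

walkability = Task("walkability", "How walkable does this street look?")
greenery = Task("greenery", "How much vegetation is visible?", low=0, high=10)

# Switching the assessment dimension means switching the Task, nothing else.
results = {t.name: assess(None, t, fake_model) for t in (walkability, greenery)}
```

Adding a new dimension, such as architectural aesthetics, would only require another `Task` with a different question.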