DualGeo: A Dual-Perspective Framework for Global Image Geolocation

This paper proposes DualGeo, a two-stage framework that fuses image and semantic segmentation features through bidirectional cross-attention, then refines the retrieved candidates with geographic clustering re-ranking and large multimodal model (LMM) reasoning. On the IM2GPS, IM2GPS3k, and YFCC4k benchmarks, it improves street-level and city-level geolocation accuracy by 3.6%-16.58% and 1.29%-8.77%, respectively.

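Tags: image geolocation, semantic segmentation, multimodal fusion, contrastive learning, geographic clustering, LMM reasoning, IM2GPS, visual localization
Published 2026-04-28 20:00 | Recent activity 2026-04-29 10:53 | Estimated read 5 min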

Section 01

DualGeo: A Dual-Perspective Framework to Improve Global Image Geolocation Accuracy

DualGeo is a two-stage framework. Stage 1 fuses image and semantic segmentation features with bidirectional cross-attention; Stage 2 applies geographic clustering re-ranking and LMM reasoning to the retrieved candidates. On the IM2GPS, IM2GPS3k, and YFCC4k benchmarks, it improves street-level (<1 km) accuracy by 3.6%-16.58% and city-level (<25 km) accuracy by 1.29%-8.77%, offering a new approach to global image geolocation.

Section 02

Task Background: Two Major Challenges in Global Image Geolocation

Global image geolocation means inferring the geographic coordinates of an image taken anywhere in the world, at scales ranging from street level (meter-scale) to city level (kilometer-scale). Existing methods face two major challenges: 1. Visual features are sensitive to environmental change, so feature matching often fails for the same location across seasons, weather, or lighting conditions; 2. There is no effective mechanism for filtering outliers, so noisy retrieval candidates limit accuracy.

Section 03

Stage 1: Building Robust Geographic Representations via Multimodal Fusion and Contrastive Learning

Stage 1 aims to build a robust geographic representation space in which semantically similar images lie close together. It relies on three core strategies: 1. Multimodal feature fusion: image features capture visual detail, while semantic segmentation features capture more robust semantic content; 2. Bidirectional cross-attention fusion: the image→segmentation direction learns visual-to-semantic correspondences, and the segmentation→image direction learns semantic-to-visual correspondences; 3. Dual-perspective contrastive alignment: image-coordinate alignment and semantic-geographic association are used to build a global retrieval database.
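
The bidirectional cross-attention step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes single-head scaled dot-product attention with no learned projections, a residual add, and mean pooling, and the names `cross_attend` and `fuse` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, keys_values):
    """Each query token gathers a weighted average of the other modality's tokens.

    Assumption: single head, no learned Q/K/V projections.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                         for j in range(d)])
    return attended

def mean_pool(tokens):
    """Average a list of token vectors into one vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

def fuse(image_tokens, seg_tokens):
    """Run attention in both directions, add residuals, pool, and concatenate."""
    img_ctx = cross_attend(image_tokens, seg_tokens)  # image -> segmentation
    seg_ctx = cross_attend(seg_tokens, image_tokens)  # segmentation -> image
    img = [[a + b for a, b in zip(t, c)] for t, c in zip(image_tokens, img_ctx)]
    seg = [[a + b for a, b in zip(t, c)] for t, c in zip(seg_tokens, seg_ctx)]
    return mean_pool(img) + mean_pool(seg)

# Toy input: 2 image tokens and 3 segmentation tokens, each of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
seg_tokens = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
fused = fuse(image_tokens, seg_tokens)
```

The fused vector has twice the token dimension because the two pooled views are concatenated. A real model would add learned projections and multiple heads before the contrastive alignment step.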

Section 04

Stage 2: Refining Geolocation Results via Geographic Clustering + LMM Reasoning

Stage 2 refines the retrieval results in two steps: 1. Geographic clustering re-ranking: identifies spatially coherent groups of candidates, filters out isolated outliers, and boosts the ranking of candidates from large clusters; 2. LMM reasoning: takes the query image, candidate satellite/street-view images, and geographic context as input, and outputs the final coordinates based on visual similarity and geographic plausibility, compensating for the limitations of pure feature matching.
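
The clustering re-ranking step can be sketched as follows. The summary does not specify the clustering algorithm, so this sketch assumes single-linkage grouping by great-circle distance; the function names and the `radius_km`, `min_size`, and `boost` parameters are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))

def cluster_labels(points, radius_km):
    """Single-linkage clustering via union-find: points within radius_km are linked."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= radius_km:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(points))]

def rerank(candidates, radius_km=25.0, min_size=2, boost=0.1):
    """candidates: list of (lat, lon, retrieval_score).

    Drops candidates in clusters smaller than min_size and adds a score
    bonus that grows with cluster size, then sorts by the adjusted score.
    """
    labels = cluster_labels([(lat, lon) for lat, lon, _ in candidates], radius_km)
    sizes = Counter(labels)
    kept = [(lat, lon, score + boost * (sizes[lab] - 1))
            for (lat, lon, score), lab in zip(candidates, labels)
            if sizes[lab] >= min_size]
    if not kept:  # every candidate is isolated: fall back to the raw ranking
        kept = list(candidates)
    return sorted(kept, key=lambda c: c[2], reverse=True)

# Three mutually close candidates in Paris and one isolated high-scoring outlier in Tokyo.
candidates = [
    (35.6762, 139.6503, 0.90),
    (48.8566, 2.3522, 0.80),
    (48.8606, 2.3376, 0.75),
    (48.8530, 2.3499, 0.70),
]
reranked = rerank(candidates)
```

In this toy case the Tokyo candidate has the highest raw score but is removed as an isolated outlier, and the Paris candidates move up. The thresholds are the kind of clustering parameters that would need tuning in practice.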

Section 05

Experimental Validation: Accuracy Improvements on Three Benchmarks

The framework is evaluated on three benchmarks (IM2GPS, IM2GPS3k, YFCC4k), focusing on street-level (<1 km) and city-level (<25 km) accuracy. Street-level accuracy improves by 3.6%-16.58%, which is attributed to dual-perspective fusion and clustering post-processing. City-level accuracy improves by 1.29%-8.77%, indicating the framework's robustness.
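
For readers unfamiliar with threshold-based metrics, the following sketch shows how accuracy at a distance threshold is typically computed: a prediction counts as correct when its great-circle error falls under the threshold. The function names and sample coordinates are illustrative assumptions, not data from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))

def accuracy_at(predictions, ground_truth, thresholds_km=(1.0, 25.0)):
    """Fraction of predictions whose error is strictly under each threshold."""
    errors = [haversine_km(p, t) for p, t in zip(predictions, ground_truth)]
    return {thr: sum(e < thr for e in errors) / len(errors) for thr in thresholds_km}

# Illustrative data: one exact hit, one miss of roughly 10 km, one miss on another continent.
ground_truth = [(48.8566, 2.3522)] * 3
predictions = [(48.8566, 2.3522), (48.9466, 2.3522), (35.6762, 139.6503)]
acc = accuracy_at(predictions, ground_truth)
```

Here one of three predictions falls within 1 km and two of three fall within 25 km, so the street-level figure is always bounded above by the city-level figure on the same predictions.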

Section 06

Technical Insights: Key Reasons for DualGeo's Performance Improvement

Core reasons for DualGeo's effectiveness: 1. Semantic-visual complementarity: fusing the two feature types lets each offset the other's weaknesses; 2. Spatial consistency constraints: a photo has a single true location, so candidates that are spatially inconsistent with the rest can be filtered out as outliers; 3. Hierarchical decision-making architecture: coarse-grained retrieval → clustering refinement → LMM reasoning, balancing efficiency and accuracy.

Section 07

Limitations and Future Directions: Areas for DualGeo Improvement

Current limitations: high computational cost (multimodal feature extraction plus LMM reasoning), propagation of semantic segmentation errors into later stages, and the need to tune clustering parameters. Future directions: more efficient implementations that reduce cost without sacrificing accuracy, dynamic clustering thresholds, and a temporal extension that makes use of image time information.