Zing Forum


PixDLM: A Dual-Path Multimodal Reasoning Segmentation Model for UAV Scenarios

A CVPR 2026 Highlight from the Xiamen University team, PixDLM tackles UAV-scenario challenges such as small objects, wide fields of view, and high scene complexity by decoupling semantic reasoning and pixel perception into dual paths, achieving leading performance on the DRSeg benchmark.

Tags: PixDLM · UAV reasoning segmentation · multimodal large language model · UAV vision · CVPR 2026 · dual-path architecture · SAM 2.1 · LLaVA · DRSeg dataset · referring segmentation
Published 2026-04-20 12:04 · Recent activity 2026-04-20 12:20 · Estimated read: 8 min

Section 01

[Introduction] PixDLM: A Dual-Path Multimodal Reasoning Segmentation Model for UAV Scenarios

PixDLM, a CVPR 2026 Highlight from the Xiamen University team, tackles the UAV-specific challenges of small objects, wide fields of view, and high scene complexity by decoupling semantic reasoning and pixel perception into two paths, achieving leading performance on the DRSeg benchmark. The team also releases DRSeg, the first UAV reasoning segmentation dataset, and has open-sourced the model weights, code, and data, offering a new approach to UAV visual understanding.


Section 02

Research Background and Task Definition of UAV Reasoning Segmentation

Research Background

UAV aerial image analysis faces three major challenges: (1) 58.08% of instances are small objects occupying less than 1% of the image area; (2) flight heights of 30-100 m cause drastic fluctuations in target scale; (3) dense geographic elements demand an understanding of spatial relationships and context. Traditional referring segmentation models struggle with complex reasoning instructions, while MLLMs lack pixel-level localization, motivating the "reasoning segmentation" direction.
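The "small object" statistic above has a simple operational definition: count the instances whose mask covers less than 1% of the image. A minimal numpy sketch of that check, assuming boolean per-instance masks (the threshold and mask format are illustrative, not taken from the paper's code):

```python
import numpy as np

def small_object_ratio(instance_masks, area_threshold=0.01):
    """Fraction of instances whose mask covers less than
    `area_threshold` of the image area (here: 1%)."""
    small = sum(1 for mask in instance_masks  # each mask: boolean HxW array
                if mask.mean() < area_threshold)
    return small / len(instance_masks)

# Toy example: two tiny instances and one large one on a 100x100 image.
h, w = 100, 100
tiny_a = np.zeros((h, w), dtype=bool); tiny_a[0:5, 0:5] = True      # 0.25% area
tiny_b = np.zeros((h, w), dtype=bool); tiny_b[10:14, 10:14] = True  # 0.16% area
large = np.zeros((h, w), dtype=bool); large[0:50, 0:50] = True      # 25% area

print(small_object_ratio([tiny_a, tiny_b, large]))  # → 0.6666666666666666
```

On DRSeg, the same computation over all annotated instances yields the reported 58.08%.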

Task Definition

UAV reasoning segmentation is an instruction-driven pixel-level prediction task: the model must understand complex instructions that involve logical reasoning, perform spatial and attribute reasoning, and output precise segmentation masks. Existing models have three limitations: reasoning and perception are coupled, training data is scarce, and long-chain reasoning is inconsistent.


Section 03

PixDLM Architecture: Core Innovation of Dual-Path Decoupling

The core of PixDLM is an explicitly decoupled dual-path design:

  • Semantic Reasoning Path: Based on LLaVA-v1.6-Vicuna-7B, it handles instruction understanding, chain-of-thought reasoning, and the generation of structured queries.
  • Pixel-Level Visual Path: Integrates SAM 2.1 and CLIP visual encoders to provide high-quality pixel features.
  • Dual-Path Collaboration: A lightweight cross-path attention module enables "reasoning-guided perception" to dynamically adjust visually focused regions.

Technical innovations: Explicit decoupling, hierarchical fusion, and reasoning consistency constraints.
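The "reasoning-guided perception" idea can be sketched as a single cross-attention step in which structured query tokens from the semantic path attend over pixel-path features. This is a minimal numpy illustration of the mechanism only; all dimensions, names, and the single-head form are assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_path_attention(queries, pixel_feats, w_q, w_k, w_v):
    """One 'reasoning-guided perception' step: queries from the semantic
    path attend over pixel-path features, producing query-conditioned
    visual features that a mask decoder could consume."""
    q = queries @ w_q                               # (n_q, d)
    k = pixel_feats @ w_k                           # (n_pix, d)
    v = pixel_feats @ w_v                           # (n_pix, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_q, n_pix)
    return attn @ v                                 # (n_q, d)

rng = np.random.default_rng(0)
d = 32
queries = rng.normal(size=(4, d))        # e.g. 4 structured queries from the LLM path
pixel_feats = rng.normal(size=(256, d))  # e.g. 16x16 patch features from SAM/CLIP
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out = cross_path_attention(queries, pixel_feats, w_q, w_k, w_v)
print(out.shape)  # (4, 32)
```

The attention weights make the visually focused regions depend on what the reasoning path asked for, which is the stated role of the lightweight cross-path module.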


Section 04

DRSeg Dataset: The First Benchmark for UAV Reasoning Segmentation

Statistics

  • Number of images: 10,000 high-resolution UAV images
  • Instance masks: 10,000 precise annotations
  • Reasoning QA pairs: 10,000 chain-of-thought annotations
  • Flight heights: three levels (30 m / 60 m / 100 m)
  • Small-object ratio: 58.08% of instances cover less than 1% of the image area

Distribution of Reasoning Types

Spatial reasoning (33.33%), attribute reasoning (33.34%), and scene-level reasoning (33.33%) are evenly distributed, enhancing generalization.
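Putting the statistics and reasoning types together, a single DRSeg-style record might look like the following sketch. Every field name and value here is a hypothetical illustration, not the released schema:

```python
# Illustrative DRSeg-style record; field names are assumptions,
# not the dataset's actual schema.
sample = {
    "image_path": "images/000123.jpg",
    "flight_height_m": 60,            # one of {30, 60, 100}
    "instruction": "Segment the vehicle closest to the crossroad "
                   "that is facing north.",
    "reasoning_chain": [               # chain-of-thought annotation
        "Locate the crossroad in the scene.",
        "Find the vehicles near it.",
        "Select the one facing north.",
    ],
    "reasoning_type": "spatial",       # spatial | attribute | scene
    "mask_path": "masks/000123.png",
}

assert sample["flight_height_m"] in {30, 60, 100}
assert sample["reasoning_type"] in {"spatial", "attribute", "scene"}
print(len(sample["reasoning_chain"]))  # → 3
```

The even three-way split of reasoning types means a model cannot overfit to one reasoning style during training.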


Section 05

Experimental Results: Leading Performance of PixDLM on DRSeg and General Benchmarks

Advantages on DRSeg Benchmark

  • Small object segmentation: IoU improved by over 15% (for instances with <1% area);
  • Multi-height generalization: Performance fluctuation across 30/60/100m heights is <5%;
  • Complex instructions: Success rate for reasoning with more than 3 steps is significantly improved.
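The small-object metric above restricts mask IoU to the subset of ground-truth instances under the 1% area threshold. A minimal numpy sketch of that evaluation, assuming boolean masks (the function names and threshold handling are illustrative, not the benchmark's evaluation code):

```python
import numpy as np

def iou(pred, gt):
    """Standard mask IoU for boolean arrays."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def small_object_iou(preds, gts, area_threshold=0.01):
    """Mean IoU restricted to ground-truth instances covering less than
    `area_threshold` of the image (the <1%-area small-object subset)."""
    scores = [iou(p, g) for p, g in zip(preds, gts)
              if g.mean() < area_threshold]
    return float(np.mean(scores)) if scores else float("nan")

# Toy check: one small instance predicted with partial overlap.
gt = np.zeros((100, 100), dtype=bool); gt[0:4, 0:4] = True       # 0.16% area
pred = np.zeros((100, 100), dtype=bool); pred[0:4, 2:6] = True   # shifted right
print(small_object_iou([pred], [gt]))  # → 0.3333333333333333
```

Filtering on the ground-truth area (rather than the prediction's) keeps the subset fixed across models, so the reported 15% gain is comparable between methods.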

Ablation Experiments

  • Removing dual-path decoupling: Small object performance drops by about 20%;
  • Upgrading SAM 1.0 to SAM 2.1: Boundary accuracy improves by 8%;
  • Introducing CoT supervision: Success rate for complex instructions improves by 12%.

Cross-Benchmark Generalization

Matches specialist models on general referring segmentation benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg.


Section 06

Open Source and Applications: Deployment Potential and Future Directions of PixDLM

Open Source Ecosystem

Pre-trained weights (HuggingFace), inference/training code, and the DRSeg dataset have been open-sourced.

Application Scenarios

Emergency rescue (locating disaster-stricken targets), agricultural monitoring (crop health assessment), infrastructure inspection (anomaly detection), urban planning (spatial index analysis).

Future Directions

  • Expand the dataset to 100K+;
  • Enhance long-chain reasoning (more than 5 steps);
  • Lightweight versions for edge devices;
  • Multi-UAV collaboration.

Section 07

Academic Contributions and Summary: Value and Significance of PixDLM

Academic Contributions

  1. Task innovation: First expansion of reasoning segmentation to UAV scenarios;
  2. Architecture innovation: Dual-path decoupling provides new ideas for MLLM pixel-level tasks;
  3. Data contribution: DRSeg fills the gap in UAV reasoning segmentation data.

Summary

PixDLM addresses the core challenges of UAV reasoning segmentation through dual-path decoupling, and its architectural paradigm can serve as a reference for other multimodal applications that require precise localization. The open-sourced code and dataset will further the practical deployment of intelligent UAV analysis.