# PixDLM: A Dual-Path Multimodal Reasoning Segmentation Model for UAV Scenarios

> A CVPR 2026 Highlight from the Xiamen University team, PixDLM addresses the challenges of UAV scenarios, such as small objects, a large field of view, and high scene complexity, by decoupling semantic reasoning from pixel-level perception into two paths, achieving leading performance on the DRSeg benchmark.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-20T04:04:17.000Z
- Last activity: 2026-04-20T04:20:45.203Z
- Popularity: 158.7
- Keywords: PixDLM, UAV reasoning segmentation, multimodal large models, UAV vision, CVPR2026, dual-path architecture, SAM 2.1, LLaVA, DRSeg dataset, referring segmentation, chain-of-thought reasoning, small object detection
- Page link: https://www.zingnex.cn/en/forum/thread/pixdlm
- Canonical: https://www.zingnex.cn/forum/thread/pixdlm

---

## Introduction

Proposed by the Xiamen University team and selected as a CVPR 2026 Highlight, PixDLM tackles the challenges of UAV scenarios, such as small objects, a large field of view, and high scene complexity, by decoupling semantic reasoning from pixel-level perception in a dual-path architecture, achieving leading performance on the DRSeg benchmark. The work also releases DRSeg, the first UAV reasoning-segmentation dataset, and open-sources the model weights, code, and data, offering a new approach to UAV visual understanding.

## Research Background and Task Definition of UAV Reasoning Segmentation

### Research Background
UAV aerial image analysis faces three major challenges:
1. 58.08% of instances are small objects occupying less than 1% of the image area;
2. flight heights of 30-100 m cause drastic fluctuations in target scale;
3. dense geographic elements require understanding of spatial relationships and context.

Traditional referring segmentation models struggle with complex reasoning instructions, while MLLMs lack pixel-level localization capabilities, which motivated the "reasoning segmentation" direction.
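The small-object statistic above can be made concrete with a short sketch: given a set of binary instance masks, count the fraction whose foreground covers less than 1% of the image. The function name and threshold are illustrative, not taken from the paper.

```python
import numpy as np

def small_object_ratio(masks: list, area_threshold: float = 0.01) -> float:
    """Fraction of instance masks covering less than `area_threshold`
    of the image area. Each mask is a 0/1 array of shape (H, W)."""
    if not masks:
        return 0.0
    small = sum(
        1 for m in masks
        if m.sum() < area_threshold * m.shape[0] * m.shape[1]
    )
    return small / len(masks)

# Example: one 3%-area instance and one 0.5%-area instance in a 100x100 image
big = np.zeros((100, 100), dtype=np.uint8)
big[:30, :10] = 1   # 300 px -> 3% of the image
tiny = np.zeros((100, 100), dtype=np.uint8)
tiny[:5, :10] = 1   # 50 px -> 0.5% of the image
print(small_object_ratio([big, tiny]))  # 0.5
```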

### Task Definition
UAV reasoning segmentation is an instruction-driven pixel-level prediction task: the model must understand complex instructions that require logical reasoning, perform spatial and attribute inference, and output a precise segmentation mask. Existing models have three limitations: reasoning and perception are coupled, training data is scarce, and long-chain reasoning lacks consistency.
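The task's input/output contract can be sketched as a simple data structure: an aerial image, a free-form instruction that may require multi-step reasoning, and a target binary mask. The class and field names below are illustrative, not from any released codebase.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ReasoningSegSample:
    """One UAV reasoning-segmentation example: an image, an instruction
    that may require multi-step reasoning, and the ground-truth mask."""
    image: np.ndarray        # (H, W, 3) aerial image
    instruction: str         # free-form query about the scene
    target_mask: np.ndarray  # (H, W) binary mask for the referred object

sample = ReasoningSegSample(
    image=np.zeros((512, 512, 3), dtype=np.uint8),
    instruction="Segment the vehicle nearest the flooded road.",
    target_mask=np.zeros((512, 512), dtype=np.uint8),
)
print(sample.target_mask.shape)  # (512, 512)
```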

## PixDLM Architecture: Core Innovation of Dual-Path Decoupling

The core of PixDLM is an explicitly decoupled dual-path design:
- **Semantic Reasoning Path**: Based on LLaVA-v1.6-Vicuna-7B, it is responsible for understanding instructions, chain-of-thought reasoning, and generating structured queries.
- **Pixel-Level Visual Path**: Integrates SAM 2.1 and CLIP visual encoders to provide high-quality pixel features.
- **Dual-Path Collaboration**: A lightweight cross-path attention module enables "reasoning-guided perception" to dynamically adjust visually focused regions.

Technical innovations: Explicit decoupling, hierarchical fusion, and reasoning consistency constraints.
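A minimal sketch of the "reasoning-guided perception" idea, assuming the cross-path module is a standard cross-attention in which queries from the semantic path attend over pixel features from the visual path; the actual module design is not detailed in this summary, and all names and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_path_attention(reasoning_queries, pixel_features):
    """Hypothetical cross-path step: structured queries from the LLM
    path attend over flattened per-patch features from the visual path.

    reasoning_queries: (Q, D) queries emitted by the semantic path
    pixel_features:    (N, D) per-patch features (e.g. SAM/CLIP fused)
    Returns (Q, D) pixel-conditioned query embeddings.
    """
    d = reasoning_queries.shape[-1]
    scores = reasoning_queries @ pixel_features.T / np.sqrt(d)  # (Q, N)
    weights = softmax(scores, axis=-1)                          # rows sum to 1
    return weights @ pixel_features                             # (Q, D)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 32))    # 4 structured queries
kv = rng.standard_normal((64, 32))  # 8x8 patch grid, flattened
out = cross_path_attention(q, kv)
print(out.shape)  # (4, 32)
```

In a real model the output queries would be fed to a mask decoder; here the point is only that the reasoning path steers which pixel features dominate each query.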

## DRSeg Dataset: The First Benchmark for UAV Reasoning Segmentation

### Statistics
| Attribute | Value |
|------|------|
| Number of images | 10,000 high-resolution UAV images |
| Instance masks | 10,000 precise annotations |
| Reasoning QA pairs | 10,000 chain reasoning annotations |
| Flight heights | Three levels (30m/60m/100m) |
| Small object ratio | 58.08% of instances are less than 1% of the image area |

### Distribution of Reasoning Types
Spatial reasoning (33.33%), attribute reasoning (33.34%), and scene-level reasoning (33.33%) are evenly distributed, enhancing generalization.
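A DRSeg-style record presumably pairs an image with a flight height, a reasoning-type label, a question, chain-of-thought steps, and a mask path. The JSON schema below is entirely hypothetical, shown only to illustrate what such an annotation might contain; consult the released dataset for the real format.

```python
import json

# Hypothetical DRSeg-style record; the actual released schema may differ.
record_json = """
{
  "image": "drseg/images/000123.jpg",
  "flight_height_m": 60,
  "reasoning_type": "spatial",
  "question": "Which truck is parked closest to the warehouse entrance?",
  "chain_of_thought": [
    "Locate the warehouse entrance on the south side.",
    "Compare each truck's distance to the entrance.",
    "The leftmost truck is closest."
  ],
  "mask": "drseg/masks/000123_0.png"
}
"""

record = json.loads(record_json)
assert record["reasoning_type"] in {"spatial", "attribute", "scene"}
print(record["flight_height_m"])  # 60
```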

## Experimental Results: Leading Performance of PixDLM on DRSeg and General Benchmarks

### Advantages on DRSeg Benchmark
- Small object segmentation: IoU improved by over 15% (for instances with <1% area);
- Multi-height generalization: Performance fluctuation across 30/60/100m heights is <5%;
- Complex instructions: Success rate for reasoning with more than 3 steps is significantly improved.
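The IoU figures above are presumably mask intersection-over-union; a standard implementation for two binary masks is short enough to state directly (the empty-mask convention is a common choice, not specified by the source).

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union

pred = np.zeros((10, 10)); pred[:5, :5] = 1  # 25 px predicted
gt = np.zeros((10, 10)); gt[:5, 2:7] = 1     # 25 px ground truth, 15 px overlap
print(round(mask_iou(pred, gt), 4))  # 0.4286 (= 15 / 35)
```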

### Ablation Experiments
- Removing dual-path decoupling: Small object performance drops by about 20%;
- Upgrading from SAM 1.0 to SAM 2.1: boundary accuracy improves by 8%;
- Introducing CoT supervision: Success rate for complex instructions improves by 12%.

### Cross-Benchmark Generalization
Matches dedicated models on general referring segmentation benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg.

## Open Source and Applications: Deployment Potential and Future Directions of PixDLM

### Open Source Ecosystem
Pre-trained weights (HuggingFace), inference/training code, and the DRSeg dataset have been open-sourced.

### Application Scenarios
Emergency rescue (locating disaster-stricken targets), agricultural monitoring (crop health assessment), infrastructure inspection (anomaly detection), urban planning (spatial index analysis).

### Future Directions
- Expand the dataset to 100K+ images;
- Enhance long-chain reasoning (more than 5 steps);
- Develop lightweight variants for edge-device deployment;
- Support multi-UAV collaboration.

## Academic Contributions and Summary: Value and Significance of PixDLM

### Academic Contributions
1. Task innovation: First expansion of reasoning segmentation to UAV scenarios;
2. Architecture innovation: Dual-path decoupling provides new ideas for MLLM pixel-level tasks;
3. Data contribution: DRSeg fills the gap in UAV reasoning segmentation data.

### Summary
PixDLM addresses the core challenges of UAV reasoning segmentation through dual-path decoupling, and its architectural paradigm can serve as a reference for other multimodal applications that require precise localization. The open-sourced model and dataset should accelerate the practical adoption of intelligent UAV analysis.
