# Directional Motion Blindness in Video Large Language Models: Diagnosis and Repair

> This paper shows that video large language models (Video-LLMs) systematically fail to perceive the direction of object motion, and proposes DeltaDirect, a method that addresses the failure by predicting normalized 2D motion vectors from feature differences between frames.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-21T17:59:56.000Z
- Last activity: 2026-05-22T05:22:20.223Z
- Popularity: 146.6
- Keywords: video large language models, motion direction understanding, DeltaDirect, cross-modal alignment, video understanding, MoDirect dataset, direction binding gap
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-22823v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-22823v1

---

## Overview: Directional Motion Blindness in Video Large Language Models

This paper shows that video large language models (Video-LLMs) suffer from "directional motion blindness": they cannot reliably judge which way an object is moving, and their accuracy is close to random guessing. The diagnosis traces the root cause to a "direction binding gap" in cross-modal alignment: direction information is present inside the model but is not mapped to the output vocabulary. The authors propose the DeltaDirect method to repair this gap and build the MoDirect dataset for evaluation. Experiments show that the method substantially improves direction accuracy without degrading the model's original video understanding performance.

## Research Background: Basic Perceptual Blind Spots in Video Understanding

Video large language models have made significant progress in recent years on tasks such as video description and question answering, but they have a basic perceptual defect, "directional motion blindness": their judgments of the motion direction of even simple objects are close to random. This defect limits their use in scenarios such as autonomous driving and motion analysis, and it shows that current models lack basic perceptual capabilities.

## Problem Diagnosis: Locating Breakpoints in Information Flow

On simple synthetic videos in which a single object moves in one of four directions, mainstream Video-LLMs reach about 25% accuracy, which is chance level for four classes. Some apparently high scores result from prediction bias rather than real understanding. Tracing the information flow through three stages (visual encoder, projection layer, and LLM) shows that motion direction information is present at every stage and is linearly separable, yet it is not bound to the text output. This is the direction binding gap, and it points to a failure of cross-modal alignment.
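One way to tell prediction bias apart from real understanding is to report per-direction recall and the distribution of predicted answers alongside overall accuracy. The sketch below is illustrative only: the function name `direction_report`, the label strings, and the toy data are assumptions and do not come from the paper.

```python
from collections import Counter

DIRECTIONS = ("left", "right", "up", "down")

def direction_report(labels, predictions):
    """Return overall accuracy, balanced accuracy, per-direction recall,
    and a histogram of the predicted answers."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    total = Counter(labels)
    correct = Counter()
    for truth, pred in zip(labels, predictions):
        if truth == pred:
            correct[truth] += 1
    # Recall is only defined for directions that occur in the labels.
    recall = {d: correct[d] / total[d] for d in DIRECTIONS if total[d]}
    return {
        "accuracy": sum(correct.values()) / len(labels),
        "balanced_accuracy": sum(recall.values()) / len(recall),
        "recall": recall,
        "predicted": Counter(predictions),
    }

# Hypothetical model that answers "left" for every clip, scored on a skewed test set.
skewed_labels = ["left"] * 70 + ["right"] * 10 + ["up"] * 10 + ["down"] * 10
always_left = ["left"] * len(skewed_labels)

report = direction_report(skewed_labels, always_left)
print(report["accuracy"])           # 0.7 on this skewed set
print(report["balanced_accuracy"])  # 0.25, chance level for four directions
print(report["predicted"])          # every prediction is "left"
```

A constant-answer model looks strong on raw accuracy whenever the test set over-represents its favorite direction, while balanced accuracy and the prediction histogram expose the bias.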

## MoDirect Dataset: A Tool for Evaluating Motion Understanding Capabilities

The MoDirect dataset family consists of two subsets:

1. MoDirect-SynBench (synthetic benchmark): programmatically generated, with controlled variables such as motion direction, object type, and background so that the influence of each factor can be isolated.
2. MoDirect-RealBench (real-world benchmark): drawn from public resources and covering real motion of vehicles, animals, and similar subjects, used to verify generalization.
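To make "programmatically generated" concrete, here is a minimal sketch of a controlled clip generator: a square moves across a blank frame in one of four directions, with direction, speed, object size, and background exposed as parameters. It is not the MoDirect-SynBench generator; `make_clip`, `centroid`, and all parameter names are hypothetical.

```python
# Unit steps in image coordinates (x grows to the right, y grows downward).
STEPS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def make_clip(direction, num_frames=8, size=16, obj=2, speed=1, background=0, value=1):
    """Return a list of size x size frames showing a square moving in `direction`."""
    dx, dy = STEPS[direction]
    travel = speed * (num_frames - 1)
    if obj + travel > size:
        raise ValueError("object would leave the frame")
    # Choose the start so that the whole trajectory stays inside the frame.
    x = 0 if dx > 0 else size - obj if dx < 0 else (size - obj) // 2
    y = 0 if dy > 0 else size - obj if dy < 0 else (size - obj) // 2
    frames = []
    for _ in range(num_frames):
        frame = [[background] * size for _ in range(size)]
        for r in range(y, y + obj):
            for c in range(x, x + obj):
                frame[r][c] = value
        frames.append(frame)
        x += dx * speed
        y += dy * speed
    return frames

def centroid(frame, value=1):
    """Mean (x, y) position of the object pixels in one frame."""
    pts = [(c, r) for r, row in enumerate(frame) for c, v in enumerate(row) if v == value]
    return (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))

clip = make_clip("right")
x0, y0 = centroid(clip[0])
x1, y1 = centroid(clip[-1])
print((x1 - x0, y1 - y0))  # (7.0, 0.0): seven one-pixel steps to the right
```

Because the label is fixed by construction, any variation in model accuracy across object size, speed, or background can be attributed to that one factor.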

## DeltaDirect: A Diagnosis-Driven Repair Scheme

DeltaDirect targets the projection layer. Its core mechanism has two steps:

1. Compute the difference (delta) between the projection-layer features of adjacent frames.
2. Predict a normalized 2D motion vector from that difference; its direction corresponds to the motion direction and its magnitude reflects the saliency of the motion.

Training is multi-task: the main task is video-text alignment and the auxiliary task is motion vector prediction with an MSE loss, so that the original performance is not sacrificed.
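The two steps and the combined objective can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the linear prediction head, the loss weight `aux_weight`, the use of unit-length targets (the post says the magnitude encodes saliency but does not say how it is scaled), and every function name here are hypothetical.

```python
import math

def frame_deltas(features):
    """Step 1: differences between projected features of adjacent frames."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(features, features[1:])]

def normalize(vec, eps=1e-8):
    """Scale a 2D vector to unit length; a near-zero vector is returned unchanged."""
    norm = math.hypot(*vec)
    return list(vec) if norm < eps else [v / norm for v in vec]

def predict_motion(delta, weight, bias):
    """Step 2 (assumed linear head): map one feature delta to a 2D motion vector."""
    return [sum(w * d for w, d in zip(row, delta)) + b for row, b in zip(weight, bias)]

def motion_mse(predictions, targets):
    """Mean squared error over all predicted vector components."""
    errors = [(p - t) ** 2 for pred, tgt in zip(predictions, targets) for p, t in zip(pred, tgt)]
    return sum(errors) / len(errors)

def total_loss(alignment_loss, motion_loss, aux_weight=0.5):
    """Main video-text loss plus the weighted auxiliary loss (weight is an assumption)."""
    return alignment_loss + aux_weight * motion_loss

# Toy per-frame features: the first two dimensions happen to track the object position.
features = [[0.0, 5.0, 1.0], [1.0, 5.0, 1.0], [2.0, 5.0, 1.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # hypothetical head weights
bias = [0.0, 0.0]

deltas = frame_deltas(features)
preds = [predict_motion(d, weight, bias) for d in deltas]
target_right = normalize([3.0, 0.0])   # ground-truth motion to the right
target_left = normalize([-3.0, 0.0])   # a wrong label, for comparison

loss_right = motion_mse(preds, [target_right] * len(preds))
loss_left = motion_mse(preds, [target_left] * len(preds))
print(loss_right, loss_left)  # 0.0 2.0
```

In this toy setup the auxiliary loss is zero when the deltas already encode the correct direction and grows when they disagree with the label, which is the signal the multi-task objective adds on top of the alignment loss.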

## Experimental Results: Significant Repair and Generalization Ability

1. Synthetic data (MoDirect-SynBench): accuracy rose from 25.9% to 85.4% and remained stable across different objects, backgrounds, and speeds.
2. Real-world scenarios (MoDirect-RealBench): accuracy improved by 21.9 percentage points even though no real data was used for training.
3. Standard benchmarks (MSR-VTT and others): the original performance was maintained or slightly improved.
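For reference, the synthetic-benchmark figures can be read in three ways. Only the 25.9% and 85.4% values come from the post; the chance-corrected score is an added illustration that assumes four equally likely directions, so that chance is 25%.

```latex
\begin{align*}
\text{absolute gain} &= 85.4 - 25.9 = 59.5 \text{ percentage points} \\
\text{relative gain} &= 85.4 / 25.9 \approx 3.3\times \\
\text{chance-corrected accuracy} &= \frac{a - 25}{100 - 25}:
  \quad \frac{25.9 - 25}{75} = 1.2\%
  \;\longrightarrow\; \frac{85.4 - 25}{75} \approx 80.5\%
\end{align*}
```

On the chance-corrected scale the baseline sits almost exactly at zero, which matches the description of its behavior as close to random guessing.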

## In-depth Analysis: Why DeltaDirect Works

1. Concept vector analysis: the motion direction concept vectors in the DeltaDirect model are more stable, so the direction signal is not overwhelmed by noise in complex scenes.
2. Attention patterns: the improved model attends more to moving objects and their trajectories, which strengthens temporal modeling.

## Research Insights and Future Exploration Directions

Insights:

1. Basic perceptual ability is a prerequisite for high-level understanding.
2. Capabilities that do not emerge easily need explicit intermediate supervision.
3. The "diagnose first, then repair" methodology is effective.

Future directions: systematic evaluation of blind spots in other perceptual dimensions, adaptive motion understanding, multi-modal motion integration, and hardware-friendly implementation.