# DeltaDirect: Addressing the "Motion Direction Blindness" Problem in Video-LLMs

> This article introduces DeltaDirect, a method that addresses a fundamental flaw of Video-LLMs: they cannot perceive the direction of object motion. The study finds that most Video-LLMs cannot reliably determine whether an object moves left, right, up, or down, and proposes closing this "direction binding gap" with an auxiliary objective at the projection layer that predicts the 2D motion vector from feature differences between adjacent frames.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-21T17:59:56.000Z
- Last activity: 2026-05-22T13:51:17.471Z
- Popularity: 122.1
- Keywords: Video-LLM, motion direction perception, DeltaDirect, video understanding, multimodal large models, direction binding gap, temporal reasoning, computer vision
- Page link: https://www.zingnex.cn/en/forum/thread/deltadirect
- Canonical: https://www.zingnex.cn/forum/thread/deltadirect

---

DeltaDirect targets a fundamental flaw of Video-LLMs (Video Large Language Models): "directional motion blindness", the inability to perceive the direction in which an object moves. The study finds that most Video-LLMs cannot reliably tell whether an object moves left, right, up, or down. The root cause is the "direction binding gap": the model implicitly encodes motion information but cannot map it to discrete language concepts. DeltaDirect closes this gap by adding an auxiliary objective at the projection layer that predicts the 2D motion vector from feature differences between adjacent frames, which improves motion direction perception in real-world videos.

## Motion Direction Perception Defects of Video-LLMs and Their Root Causes

Video-LLMs have made significant progress on temporal tasks, but they suffer from "directional motion blindness": on simple object motion direction tests, their accuracy is close to the 25% expected from random guessing among four directions (left, right, up, down), and the slightly higher scores mostly reflect prediction biases rather than true understanding. By tracing the information flow, the study finds that motion direction is linearly decodable from the visual encoder, the projection layer, and the LLM hidden states, yet the model fails to bind this information to language concepts such as "left" and "right". This failure is the "direction binding gap".
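To make "linearly decodable" concrete, here is a minimal sketch of a linear probe. It is not the study's probing setup: the features are synthetic stand-ins (a hypothetical random linear embedding of each direction plus noise), and the probe is a nearest-class-mean classifier, chosen because it is linear and needs no iterative training. All names (`make_features`, `fit_probe`, `DIM`) are illustrative assumptions.

```python
import random

# Hypothetical stand-in for frozen model features: each direction's unit vector
# is embedded by a fixed random linear map, then perturbed with Gaussian noise.
DIRECTIONS = {"left": (-1.0, 0.0), "right": (1.0, 0.0), "up": (0.0, 1.0), "down": (0.0, -1.0)}
DIM = 16  # assumed feature dimension, for illustration only


def make_features(rng, embed, n_per_class, noise=0.3):
    """Return (feature, label) pairs with direction linearly encoded in the feature."""
    data = []
    for label, (dx, dy) in DIRECTIONS.items():
        for _ in range(n_per_class):
            feat = [row[0] * dx + row[1] * dy + rng.gauss(0.0, noise) for row in embed]
            data.append((feat, label))
    return data


def fit_probe(train):
    """Fit a nearest-class-mean probe, which is linear: score_c(x) = m_c . x - |m_c|^2 / 2."""
    sums, counts = {}, {}
    for feat, label in train:
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, value in enumerate(feat):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    probe = {}
    for label, acc in sums.items():
        mean = [value / counts[label] for value in acc]
        bias = -0.5 * sum(m * m for m in mean)
        probe[label] = (mean, bias)
    return probe


def predict(probe, feat):
    """Return the direction label whose linear score is highest."""
    def score(label):
        weights, bias = probe[label]
        return sum(w * x for w, x in zip(weights, feat)) + bias
    return max(probe, key=score)


def probe_accuracy(probe, data):
    """Fraction of held-out samples whose direction the probe recovers."""
    return sum(predict(probe, feat) == label for feat, label in data) / len(data)


rng = random.Random(0)
embed = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(DIM)]
probe = fit_probe(make_features(rng, embed, n_per_class=50))
accuracy = probe_accuracy(probe, make_features(rng, embed, n_per_class=50))
chance = 1.0 / len(DIRECTIONS)
print(f"probe accuracy: {accuracy:.2f} (chance: {chance:.2f})")
```

In a real diagnosis, the synthetic features would be replaced by activations extracted from a given stage of the model. A probe that scores far above chance while the model's own answers stay near chance is the signature of a binding gap rather than missing information.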

## DeltaDirect: A Solution Using an Auxiliary Objective Function in the Projection Layer

Training on synthetic data alone generalizes poorly to real videos. To address this, DeltaDirect adds an auxiliary objective at the projection layer: explicitly predicting the normalized 2D motion vector encoded in the feature difference between adjacent frames. The core idea is to preserve and strengthen the motion direction signal that the visual encoder already carries. An auxiliary prediction head takes the difference between the projected features of adjacent frames and outputs a 2D motion vector; this head is optimized jointly with the language modeling objective to build a robust direction perception mechanism.
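The shape of such an auxiliary objective can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear head, the squared-error loss, and the weighting factor `aux_weight` are all hypothetical choices, since the article does not specify the head architecture, the loss form, or the loss weight.

```python
import math


def normalize(vec, eps=1e-8):
    """Scale a 2D displacement to unit length so that only its direction remains."""
    norm = math.hypot(vec[0], vec[1])
    return (vec[0] / (norm + eps), vec[1] / (norm + eps))


def aux_motion_loss(proj_prev, proj_next, head_weights, head_bias, displacement):
    """Squared error between the head's 2D prediction and the unit motion vector.

    proj_prev, proj_next: projected features of two adjacent frames.
    head_weights, head_bias: a 2 x D linear head (an assumed, hypothetical design).
    displacement: ground-truth object displacement between the two frames.
    """
    delta = [nxt - prev for prev, nxt in zip(proj_prev, proj_next)]
    pred = [
        sum(w * d for w, d in zip(row, delta)) + bias
        for row, bias in zip(head_weights, head_bias)
    ]
    target = normalize(displacement)
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def joint_loss(lm_loss, aux_loss, aux_weight=0.1):
    """Language modeling loss plus the weighted auxiliary loss (weight is assumed)."""
    return lm_loss + aux_weight * aux_loss


# Toy example: the object moves 12 pixels to the right between two frames.
proj_prev = [0.2, 0.5, -0.1]
proj_next = [0.7, 0.5, -0.1]
head_weights = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
head_bias = [0.0, 0.0]

aux = aux_motion_loss(proj_prev, proj_next, head_weights, head_bias, displacement=(12.0, 0.0))
total = joint_loss(lm_loss=1.5, aux_loss=aux)
print(f"auxiliary loss: {aux:.6f}, joint loss: {total:.6f}")
```

Normalizing the target means the head is supervised on direction only, not on speed, which matches the "normalized 2D motion vector" described above. In training, gradients from the auxiliary term would flow back into the projection layer, which is what encourages it to keep the direction signal.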

## Experimental Validation of DeltaDirect's Effectiveness

On the real-world video benchmark MoDirect-RealBench, DeltaDirect improves motion direction accuracy by 21.9 percentage points without using any real-world training data. It also matches or slightly exceeds the baseline on 8 spatial reasoning and general video question-answering benchmarks, which suggests that stronger motion direction perception does not come at the cost of overall understanding and may even support it. In addition, it achieves state-of-the-art results on the ScanNet streaming pose estimation task.

## Value of the Diagnosis-Driven Research Paradigm

DeltaDirect embodies a "diagnosis → repair" research paradigm: first locate the failure point (the direction binding gap) through systematic tracing with tools such as linear probing, then design a targeted fix. This approach avoids blind parameter tuning, because probing shows where information is present and where it is lost. The explicit auxiliary task also helps the model learn robust, transferable representations, which the study finds works better than pure end-to-end training.

## Current Limitations and Future Research Directions

DeltaDirect has two main limitations: it targets only 2D planar motion and does not cover motion along the 3D depth direction, and it focuses on single-object motion, so its applicability to multi-object scenes remains to be verified. Future work can explore 3D motion perception, extension to multi-object scenes, and applying the same methodology to diagnose and repair other temporal perception defects, such as event order and causal relationships.
