Section 01
DeltaDirect: Addressing the "Motion Direction Blindness" Problem in Video-LLMs
This article introduces DeltaDirect, a method that addresses a fundamental flaw in Video-LLMs (Video Large Language Models): their inability to perceive the direction of object motion, termed "motion direction blindness". The study finds that most Video-LLMs cannot reliably tell whether an object is moving left or right, or up or down. It traces the root cause to a "direction binding gap": the model implicitly encodes motion information but cannot map it to discrete language concepts. DeltaDirect closes this gap by adding an auxiliary objective at the projection layer that predicts a 2D motion vector from the feature differences between adjacent frames, which improves motion direction perception in real-world videos.
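To make the auxiliary objective more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The linear prediction head, the mean-squared-error loss, the `aux_weight` value, and every function name are assumptions chosen for illustration, and the sketch assumes that a reference 2D motion vector is available for each pair of adjacent frames.

```python
from typing import List, Sequence

Vector = List[float]


def frame_deltas(frames: Sequence[Vector]) -> List[Vector]:
    """Feature difference between each pair of adjacent frames."""
    return [
        [b - a for a, b in zip(prev, nxt)]
        for prev, nxt in zip(frames, frames[1:])
    ]


def predict_motion(delta: Vector, weights: Sequence[Vector]) -> Vector:
    """Hypothetical linear head: maps a feature delta to a (dx, dy) vector.

    `weights` has two rows, one per output coordinate.
    """
    return [sum(w * d for w, d in zip(row, delta)) for row in weights]


def motion_aux_loss(
    frames: Sequence[Vector],
    weights: Sequence[Vector],
    targets: Sequence[Vector],
) -> float:
    """Mean over frame pairs of the squared error between the predicted
    and reference 2D motion vectors (assumed loss, not from the source)."""
    deltas = frame_deltas(frames)
    if not deltas:
        raise ValueError("need at least two frames")
    if len(deltas) != len(targets):
        raise ValueError("need one target vector per adjacent frame pair")
    total = 0.0
    for delta, target in zip(deltas, targets):
        pred = predict_motion(delta, weights)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(deltas)


def combined_loss(lm_loss: float, aux_loss: float, aux_weight: float = 0.1) -> float:
    """Language-modelling loss plus the weighted auxiliary term.

    The default weight of 0.1 is an arbitrary placeholder.
    """
    return lm_loss + aux_weight * aux_loss
```

In a real model the per-frame features would be the projection layer's outputs and the head would be trained jointly with the language-modelling loss; the toy vectors here only show how the adjacent-frame differences feed the motion prediction.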