Section 01
Research Guide to Directional Motion Blindness in Video Large Language Models
This paper shows that video large language models (Video-LLMs) suffer from "directional motion blindness": they struggle to judge the direction of object motion, performing close to random guessing. The diagnosis traces the root cause to a "direction binding gap" in cross-modal alignment: directional information is present inside the model but is not mapped to the output vocabulary. The paper proposes the DeltaDirect method to repair this gap and constructs the MoDirect dataset for evaluation. Experiments show that DeltaDirect significantly improves direction-judgment accuracy without degrading the model's original video understanding performance.