
Directional Motion Blindness in Video Large Models: A Study on Diagnosis and Repair Methods

This paper reveals a systematic defect in how video large language models (Video-LLMs) perceive the direction of object motion, and proposes DeltaDirect, a method that repairs the defect by predicting normalized 2D motion vectors from feature differences between frames.

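Tags: video large language models, motion direction understanding, DeltaDirect, cross-modal alignment, video understanding, MoDirect dataset, direction binding gap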

Published 2026-05-22 01:59 | Recent activity 2026-05-22 13:22 | Estimated read 6 min


Section 01

Research Guide to Directional Motion Blindness in Video Large Models

This paper reveals that video large language models (Video-LLMs) suffer from "directional motion blindness": they struggle to judge the direction of object motion, with performance close to random guessing. The diagnosis traces the root cause to a "direction binding gap" in cross-modal alignment, meaning that direction information exists inside the model but cannot be mapped to the output vocabulary. The paper proposes the DeltaDirect method to repair this gap and constructs the MoDirect dataset for evaluation. Experiments show that the method significantly improves direction-judgment accuracy without degrading the original video understanding performance.

Section 02

Research Background: Basic Perceptual Blind Spots in Video Understanding

Video large language models have made significant progress in recent years on tasks such as video description and question answering, but they still have a basic perceptual defect, "directional motion blindness": their judgment of the motion direction of even simple objects is close to random. This defect limits their use in scenarios such as autonomous driving and motion analysis, and it shows that current models lack fundamental perceptual capabilities.

Section 03

Problem Diagnosis: Locating Breakpoints in Information Flow

Experiments on simple synthetic videos (a single object moving in one of four directions) found that mainstream Video-LLMs reach about 25% accuracy, which is chance level for four classes, and that the occasional higher score comes from prediction bias rather than real understanding. The information flow was then traced through three stages: the visual encoder, the projection layer, and the LLM. Motion direction information is present at every stage (it is linearly separable) but cannot be bound to the text output. This is the direction binding gap, and it locates the problem in a failure of cross-modal alignment.
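One way to tell genuine direction understanding apart from prediction bias is to report per-direction recall and the share of each predicted label next to overall accuracy. The sketch below is illustrative only: the label names, the `diagnose` helper, and the toy predictions are assumptions, not the paper's evaluation code.

```python
from collections import Counter

# Assumed label set for the four motion directions.
DIRECTIONS = ("up", "down", "left", "right")


def diagnose(labels, predictions):
    """Return overall accuracy, per-direction recall, and the share of each predicted label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equally long")
    total = Counter(labels)          # how many clips exist per true direction
    predicted = Counter(predictions)  # how often each direction was answered
    correct = Counter()
    for y, p in zip(labels, predictions):
        if y == p:
            correct[y] += 1
    accuracy = sum(correct.values()) / len(labels)
    recall = {d: (correct[d] / total[d] if total[d] else 0.0) for d in DIRECTIONS}
    share = {d: predicted[d] / len(predictions) for d in DIRECTIONS}
    return accuracy, recall, share


# Toy case: a model that answers "left" for every clip in a balanced test set.
labels = [d for d in DIRECTIONS for _ in range(5)]
predictions = ["left"] * len(labels)
accuracy, recall, share = diagnose(labels, predictions)
print(accuracy)  # 0.25, chance level for four balanced classes
print(recall)
print(share)
```

On a test set skewed toward "left", the same constant answer would score well above 25%, which is why recall per direction is the more informative number here.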

Section 04

MoDirect Dataset: A Tool for Evaluating Motion Understanding Capabilities

The MoDirect dataset family is constructed with two subsets: 1. MoDirect-SynBench (synthetic benchmark): programmatically generated, with controlled variables such as motion direction, object type, and background so that the influence of each factor can be isolated; 2. MoDirect-RealBench (real-world benchmark): derived from public resources, covering real motion scenarios such as vehicles and animals, to verify generalization ability.
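To make "programmatically generated" concrete, the following is a minimal sketch of a synthetic clip generator in the spirit of MoDirect-SynBench. It is not the dataset's actual generator: the function names, frame size, object shape, and speed are all hypothetical choices, and frames are plain nested lists so the example needs no third-party library.

```python
# Unit displacement per frame for each direction, in image coordinates (y grows downward).
STEP = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def make_clip(direction, num_frames=8, size=16, obj=2, speed=1):
    """Return a list of size x size binary frames with one square moving in `direction`."""
    dx, dy = STEP[direction]
    # Start at the center so the object stays in frame for short clips.
    x0 = y0 = (size - obj) // 2
    frames = []
    for t in range(num_frames):
        # Clamp the position so the square never leaves the image.
        x = min(max(x0 + dx * speed * t, 0), size - obj)
        y = min(max(y0 + dy * speed * t, 0), size - obj)
        frame = [[0] * size for _ in range(size)]
        for row in range(y, y + obj):
            for col in range(x, x + obj):
                frame[row][col] = 1
        frames.append(frame)
    return frames


def centroid(frame):
    """Return the (x, y) centroid of the nonzero pixels in a frame."""
    pts = [(c, r) for r, row in enumerate(frame) for c, v in enumerate(row) if v]
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)


clip = make_clip("right")
start, end = centroid(clip[0]), centroid(clip[-1])
# Net displacement of the object over the clip: positive x means rightward motion.
print(end[0] - start[0], end[1] - start[1])
```

Because the direction is the only thing that changes between calls, the ground-truth label is known exactly, which is the property a controlled benchmark relies on.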

Section 05

DeltaDirect: A Diagnosis-Driven Repair Scheme

DeltaDirect targets the projection layer. Its core mechanisms are: 1. Compute the difference (Delta) between the projection-layer features of adjacent frames; 2. Predict a normalized 2D motion vector from that difference (the vector's direction corresponds to the motion direction, and its magnitude reflects saliency). Training uses multi-task learning: the main task is video-text alignment, and the auxiliary task is motion vector prediction with an MSE loss, so that the original performance is not sacrificed.
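The two mechanisms and the multi-task objective can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's implementation: the linear prediction head, the way the target is normalized, and the auxiliary weight of 0.1 are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def frame_deltas(features):
    """Differences between projected features of adjacent frames: f[t+1] - f[t]."""
    return [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(features, features[1:])]


def predict_motion(delta, weight):
    """Hypothetical linear head: map a feature delta to a 2D vector (weight is 2 x D)."""
    return [sum(w * d for w, d in zip(row, delta)) for row in weight]


def normalize_displacement(dx, dy, max_disp):
    """Scale a pixel displacement so its magnitude lies in [0, 1] (assumed normalization)."""
    scale = max(max_disp, math.hypot(dx, dy), 1e-8)
    return [dx / scale, dy / scale]


def motion_mse(features, weight, target):
    """Mean squared error between predicted and target motion vectors over all frame pairs."""
    errors = []
    for delta in frame_deltas(features):
        pred = predict_motion(delta, weight)
        errors.extend((p - t) ** 2 for p, t in zip(pred, target))
    return sum(errors) / len(errors)


def total_loss(alignment_loss, aux_loss, aux_weight=0.1):
    """Multi-task objective: main video-text loss plus a weighted auxiliary term."""
    return alignment_loss + aux_weight * aux_loss


# Toy example: three frames of 3-dimensional projected features drifting along one axis.
features = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
weight = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]    # untrained head, values chosen by hand
target = normalize_displacement(4, 0, 4)       # rightward motion at full magnitude
aux = motion_mse(features, weight, target)
print(aux)
print(total_loss(2.0, aux))                    # 2.0 stands in for the alignment loss
```

Keeping the auxiliary term as a weighted addition to the main loss is what lets the alignment objective continue to dominate training, which is consistent with the stated goal of not sacrificing original performance.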

Section 06

Experimental Results: Significant Repair and Generalization Ability

1. Synthetic data (MoDirect-SynBench): accuracy increased from 25.9% to 85.4% and remained stable across different objects, backgrounds, and speeds; 2. Real-world scenarios (MoDirect-RealBench): accuracy improved by 21.9 percentage points even without training on real data; 3. Standard benchmarks (MSR-VTT, etc.): the original performance was maintained or slightly improved.

Section 07

In-depth Analysis: Why DeltaDirect Works

1. Concept vector analysis: the motion direction concept vectors in the DeltaDirect model are more stable, and their signal is not overwhelmed by noise in complex scenarios; 2. Attention pattern: the improved model attends more to moving objects and their trajectories, which strengthens its temporal modeling.
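The summary does not say how concept vectors are computed or how their stability is measured. One common construction, shown here purely as an assumption, takes the mean feature of clips with a given direction minus the mean over all clips, and compares the resulting vectors across conditions with cosine similarity. All names and numbers below are hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def concept_vector(features, labels, concept):
    """Mean feature of clips labeled `concept` minus the mean over all clips."""
    selected = [f for f, y in zip(features, labels) if y == concept]
    if not selected:
        raise ValueError(f"no clips labeled {concept!r}")
    overall = mean(features)
    return [a - b for a, b in zip(mean(selected), overall)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


labels = ["right", "right", "left", "left"]
# Toy 2D features from a plain background and from a noisier, cluttered background.
plain = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
cluttered = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]

v_plain = concept_vector(plain, labels, "right")
v_cluttered = concept_vector(cluttered, labels, "right")
# A similarity near 1 means the "right" direction is encoded the same way in both conditions.
print(cosine(v_plain, v_cluttered))
```

Under this reading, a "more stable" concept vector is one whose cosine similarity stays high as the scene gets more complex.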

Section 08

Research Insights and Future Exploration Directions

Insights: 1. Basic perceptual ability is a prerequisite for high-level understanding; 2. Explicit intermediate supervision is necessary for capabilities that do not emerge readily on their own; 3. The "diagnose first, then repair" methodology is effective. Future directions: comprehensive evaluation of blind spots in other perceptual dimensions, adaptive motion understanding, multi-modal motion integration, and hardware-friendly implementation.
