
DeltaDirect: Addressing the "Motion Direction Blindness" Problem in Video-LLMs

This article introduces DeltaDirect, a method that addresses a fundamental weakness of Video-LLMs: perceiving the direction of object motion. The study finds that most Video-LLMs cannot reliably tell whether an object moves left, right, up, or down. It proposes repairing this "direction binding gap" by training the projection layer to predict the 2D motion vector encoded in the feature difference between adjacent frames.

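Tags: Video-LLM, motion direction perception, DeltaDirect, video understanding, multimodal large models, direction binding gap, temporal reasoning, computer vision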

Published 2026-05-22 01:59 | Recent activity 2026-05-22 21:51 | Estimated read 6 min

Section 01

DeltaDirect: Addressing the "Motion Direction Blindness" Problem in Video-LLMs

This article introduces DeltaDirect, a method that targets a fundamental weakness of Video-LLMs (Video Large Language Models) in perceiving the direction of object motion, termed "directional motion blindness". The study finds that most Video-LLMs cannot reliably determine whether an object moves left, right, up, or down. The root cause is the "direction binding gap": the model implicitly encodes motion information but cannot map it to discrete language concepts. DeltaDirect repairs this gap by adding an auxiliary objective at the projection layer that predicts the 2D motion vector encoded in the feature difference between adjacent frames, which improves motion direction perception in real-world videos.

Section 02

Motion Direction Perception Defects of Video-LLMs and Their Root Causes

Video-LLMs have made significant progress on temporal tasks, but they suffer from "directional motion blindness": on simple object motion direction tests, their accuracy is close to random chance (25%), and scores slightly above chance mostly reflect prediction biases rather than true understanding. By tracing the information flow, the study finds that motion direction is linearly decodable from the visual encoder, the projection layer, and the LLM hidden states, yet the model cannot bind this information to language concepts such as "left" or "right". This failure is the "direction binding gap".
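The idea of "linearly decodable" direction information can be illustrated with a linear probe. The sketch below is not the study's code: the features are synthetic stand-ins for frozen model activations, and the names `make_synthetic_features`, `fit_linear_probe`, and `probe_accuracy`, as well as the ridge-regression formulation, are assumptions chosen for illustration.

```python
import numpy as np

DIRECTIONS = ["left", "right", "up", "down"]

def make_synthetic_features(n_samples, dim, rng, signal=2.0):
    """Hypothetical stand-in for frozen model features: Gaussian noise
    plus a direction-dependent offset (one random axis per direction)."""
    labels = rng.integers(0, len(DIRECTIONS), size=n_samples)
    class_axes = rng.normal(size=(len(DIRECTIONS), dim))
    feats = rng.normal(size=(n_samples, dim)) + signal * class_axes[labels]
    return feats, labels

def fit_linear_probe(feats, labels, ridge=1e-2):
    """Fit a linear probe in closed form: ridge regression onto one-hot labels."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])  # append a bias column
    y = np.eye(len(DIRECTIONS))[labels]                   # one-hot targets
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def probe_accuracy(weights, feats, labels):
    """Fraction of samples whose highest-scoring direction matches the label."""
    x = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return float(np.mean(np.argmax(x @ weights, axis=1) == labels))

rng = np.random.default_rng(0)
feats, labels = make_synthetic_features(2000, 32, rng)
train, test = slice(0, 1500), slice(1500, 2000)

weights = fit_linear_probe(feats[train], labels[train])
accuracy = probe_accuracy(weights, feats[test], labels[test])
chance = 1.0 / len(DIRECTIONS)  # four directions, so chance is 25%
print(f"probe accuracy: {accuracy:.2f} (chance: {chance:.2f})")
```

If a probe like this scores well above chance on a layer's features while the model's own answers stay near chance, the information is present but not bound to the language output, which is the pattern the study describes.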

Section 03

DeltaDirect: A Solution Using Auxiliary Objective Function in the Projection Layer

Because training on synthetic data generalizes poorly, DeltaDirect adds an auxiliary objective at the projection layer: explicitly predicting the normalized 2D motion vector encoded in the feature difference between adjacent frames. The core idea is to retain and strengthen the motion direction signal already present in the visual encoder. An auxiliary prediction head receives the difference between the projected features of adjacent frames and outputs a 2D motion vector. This head is optimized jointly with the language modeling objective to establish a robust direction perception mechanism.
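A minimal sketch of such an auxiliary objective follows, under stated assumptions. The article does not specify the loss form, the loss weighting, or how per-frame features are pooled, so the cosine-distance loss, the `aux_weight` value, the linear head, and the one-vector-per-frame input are all hypothetical choices, not the authors' implementation.

```python
import numpy as np

def normalize(v, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def motion_head(projected, head_w, head_b):
    """Predict a unit 2D motion vector for each pair of adjacent frames.

    projected: (T, D) projected features, one vector per frame
               (assumption: tokens are already pooled per frame).
    head_w, head_b: parameters of a hypothetical linear head, (D, 2) and (2,).
    """
    deltas = projected[1:] - projected[:-1]  # adjacent-frame feature differences
    return normalize(deltas @ head_w + head_b)

def auxiliary_loss(pred, target):
    """Mean cosine distance to the normalized ground-truth motion (assumed loss)."""
    return float(np.mean(1.0 - np.sum(pred * normalize(target), axis=-1)))

def total_loss(lm_loss, aux_loss, aux_weight=0.1):
    """Joint objective: language modeling loss plus the weighted auxiliary term."""
    return lm_loss + aux_weight * aux_loss

# Toy example: the first two feature dimensions carry the object's (x, y) position.
T, D = 4, 8
projected = np.zeros((T, D))
projected[:, 0] = np.arange(T)  # object moves right by one unit per frame

head_w = np.zeros((D, 2))
head_w[0, 0] = head_w[1, 1] = 1.0  # this toy head simply reads out (x, y)
head_b = np.zeros(2)

pred = motion_head(projected, head_w, head_b)
rightward = np.tile([1.0, 0.0], (T - 1, 1))
leftward = -rightward

aux_match = auxiliary_loss(pred, rightward)    # close to 0: direction agrees
aux_mismatch = auxiliary_loss(pred, leftward)  # close to 2: opposite direction
print(aux_match, aux_mismatch, total_loss(2.0, aux_match))
```

In actual training the head and projection layer would be learned by gradient descent together with the language model loss; the fixed toy weights here only show how the feature difference, the predicted vector, and the joint loss fit together.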

Section 04

Experimental Validation of DeltaDirect's Effectiveness

On the real-world video benchmark MoDirect-RealBench, DeltaDirect improves motion direction accuracy by 21.9 percentage points without using any real training data. It also performs comparably to or slightly better than the baseline on 8 spatial reasoning and general video question-answering benchmarks, which suggests that stronger motion direction perception does not come at the expense of overall understanding and may benefit it. In addition, it reaches state-of-the-art results on the ScanNet streaming pose estimation task.

Section 05

Value of the Diagnosis-Driven Research Paradigm

DeltaDirect embodies a "diagnosis → repair" research paradigm: first locate the failure point (the direction binding gap) through systematic tracing, for example with linear probing, then design a targeted fix. This avoids blind parameter tuning, since tools such as linear probing can locate information bottlenecks. The explicit auxiliary task also helps the model learn robust, transferable representations, which the article argues works better than pure end-to-end training.

Section 06

Current Limitations and Future Research Directions

DeltaDirect has two main limitations. It targets only 2D planar motion and does not cover motion along the 3D depth direction, and it focuses on single-object motion, so its applicability to multi-object scenes remains to be verified. Future work could explore 3D motion perception, extension to multi-object scenes, and applying the same diagnose-then-repair methodology to other temporal perception defects, such as event order and causal relationships.
