Minerva-Ego: Spatiotemporal Cues Empower a New Benchmark for First-Person Video Understanding

This article introduces the Minerva-Ego benchmark, which evaluates first-person video reasoning through multi-step multimodal questions paired with spatiotemporally dense, human-annotated reasoning trajectories. It finds that "when" (temporal) and "where" (spatial) cues significantly improve model performance.

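first-person video, embodied intelligence, spatiotemporal reasoning, video understanding, benchmark, visual question answering, multimodal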
Published 2026-05-15 03:12 | Recent activity 2026-05-18 11:24 | Estimated read 8 min

Section 01

Introduction: Overview of the Minerva-Ego Benchmark

Minerva-Ego is a new benchmark for first-person video understanding. It evaluates models' reasoning through multi-step multimodal questions paired with spatiotemporally dense, human-annotated reasoning trajectories. The core finding is that providing "when" (temporal localization) and "where" (spatial localization) cues significantly improves model performance, which suggests directions for model design and training in this field.

Section 02

Research Background: Challenges in First-Person Video Understanding

First-person videos are uniquely valuable in scenarios such as robot learning, assistive technology, action recognition, and augmented reality, but existing evaluation benchmarks have limitations:

  1. Output-oriented evaluation: They score only the final answer and ignore the intermediate reasoning process;
  2. Single-modal output: Answers carry no spatial or temporal localization information;
  3. Lack of fine-grained annotations: Without them, model failure modes are difficult to analyze.

Section 03

Minerva-Ego Benchmark Construction: Dataset and Annotations

Dataset Construction

  • High-quality first-person/embodied environment videos, ensuring scene diversity;
  • Multi-step reasoning questions that require integrating information from multiple points in space and time;
  • Manually annotated reasoning trajectories (key frames, spatial regions, intermediate steps, etc.).
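
The components listed above can be pictured as one record per question. The following is a minimal sketch, not the benchmark's actual schema: every class name, field name, and the example item are hypothetical and exist only to illustrate how a question, its answer, and a step-by-step trajectory with key frames and spatial regions might fit together.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema for illustration; the real benchmark format may differ.
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ReasoningStep:
    description: str               # intermediate step written by an annotator
    key_frame_times: List[float]   # key frames, in seconds from video start
    regions: List[Box] = field(default_factory=list)  # spatial regions used


@dataclass
class BenchmarkItem:
    video_id: str
    question: str
    answer: str
    trajectory: List[ReasoningStep] = field(default_factory=list)

    def time_span(self) -> Tuple[float, float]:
        """Earliest and latest key-frame time referenced by the trajectory."""
        times = [t for step in self.trajectory for t in step.key_frame_times]
        if not times:
            raise ValueError("trajectory has no key frames")
        return min(times), max(times)


# Made-up example item, not taken from the dataset.
item = BenchmarkItem(
    video_id="example_video",
    question="Where did the person put the cup after washing it?",
    answer="On the drying rack",
    trajectory=[
        ReasoningStep("The person washes the cup at the sink.", [12.0, 15.5]),
        ReasoningStep(
            "The person places the cup on the rack.",
            [21.0],
            regions=[(320.0, 180.0, 420.0, 260.0)],
        ),
    ],
)
print(item.time_span())  # (12.0, 21.0)
```

Keeping the trajectory as an ordered list of steps makes it possible to score intermediate reasoning separately from the final answer.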

Fine-grained Spatiotemporal Mask Annotations

  • Object-level annotations: Spatiotemporal ranges of key objects;
  • Fine-grained localization: Annotating "what", "where", and "when";
  • Reasoning dependency visualization: Clearly showing necessary visual information.
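
One way to think about an object-level annotation is as a "tube": a label ("what") plus a region per frame ("where" and "when"). The sketch below assumes bounding boxes instead of pixel masks to keep it short; `ObjectTube` and its methods are hypothetical names, not part of the benchmark.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class ObjectTube:
    """Hypothetical per-object annotation: boxes stand in for masks here."""
    label: str              # "what"
    boxes: Dict[int, Box]   # frame index ("when") -> box ("where")

    def temporal_range(self) -> Tuple[int, int]:
        """First and last annotated frame index."""
        if not self.boxes:
            raise ValueError("tube has no annotated frames")
        return min(self.boxes), max(self.boxes)

    def box_at(self, frame: int) -> Optional[Box]:
        """Box at a frame, or None if the object is not annotated there."""
        return self.boxes.get(frame)


tube = ObjectTube("cup", {10: (0, 0, 5, 5), 12: (1, 1, 6, 6)})
print(tube.temporal_range())  # (10, 12)
print(tube.box_at(11))        # None
```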

Section 04

Core Findings: Significant Effects of Spatiotemporal Cues

Value of "When" Cues

  • Reduces noise by focusing the model on key time periods;
  • Improves computational efficiency by prioritizing key frames;
  • Enhances temporal reasoning, establishing correct temporal relationships.
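
A "when" cue can be applied before the model sees anything, by keeping only the frames inside the cued time window. This is a minimal sketch under that assumption; `frames_in_window` is a hypothetical helper, and the article does not say how the cue is actually fed to the models.

```python
from typing import List, Tuple


def frames_in_window(timestamps: List[float],
                     window: Tuple[float, float]) -> List[int]:
    """Return indices of frames whose timestamp lies in [start, end]."""
    start, end = window
    if start > end:
        raise ValueError("window start must not exceed window end")
    return [i for i, t in enumerate(timestamps) if start <= t <= end]


# 20 frames sampled every 0.5 s (0.0 s to 9.5 s); the cue covers 2.0 s to 3.0 s.
timestamps = [i * 0.5 for i in range(20)]
kept = frames_in_window(timestamps, (2.0, 3.0))
print(kept)  # [4, 5, 6]
```

Here 3 of 20 frames survive, which illustrates both the noise-reduction and the efficiency points in the list above.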

Value of "Where" Cues

  • Focuses the model on relevant spatial regions;
  • Helps it understand relative positions and interactions between objects;
  • Helps it handle occlusion and localize moving objects.
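
Similarly, a "where" cue can be applied as a crop. The sketch below treats a frame as a nested list of pixel values and assumes the cue is a box with exclusive upper bounds; `crop_region` is a hypothetical helper, not the benchmark's method.

```python
from typing import List, Tuple

Frame = List[List[int]]  # rows of pixel values


def crop_region(frame: Frame, box: Tuple[int, int, int, int]) -> Frame:
    """Crop (x_min, y_min, x_max, y_max), clamped to the frame bounds.

    x_max and y_max are exclusive, matching Python slice semantics.
    """
    height = len(frame)
    width = len(frame[0]) if frame else 0
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [row[x0:x1] for row in frame[y0:y1]]


# 4x4 toy frame whose pixel value encodes its position: value = row * 4 + col.
frame = [[r * 4 + c for c in range(4)] for r in range(4)]
print(crop_region(frame, (1, 1, 3, 3)))    # [[5, 6], [9, 10]]
print(crop_region(frame, (2, 2, 10, 10)))  # [[10, 11], [14, 15]]
```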

Synergistic Effect

The performance improvement from providing both cues together is greater than the sum of the improvements from each cue alone, indicating that temporal and spatial information are interdependent.
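
The synergy claim can be written as an inequality. The notation is an assumption introduced here, not taken from the article: $A_0$ is accuracy with no cue, $A_t$ with the "when" cue only, $A_s$ with the "where" cue only, and $A_{ts}$ with both.

```latex
% Assumed notation: A_0 (no cue), A_t ("when" only),
% A_s ("where" only), A_{ts} (both cues).
\[
  \underbrace{A_{ts} - A_0}_{\text{gain from both cues}}
  \;>\;
  \underbrace{(A_t - A_0)}_{\text{gain from ``when''}}
  +
  \underbrace{(A_s - A_0)}_{\text{gain from ``where''}}
  \quad\Longleftrightarrow\quad
  A_{ts} + A_0 \;>\; A_t + A_s .
\]
```

The right-hand form is convenient for checking the claim against a results table: it needs only the four accuracies, with no intermediate differences.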

Section 05

Model Performance Gap: Comparison with Humans

Multi-step Reasoning Challenges

  • Difficulty in information integration: Models struggle to combine spatiotemporal information scattered across the video;
  • Weak causal reasoning: Models struggle to understand causal and temporal dependencies between actions;
  • Long-range dependency issues: Information coherence degrades as the time span increases.

Fine-grained Localization Limitations

  • Boundary ambiguity: Models have difficulty precisely localizing the spatiotemporal boundaries of objects;
  • Small object omission: Models tend to ignore small but key objects;
  • Dynamic tracking difficulty: Models have difficulty tracking the spatiotemporal trajectories of moving objects.
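
Boundary precision is commonly quantified with intersection over union (IoU). The article does not state which metric Minerva-Ego uses, so the following is only a sketch of temporal IoU between a predicted and an annotated time interval; `temporal_iou` is a hypothetical helper.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) in seconds, with start <= end


def temporal_iou(pred: Interval, gold: Interval) -> float:
    """Intersection over union of two time intervals, in [0.0, 1.0]."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


print(round(temporal_iou((2.0, 6.0), (4.0, 8.0)), 3))  # 0.333
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))            # 0.0
print(temporal_iou((1.0, 5.0), (1.0, 5.0)))            # 1.0
```

A prediction that overlaps the annotation but overshoots its boundaries is penalized in proportion to the overshoot, which is the failure described under "boundary ambiguity".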

Section 06

Application Scenarios and Training Insights

Agent Systems

  • Spatiotemporal cues help an agent focus on task-relevant regions, act at appropriate times, and adapt to dynamic environments.

Video QA Systems

  • Interactive cues: Users provide spatial cues via clicks or drags, and the system can ask for a time range, refining the localization over multiple rounds.

Model Training Strategies

  • Explicitly model spatiotemporal attention mechanisms;
  • Introduce spatiotemporal localization tasks in pre-training;
  • Design flexible architectures that can utilize external cues.
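
At the input level, "utilizing external cues" can be as simple as making both cues optional in the prompt. This is a sketch under that assumption, not the article's architecture: `build_prompt` and the prompt wording are hypothetical.

```python
from typing import Optional, Tuple


def build_prompt(question: str,
                 when: Optional[Tuple[float, float]] = None,
                 where: Optional[Tuple[int, int, int, int]] = None) -> str:
    """Compose a text prompt; each cue is added only if it is provided."""
    parts = [f"Question: {question}"]
    if when is not None:
        parts.append(f"Relevant time range: {when[0]:.1f}s to {when[1]:.1f}s")
    if where is not None:
        x0, y0, x1, y1 = where
        parts.append(f"Relevant region (pixels): x {x0}-{x1}, y {y0}-{y1}")
    return "\n".join(parts)


print(build_prompt("What did the person pick up?",
                   when=(12.0, 15.5),
                   where=(320, 180, 420, 260)))
```

Because both cues default to `None`, the same function covers the four evaluation conditions: no cue, "when" only, "where" only, and both.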

Section 07

Dataset Characteristics and Future Directions

Dataset Characteristics

  • Scale and diversity: Covers various daily scenarios;
  • Difficulty levels: Supports progressive evaluation;
  • Multimodal output: Text answers, spatiotemporal masks, reasoning trajectories;
  • Open-source availability: Accessible on GitHub.

Limitations and Future Directions

  • Limitations: Limited scene coverage (mainly daily life, few professional domains), high annotation cost, and cues that are not yet generated automatically;
  • Future: Automatic cue generation, expansion to professional domains and long videos, integration of audio information, and real-time reasoning over video streams.

Section 08

Conclusion: Significance of Minerva-Ego

Minerva-Ego provides a comprehensive evaluation framework for first-person video understanding that assesses not only final answers but also the quality of the reasoning process. Its core finding, that spatiotemporal cues improve performance, points the way for model design, and the benchmark is positioned to serve as infrastructure for progress in embodied intelligence and first-person applications.