# Minerva-Ego: Spatiotemporal Cues Empower a New Benchmark for First-Person Video Understanding

> This article introduces the Minerva-Ego benchmark, which evaluates first-person video reasoning through multi-step multimodal questions and human reasoning trajectories that are densely grounded in space and time. It finds that "when" (temporal) and "where" (spatial) cues significantly improve model performance.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-14T19:12:20.000Z
- Last activity: 2026-05-18T03:24:07.875Z
- Popularity: 86.0
- Keywords: first-person video, embodied intelligence, spatiotemporal reasoning, video understanding, benchmark, visual question answering, multimodal
- Page link: https://www.zingnex.cn/en/forum/thread/minerva-ego
- Canonical: https://www.zingnex.cn/forum/thread/minerva-ego
- Markdown source: floors_fallback

---

## Introduction: Overview of the Minerva-Ego Benchmark

Minerva-Ego is a new benchmark for first-person video understanding. It evaluates models' reasoning through multi-step multimodal questions paired with human reasoning trajectories that are densely grounded in space and time. The core finding is that supplying "when" (temporal localization) and "where" (spatial localization) cues significantly improves model performance, which suggests directions for model design and training in this field.

## Research Background: Challenges in First-Person Video Understanding

First-person videos are valuable in scenarios such as robot learning, assistive technology, action recognition, and augmented reality, but existing evaluation benchmarks have three limitations:
1. **Output-oriented evaluation**: They score only the final answer and ignore the intermediate reasoning process;
2. **Single-modal output**: Answers carry no spatial or temporal localization information;
3. **Lack of fine-grained annotations**: Without them, model failure modes are difficult to analyze.

## Minerva-Ego Benchmark Construction: Dataset and Annotations

### Dataset Construction
- High-quality first-person and embodied-environment videos, selected for scene diversity;
- Multi-step reasoning questions that require integrating information from multiple places and moments in a video;
- Manually annotated reasoning trajectories (key frames, spatial regions, intermediate steps, etc.).
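
To make the annotation structure concrete, here is a minimal sketch of how one benchmark item with a reasoning trajectory could be represented. The class names, field names, and the sample question are assumptions made for illustration; the source does not describe the benchmark's actual file format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# Hypothetical schema: names are illustrative, not the benchmark's real format.
@dataclass
class ReasoningStep:
    text: str                                  # intermediate step in natural language
    time_span: Tuple[float, float]             # (start_s, end_s) of the supporting evidence
    region: Tuple[float, float, float, float]  # normalized (x1, y1, x2, y2) box


@dataclass
class BenchmarkItem:
    video_id: str
    question: str
    answer: str
    steps: List[ReasoningStep] = field(default_factory=list)

    def evidence_window(self) -> Tuple[float, float]:
        """Smallest time window that covers every annotated step."""
        if not self.steps:
            raise ValueError("item has no annotated steps")
        starts = [s.time_span[0] for s in self.steps]
        ends = [s.time_span[1] for s in self.steps]
        return (min(starts), max(ends))


# Made-up example item, for illustration only.
item = BenchmarkItem(
    video_id="example_video",
    question="Where did the wearer put the keys after opening the door?",
    answer="On the kitchen counter",
    steps=[
        ReasoningStep("The wearer opens the door", (12.0, 15.5), (0.40, 0.20, 0.80, 0.90)),
        ReasoningStep("The keys are placed on the counter", (31.0, 33.0), (0.10, 0.55, 0.35, 0.75)),
    ],
)
print(item.evidence_window())
```

Keeping each step's time span and region next to its text is what lets an evaluator check the reasoning process and not only the final answer.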

### Fine-grained Spatiotemporal Mask Annotations
- Object-level annotations: the spatiotemporal extent of each key object;
- Fine-grained localization: annotating "what", "where", and "when";
- Reasoning dependency visualization: showing which visual information an answer depends on.
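
One common way to compare a predicted spatiotemporal localization against an annotated one is a volume IoU over per-frame regions. The sketch below simplifies masks to axis-aligned boxes keyed by frame index; this simplification and the function names are assumptions, and the source does not state which metric the benchmark uses.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def box_intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def tube_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Volume IoU of two box tubes: areas are summed over all frames."""
    inter = 0.0
    union = 0.0
    for frame in set(pred) | set(gt):
        if frame in pred and frame in gt:
            i = box_intersection(pred[frame], gt[frame])
            inter += i
            union += box_area(pred[frame]) + box_area(gt[frame]) - i
        else:
            # A frame covered by only one tube adds to the union only.
            box = pred[frame] if frame in pred else gt[frame]
            union += box_area(box)
    return inter / union if union > 0 else 0.0
```

A prediction that finds the right region in the wrong frames scores poorly here, which matches the idea that "where" and "when" have to be right together.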

## Core Findings: Significant Effects of Spatiotemporal Cues

### Value of "When" Cues
- Reduces noise by focusing the model on the key time periods;
- Improves computational efficiency by prioritizing key frames;
- Strengthens temporal reasoning by helping the model establish the correct order of events.
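
A "when" cue can be applied as early as frame sampling. The following sketch spends a fixed frame budget either on the whole video or only on a cued time window; the function name and parameters are assumptions for illustration, not part of the benchmark.

```python
from typing import List, Optional, Tuple


def sample_frame_indices(
    num_frames: int,
    fps: float,
    budget: int,
    window: Optional[Tuple[float, float]] = None,
) -> List[int]:
    """Pick up to `budget` evenly spaced frame indices.

    If `window` (start_s, end_s) is given, sampling is restricted to it.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    first, last = 0, num_frames - 1
    if window is not None:
        start_s, end_s = window
        first = max(0, int(start_s * fps))
        last = min(num_frames - 1, int(end_s * fps))
        if first > last:
            return []
    span = last - first + 1
    if budget >= span:
        return list(range(first, last + 1))
    if budget == 1:
        return [(first + last) // 2]
    step = (last - first) / (budget - 1)
    return [first + round(i * step) for i in range(budget)]


# Same budget, with and without a temporal cue (10 s clip at 30 fps).
print(sample_frame_indices(300, 30.0, 4))
print(sample_frame_indices(300, 30.0, 4, window=(4.0, 6.0)))
```

With the cue, the same four frames cover two seconds of video and not ten, so the sampled frames are denser around the relevant event.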

### Value of "Where" Cues
- Focuses the model on relevant spatial regions;
- Helps the model understand relative positions and interactions between objects;
- Helps with occlusion and with targets that move across frames.
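
One simple way to feed a "where" cue to a model is to crop each frame around the cued region while keeping some surrounding context, so that nearby objects and interactions stay visible. The helper below is a hypothetical sketch that only computes the crop rectangle; the padding ratio is an arbitrary illustrative default.

```python
import math
from typing import Tuple


def padded_crop_box(
    box: Tuple[float, float, float, float],
    frame_w: int,
    frame_h: int,
    pad: float = 0.25,
) -> Tuple[int, int, int, int]:
    """Expand a pixel box (x1, y1, x2, y2) by `pad` of its size per side, clamped to the frame."""
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box must have positive width and height")
    dx = (x2 - x1) * pad
    dy = (y2 - y1) * pad
    return (
        max(0, math.floor(x1 - dx)),
        max(0, math.floor(y1 - dy)),
        min(frame_w, math.ceil(x2 + dx)),
        min(frame_h, math.ceil(y2 + dy)),
    )
```

The returned rectangle can be passed to any image library's crop routine. Padding is a design choice: a tight crop removes distractors, but a margin preserves the relative-position information the cue is meant to expose.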

### Synergistic Effect
Providing temporal and spatial cues together yields a larger improvement than the sum of the improvements from each cue alone, which suggests that the two kinds of information are interdependent.
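
The synergy claim can be stated as an inequality. The notation is an assumption introduced here: $\mathrm{Acc}_{\varnothing}$ is accuracy without cues, $\mathrm{Acc}_{t}$ and $\mathrm{Acc}_{s}$ are accuracy with only the temporal or only the spatial cue, and $\mathrm{Acc}_{t,s}$ is accuracy with both.

```latex
\Delta_{t} = \mathrm{Acc}_{t} - \mathrm{Acc}_{\varnothing}, \qquad
\Delta_{s} = \mathrm{Acc}_{s} - \mathrm{Acc}_{\varnothing}, \qquad
\Delta_{t,s} = \mathrm{Acc}_{t,s} - \mathrm{Acc}_{\varnothing}

\Delta_{t,s} > \Delta_{t} + \Delta_{s}
\iff
\mathrm{Acc}_{t,s} + \mathrm{Acc}_{\varnothing} > \mathrm{Acc}_{t} + \mathrm{Acc}_{s}
```

The right-hand form is convenient in practice: it needs only the four measured accuracies and no separately computed gains.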

## Model Performance Gap: Comparison with Humans

### Multi-step Reasoning Challenges
- Difficulty in information integration: models struggle to combine evidence scattered across space and time;
- Weak causal reasoning: models have trouble understanding causal and temporal dependencies between actions;
- Long-range dependency issues: coherence degrades as the time span increases.

### Fine-grained Localization Limitations
- Boundary ambiguity: models have difficulty precisely localizing the spatiotemporal boundaries of objects;
- Small object omission: models tend to ignore small but key objects;
- Dynamic tracking difficulty: models struggle to track the spatiotemporal trajectories of moving objects.
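
Boundary errors along the time axis are often quantified with temporal IoU between a predicted and an annotated interval. This is a generic sketch of that measure; the source does not say that the benchmark reports it.

```python
from typing import Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two closed time intervals given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

A prediction that overlaps the right event but starts or ends too late is penalized in proportion to the mismatch, which makes the boundary ambiguity described above measurable.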

## Application Scenarios and Training Insights

### Agent Systems
- Spatiotemporal cues help an agent focus on task-relevant regions, act at the appropriate time, and adapt to dynamic scenes.

### Video QA Systems
- Interactive cues: users provide spatial cues via clicks or drags, the system asks for a time range, and localization is refined over multiple rounds.

### Model Training Strategies
- Explicitly model spatiotemporal attention mechanisms;
- Introduce spatiotemporal localization tasks in pre-training;
- Design flexible architectures that can utilize external cues.
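
At the input level, "utilizing external cues" can be as simple as an interface where cues are optional. The sketch below builds a text prompt that includes "when" and "where" hints only if they are supplied; the prompt wording and function name are assumptions, not something the benchmark or any particular model prescribes.

```python
from typing import Optional, Tuple


def build_prompt(
    question: str,
    when: Optional[Tuple[float, float]] = None,
    where: Optional[Tuple[float, float, float, float]] = None,
) -> str:
    """Assemble a question prompt, adding cue lines only when cues are available."""
    lines = [f"Question: {question}"]
    if when is not None:
        lines.append(f"Relevant time span: {when[0]:.1f}s to {when[1]:.1f}s")
    if where is not None:
        x1, y1, x2, y2 = where
        lines.append(
            f"Relevant region (normalized x1, y1, x2, y2): {x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}"
        )
    lines.append("Answer:")
    return "\n".join(lines)


print(build_prompt("What is on the table?", when=(12.0, 15.5)))
```

Because both cues default to `None`, the same code path serves the no-cue, single-cue, and both-cue conditions, which keeps the comparisons between them consistent.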

## Dataset Characteristics and Future Directions

### Dataset Characteristics
- Scale and diversity: Covers various daily scenarios;
- Difficulty levels: Supports progressive evaluation;
- Multimodal output: Text answers, spatiotemporal masks, reasoning trajectories;
- Open-source availability: Accessible on GitHub.
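
Progressive evaluation across difficulty levels amounts to reporting accuracy per level and not a single aggregate. A minimal sketch follows, with made-up level names and results used purely as placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_level(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (level, is_correct) pairs and return accuracy for each level."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for level, is_correct in results:
        totals[level] += 1
        correct[level] += int(is_correct)
    return {level: correct[level] / totals[level] for level in totals}


# Placeholder data, not benchmark results.
demo = [("easy", True), ("easy", True), ("hard", False), ("hard", True)]
print(accuracy_by_level(demo))
```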

### Limitations and Future Directions
- Limitations: scene coverage is mostly everyday settings with few professional domains, annotation is costly, and cues are not yet generated automatically;
- Future: automatic cue generation, expansion to professional domains and long videos, integration of audio information, and real-time reasoning over video streams.

## Conclusion: Significance of Minerva-Ego

Minerva-Ego provides a comprehensive evaluation framework for first-person video understanding that assesses the quality of the reasoning process as well as the final answer. Its core finding, that spatiotemporal cues improve performance, points to a direction for model design, and the benchmark is positioned to serve as infrastructure for progress in embodied intelligence and first-person applications.
