Section 01
Introduction: Overview of the Minerva-Ego Benchmark
Minerva-Ego is a new benchmark for first-person video understanding. It evaluates models' reasoning capabilities through multi-step multimodal questions paired with dense spatiotemporal human reasoning trajectories. The core finding is that providing "when" (temporal localization) and "where" (spatial localization) cues significantly improves model performance, which points to important directions for model design and training in this field.