Zing Forum

Farewell to Reference Answer Dependence: QEVA Proposes a New Paradigm for Reference-Free Video Summarization Evaluation

QEVA directly evaluates video summaries via multimodal question answering without relying on manual reference answers, achieving more accurate assessment across three dimensions—coverage, factuality, and temporal order—and releases the MLVU(VS)-Eval benchmark dataset.

Tags: Video Summarization · Reference-Free Evaluation · Multimodal QA · QEVA · Video Understanding · Large Language Models · ML Evaluation
Published 2026-04-27 13:18 · Recent activity 2026-04-28 11:47 · Estimated read: 5 min

Section 01

Introduction: QEVA—A New Paradigm for Reference-Free Video Summarization Evaluation

Traditional video summarization evaluation relies on manual reference answers, which is costly to produce and captures semantics poorly. QEVA proposes a new reference-free evaluation paradigm that assesses summary quality along three dimensions (coverage, factuality, and temporal order) via multimodal question answering, and releases the MLVU(VS)-Eval benchmark dataset. Experiments show its scores correlate strongly with human judgments.


Section 02

Background: Dilemmas of Traditional Video Summarization Evaluation

With the explosive growth of video content, automatic video summarization has become crucial, but its evaluation methods are flawed: traditional n-gram overlap metrics (ROUGE, BLEU) rely on manual reference answers, which are costly to produce and struggle to capture semantic differences; recent LLM-based evaluation methods still depend on reference answers, limiting their practicality and semantic sensitivity.


Section 03

Core Innovation: QEVA's Reference-Free Evaluation Framework

The core insight of QEVA (Question-based Evaluation for Video Summarization with Multimodal Answering) is that a good summary should be able to answer key questions about the original video. It evaluates along three dimensions:

  • Coverage: Whether the summary covers important information in the video
  • Factuality: Whether the summary content is consistent with the video facts
  • Temporal Order: Whether the summary accurately reflects the chronological order of events
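The three dimensions above can be pictured as per-summary scores that roll up into one quality number. This is a hypothetical sketch: the dimension names come from the article, but the 0-to-1 scaling, the weights, and the averaging scheme are illustrative assumptions, not QEVA's published formula.

```python
# Hypothetical sketch of aggregating QEVA's three dimension scores.
# Dimension names are from the article; weighting is an assumption.
from dataclasses import dataclass

@dataclass
class QEVAScore:
    coverage: float    # share of important video content the summary covers
    factuality: float  # share of summary claims consistent with the video
    temporal: float    # share of event orderings reproduced correctly

    def overall(self, weights=(1.0, 1.0, 1.0)) -> float:
        """Weighted mean of the three dimensions (equal weights by default)."""
        w_cov, w_fact, w_temp = weights
        total = w_cov + w_fact + w_temp
        return (w_cov * self.coverage
                + w_fact * self.factuality
                + w_temp * self.temporal) / total

score = QEVAScore(coverage=0.8, factuality=0.9, temporal=0.7)
print(round(score.overall(), 3))  # equal-weight mean → 0.8
```

Keeping the dimensions separate until the final roll-up is what lets an evaluator report *why* a summary scored poorly (e.g. factual but incomplete), not just that it did.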

Section 04

Technical Details: Implementation Steps of Multimodal Question Answering

QEVA evaluation process:

  1. Extract visual features of key video frames and candidate summary text
  2. Generate multimodal questions for the video content (requiring simultaneous understanding of images and text)
  3. Use a multimodal QA model to answer the questions based on the original video and the candidate summary respectively
  4. Compare the consistency of answers to assess summary quality: high consistency indicates good quality, while inconsistency suggests information gaps or errors.
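Step 4 above can be sketched as a simple answer-agreement score. This is illustrative only: the real QEVA pipeline uses a multimodal QA model over video frames and text, whereas here `answer_from_video` and `answer_from_summary` are stand-in callables and exact string matching replaces whatever answer-matching QEVA actually performs.

```python
# Illustrative sketch of step 4: score a summary by how often answers
# derived from it agree with answers derived from the original video.
# answer_fn callables are stand-ins for a multimodal QA model.
from typing import Callable, List

def consistency_score(questions: List[str],
                      answer_from_video: Callable[[str], str],
                      answer_from_summary: Callable[[str], str]) -> float:
    """Fraction of questions where the summary-based answer matches
    the video-based answer (a proxy for summary quality)."""
    if not questions:
        return 0.0
    agree = sum(
        answer_from_video(q).strip().lower() == answer_from_summary(q).strip().lower()
        for q in questions
    )
    return agree / len(questions)

# Toy example with dictionary-backed "models":
video_qa = {"who opens the door?": "the chef", "what happens last?": "plating"}
summary_qa = {"who opens the door?": "the chef", "what happens last?": "cooking"}
print(consistency_score(list(video_qa), video_qa.get, summary_qa.get))  # 1 of 2 agree → 0.5
```

In the toy run, the summary gets the actor right but the final event wrong, so the score drops to 0.5, mirroring how inconsistent answers flag information gaps or errors.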

Section 05

Evidence Support: MLVU(VS)-Eval Benchmark and Experimental Results

The research team released the MLVU(VS)-Eval benchmark dataset: built on the MLVU video understanding dataset, it contains 200 videos and 800 summaries generated by advanced models, providing a transparent and consistent QA annotation framework. In experiments, QEVA significantly outperformed existing methods in metrics such as Kendall's τ_b, τ_c, and Spearman's ρ, and had higher correlation with human judgments.
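The correlation metrics reported here measure how well a metric's ranking of summaries agrees with human rankings. As a minimal illustration, this is a pure-Python Spearman's ρ assuming no tied scores; in practice (and for Kendall's τ_b/τ_c, which handle ties) one would use `scipy.stats.spearmanr` and `scipy.stats.kendalltau`.

```python
# Minimal pure-Python Spearman's rho between metric scores and human
# ratings. Assumes no ties; use scipy.stats for the general case.
def spearman_rho(metric_scores, human_scores):
    """Spearman rank correlation: 1.0 means the metric ranks summaries
    exactly as humans do, -1.0 means the reverse ordering."""
    n = len(metric_scores)

    def ranks(xs):
        order = sorted(range(n), key=lambda i: xs[i])
        r = [0] * n
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(metric_scores), ranks(human_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A metric that orders four summaries exactly as humans do scores 1.0:
print(spearman_rho([0.2, 0.5, 0.9, 0.7], [1, 2, 4, 3]))  # → 1.0
```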


Section 06

Industry Impact: Significance of QEVA for the Video Summarization Field

QEVA reduces evaluation costs (no manual reference answers needed), enabling large-scale evaluation; improves fairness (a unified QA system avoids annotator bias); and supports practical deployment (its reference-free design can be applied directly in production, providing reliable metrics for quality monitoring and model iteration).


Section 07

Limitations and Outlook: Future Research Directions

QEVA's limitations: evaluation accuracy is bounded by the capabilities of the underlying multimodal QA model, whose deep reasoning on complex videos still needs improvement; and it does not cover language-level qualities such as summary fluency and readability. Future research could incorporate these dimensions to build a more comprehensive evaluation system and advance the video summarization field.