MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

1 Shanghai AI Laboratory 2 Shanghai Jiao Tong University 3 The Chinese University of Hong Kong 4 Tongji University

* Equal contribution. Corresponding authors.
§ Work done during an internship at Shanghai AI Laboratory.

🚀 MMBench-Video: A quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding.
🚀 Contains long-form, diverse videos sourced from the web, encompassing a broad spectrum of topics.
🚀 Includes original, high-quality visual questions crafted by volunteers, spanning dozens of fine-grained capabilities.
🚀 Enhanced GPT-4-based evaluation paradigm; comprehensive assessment of various LVLMs and Video-LLMs.


🔥What's New
  • [2024.06.26] MMBench-Video has been integrated into VLMEvalKit!
  • [2024.06.26] The Project Page is released!
  • [2024.06.21] Our paper has been featured in 🤗HuggingFace Daily Papers, ranking 5th out of 24 daily papers.
  • [2024.06.20] The Paper is released!
  • [2024.06.12] The MMBench-Video Dataset is released!

Abstract

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit.

Comparison of MMBench-Video and Other VideoQA Benchmarks

The current evaluation of Video-LLMs is characterized by the following limitations:

1. Short Videos:

Existing VideoQA datasets primarily consist of short videos, typically lasting less than a minute. In contrast, most web video content spans several minutes or longer, creating a discrepancy between evaluation benchmarks and real-world application scenarios.

2. Limited Capabilities:

Current VideoQA benchmarks cover only a few basic video tasks, such as concept existence, object relationship recognition, and activity recognition, leaving many finer-grained perception and reasoning capabilities unassessed.

3. Biased Evaluation:

Existing evaluation paradigms employ GPT-3.5 to score open-ended answers generated by video-language models. Our preliminary study indicates that GPT-3.5-based evaluation is less accurate and diverges significantly from human preferences, diminishing the credibility of the evaluation results.
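
To make the LLM-based grading step concrete, here is a minimal sketch of scoring a free-form model answer against the reference answer with a judge model. The prompt wording, the 0-3 integer rating scale, and the `gpt-4-turbo` model name are illustrative assumptions, not the exact protocol used by MMBench-Video.

```python
# A minimal sketch of LLM-based answer grading. The prompt, the 0-3 scale,
# and the judge model name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADING_PROMPT = """You are grading an answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate how well the model answer matches the reference on an integer scale
from 0 (completely wrong) to 3 (fully correct). Reply with the number only."""


def grade_answer(question: str, reference: str, prediction: str) -> int:
    """Ask the judge model for a score and parse the first digit it returns."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # judge model (assumed)
        temperature=0,        # deterministic grading
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    text = response.choices[0].message.content.strip()
    digits = [c for c in text if c.isdigit()]
    return int(digits[0]) if digits else 0


if __name__ == "__main__":
    score = grade_answer(
        "What does the presenter do after opening the box?",
        "She takes out a camera and shows it to the audience.",
        "The presenter pulls a camera out of the box.",
    )
    print(score)
```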

| Benchmark | QA Pair Generation | # Capabilities | Question Length, words, mean(std) | Answer Length, words, mean(std) | Video Duration, sec, mean(std) | Shot Number, mean(std) |
|---|---|---|---|---|---|---|
| MSVD-QA | Automatic | 2 | 6.6(2.5) | 1.0(0.0) | 9.8(6.6) | 2.4(3.4) |
| MSRVTT-QA | Automatic | 2 | 7.4(3.4) | 1.0(0.0) | 15.1(5.2) | 3.4(2.9) |
| TGIF-QA | Automatic/Human | 4 | 9.7(2.3) | 1.5(0.9) | 3.7(2.0) | 1.2(1.4) |
| ActivityNet-QA | Human | 3 | 8.9(2.4) | 1.3(0.7) | 111.5(66.1) | 12.9(20.9) |
| MMBench-Video | Human | 26 | 10.9(4.1) | 8.4(7.7) | 165.4(80.7) | 32.6(33.5) |

MMBench-Video: a long-form, multi-shot VideoQA benchmark with diverse video categories: (a) The dataset covers a broad spectrum of categories, including science, sports, finance, games, news, etc. (b) The dataset includes videos ranging from 30 seconds to 6 minutes in length. (c) The videos in our benchmark display a long-tail distribution in shot numbers, with a maximum of 210 shots.
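
For reference, per-video shot counts like those reported above can be estimated with an off-the-shelf shot-boundary detector. The sketch below uses PySceneDetect's content-based detector; the library choice and threshold are assumptions for illustration and may differ from the tooling actually used to build the benchmark.

```python
# A sketch of estimating per-video shot counts with PySceneDetect
# (pip install "scenedetect[opencv]"). The detector and threshold are
# illustrative assumptions, not necessarily the benchmark's own tooling.
from scenedetect import detect, ContentDetector


def count_shots(video_path: str, threshold: float = 27.0) -> int:
    """Return the number of detected shots (scenes) in a video file."""
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    # detect() returns one (start, end) timecode pair per shot; a video
    # with no detected cuts yields an empty list, i.e. a single shot.
    return max(len(scenes), 1)


if __name__ == "__main__":
    print(count_shots("example_video.mp4"))
```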


Capability Taxonomy of MMBench-Video

A 3-level hierarchical capability taxonomy of MMBench-Video


The top level encompasses two broad capabilities: Perception and Reasoning. Besides the six L-2 capabilities inherited from MMBench, we further introduce three additional L-2 capabilities specific to MMBench-Video: Hallucination, Commonsense Reasoning, and Temporal Reasoning. Hallucination assesses whether a model is prone to generating content that includes misleading or inaccurate information. Commonsense Reasoning evaluates a model's ability to integrate necessary commonsense knowledge into its reasoning processes. Temporal Reasoning examines a model's proficiency in understanding the relationships between events unfolding at different points in a video. This taxonomy comprises a total of 26 leaf capabilities, which collectively address a comprehensive spectrum of cognitive processes involved in video comprehension.
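
As an illustration, the two top-level capabilities and the nine L-2 dimensions used in the leaderboard below can be organized as a simple nested mapping. Only the L-1 and L-2 names listed on this page are used; the 26 leaf capabilities are left as placeholders since they are not enumerated here.

```python
# A sketch of the 3-level capability hierarchy as a nested mapping.
# The top two levels follow the names used on this page; the 26 leaf
# capabilities are omitted and shown only as placeholders.
CAPABILITY_TAXONOMY = {
    "Perception": {
        "Coarse Perception (CP)": ["..."],
        "Single-Instance Fine-grained Perception (FP-S)": ["..."],
        "Cross-Instance Fine-grained Perception (FP-C)": ["..."],
        "Hallucination (HL)": ["..."],
    },
    "Reasoning": {
        "Logical Reasoning (LR)": ["..."],
        "Attribute Reasoning (AR)": ["..."],
        "Relation Reasoning (RR)": ["..."],
        "Commonsense Reasoning (CSR)": ["..."],
        "Temporal Reasoning (TR)": ["..."],
    },
}
```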


MMBench-Video Leaderboard

Perception: CP (coarse perception), FP-S (single-instance fine-grained perception), FP-C (cross-instance fine-grained perception), HL (hallucination)

Reasoning: LR (logical reasoning), AR (attribute reasoning), RR (relation reasoning), CSR (commonsense reasoning), TR (temporal reasoning)

The best results are highlighted in bold and underlined.
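
Assuming each leaderboard entry is the mean judge score over the questions belonging to that dimension, and the overall score is the mean over all questions, a minimal aggregation sketch is shown below. The flat (question, dimension, score) record layout and this weighting are assumptions for illustration, not the exact bookkeeping behind the reported numbers.

```python
# A sketch of aggregating per-question judge scores into per-dimension means
# and an overall mean. The record layout and weighting are assumptions.
import pandas as pd

records = pd.DataFrame(
    [
        # question_id, L-2 dimension, judge score for one model (toy values)
        ("q1", "CP", 2), ("q2", "FP-S", 1), ("q3", "TR", 3), ("q4", "HL", 0),
    ],
    columns=["question_id", "dimension", "score"],
)

per_dimension = records.groupby("dimension")["score"].mean()
overall = records["score"].mean()  # mean over all questions, not over dimensions

print(per_dimension)
print(f"Overall: {overall:.2f}")
```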

| Model | CP | FP-S | FP-C | HL | Perception Mean | LR | AR | RR | CSR | TR | Reasoning Mean | Overall Mean | Model Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-[8f] 🥇 | 1.82 | 1.59 | 1.43 | 1.95 | 1.63 | 1.33 | 1.89 | 1.60 | 1.60 | 1.44 | 1.57 | 1.62 | Proprietary LVLM for Images |
| GPT-4v-[8f] 🥈 | 1.68 | 1.45 | 1.43 | 1.79 | 1.51 | 1.14 | 1.81 | 1.70 | 1.59 | 1.39 | 1.52 | 1.53 | Proprietary LVLM for Images |
| Gemini-Pro-v1.0-[8f] 🥉 | 1.72 | 1.50 | 1.28 | 0.79 | 1.49 | 1.02 | 1.66 | 1.58 | 1.59 | 1.40 | 1.45 | 1.49 | Proprietary LVLM for Images |
| Gemini-Pro-v1.5-[8f] | 1.51 | 1.30 | 0.98 | 2.03 | 1.32 | 1.06 | 1.62 | 1.36 | 1.25 | 0.94 | 1.22 | 1.30 | Proprietary LVLM for Images |
| InternVL-Chat-v1.5-[8f] | 1.51 | 1.22 | 1.01 | 1.21 | 1.25 | 0.88 | 1.40 | 1.48 | 1.28 | 1.09 | 1.22 | 1.26 | Open-Source LVLM for Images |
| Claude-3v-Opus-[4f] | 1.37 | 1.11 | 1.00 | 1.56 | 1.16 | 1.12 | 1.35 | 1.36 | 1.17 | 1.05 | 1.20 | 1.19 | Proprietary LVLM for Images |
| mPLUG-Owl2-[8f] | 1.34 | 1.18 | 0.99 | 0.27 | 1.15 | 0.63 | 1.33 | 1.30 | 1.03 | 1.11 | 1.11 | 1.15 | Open-Source LVLM for Images |
| VideoStreaming-[64f+] | 1.38 | 1.13 | 0.80 | 0.32 | 1.13 | 0.77 | 1.27 | 1.11 | 1.01 | 1.10 | 1.09 | 1.12 | Open-Source Video LLM |
| idefics2-8B-[8f] | 1.23 | 1.07 | 0.89 | 0.77 | 1.06 | 0.77 | 1.27 | 1.41 | 1.11 | 1.14 | 1.16 | 1.10 | Open-Source LVLM for Images |
| ShareGPT4Video-8B-[16f*] | 1.20 | 1.05 | 1.00 | 0.32 | 1.04 | 0.89 | 1.06 | 1.19 | 1.01 | 0.99 | 1.03 | 1.05 | Open-Source Video LLM |
| Video-LLaVA-[8f] | 1.14 | 1.08 | 0.88 | 0.50 | 1.04 | 0.72 | 1.23 | 1.03 | 0.89 | 0.97 | 0.99 | 1.05 | Open-Source Video LLM |
| PLLaVA-7B-[16f] | 1.08 | 1.06 | 0.86 | 0.52 | 1.02 | 0.64 | 1.25 | 1.17 | 0.98 | 1.01 | 1.03 | 1.03 | Open-Source Video LLM |
| Chat-UniVi-[64f] | 1.07 | 1.00 | 0.93 | 0.39 | 0.98 | 0.59 | 1.18 | 1.14 | 0.75 | 0.98 | 0.97 | 0.99 | Open-Source Video LLM |
| VideoChat2-[16f] | 1.18 | 0.94 | 0.98 | 0.66 | 0.98 | 0.42 | 1.13 | 1.24 | 0.86 | 0.94 | 0.95 | 0.99 | Open-Source Video LLM |
| Video-ChatGPT-[100f] | 0.91 | 0.94 | 0.81 | 0.39 | 0.90 | 0.70 | 1.15 | 1.12 | 0.84 | 0.94 | 0.97 | 0.93 | Open-Source Video LLM |
| Qwen-8B-[8f] | 0.44 | 0.62 | 0.33 | 0.15 | 0.53 | 0.45 | 0.59 | 0.50 | 0.36 | 0.37 | 0.45 | 0.52 | Open-Source LVLM for Images |

📃 BibTeX


@article{fang2024mmbenchvideo,
  title={MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding},
  author={Xinyu Fang and Kangrui Mao and Haodong Duan and Xiangyu Zhao and Yining Li and Dahua Lin and Kai Chen},
  journal={arXiv preprint arXiv:2406.14515},
  year={2024}
}