MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

1 Shanghai AI Laboratory 2 Shanghai Jiao Tong University 3 The Chinese University of Hong Kong 4 Tongji University

* Equal contribution. Corresponding authors.
§ Work done during an internship at Shanghai AI Laboratory.

🚀 MMBench-Video: A quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding.
🚀 Contains long-form, diverse videos sourced from the web, encompassing a broad spectrum of topics.
🚀 Includes original, high-quality visual questions crafted by volunteers, spanning dozens of fine-grained capabilities.
🚀 Enhanced GPT-4-based evaluation paradigm; comprehensive assessment of various LVLMs and Video-LLMs.


🔥What's New
  • [2024.06.26] MMBench-Video has been integrated into VLMEvalKit!
  • [2024.06.26] The Project Page is released!
  • [2024.06.21] Our paper has been featured in 🤗HuggingFace Daily Papers, ranking 5th out of 24 daily papers.
  • [2024.06.20] The Paper is released!
  • [2024.06.12] The MMBench-Video Dataset is released!

Abstract

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evaluation code of MMBench-Video will be integrated into VLMEvalKit.

Comparison of MMBench-Video and Other VideoQA Benchmarks

The current evaluation of Video-LLMs is characterized by the following limitations:

1. Short Videos:

Existing VideoQA datasets primarily consist of short videos, typically lasting less than a minute. In contrast, most web video content spans several minutes or longer, creating a discrepancy between evaluation benchmarks and real-world application scenarios.

2. Limited Capabilities:

Current VideoQA benchmarks cover only a few basic video tasks, such as concept existence, object relationship recognition, and activity recognition, leaving many finer-grained perception and reasoning capabilities unassessed.

3. Biased Evaluation:

Existing evaluation paradigms employ GPT-3.5 to score open-ended answers generated by video-language models. Our preliminary study indicates that GPT-3.5-based evaluation is less accurate and diverges significantly from human preferences, diminishing the credibility of the evaluation results.
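
To make the LLM-based grading step concrete, here is a minimal sketch of scoring a free-form model answer against the reference answer with a judge model. The prompt wording, the 0-3 integer rating scale, and the `gpt-4-turbo` model name are illustrative assumptions, not the exact protocol used by MMBench-Video.

```python
# A minimal sketch of LLM-based answer grading. The prompt, the 0-3 scale,
# and the judge model name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

GRADING_PROMPT = """You are grading an answer to a video question.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
Rate how well the model answer matches the reference on an integer scale
from 0 (completely wrong) to 3 (fully correct). Reply with the number only."""


def grade_answer(question: str, reference: str, prediction: str) -> int:
    """Ask the judge model for a score and parse the first digit it returns."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # judge model (assumed)
        temperature=0,        # deterministic grading
        messages=[{
            "role": "user",
            "content": GRADING_PROMPT.format(
                question=question, reference=reference, prediction=prediction
            ),
        }],
    )
    text = response.choices[0].message.content.strip()
    digits = [c for c in text if c.isdigit()]
    return int(digits[0]) if digits else 0


if __name__ == "__main__":
    score = grade_answer(
        "What does the presenter do after opening the box?",
        "She takes out a camera and shows it to the audience.",
        "The presenter pulls a camera out of the box.",
    )
    print(score)
```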

| Benchmark | QA Pair Generation | # Capabilities | Question Length, words, mean(std) | Answer Length, words, mean(std) | Video Duration, sec, mean(std) | Shot Number, mean(std) |
|---|---|---|---|---|---|---|
| MSVD-QA | Automatic | 2 | 6.6(2.5) | 1.0(0.0) | 9.8(6.6) | 2.4(3.4) |
| MSRVTT-QA | Automatic | 2 | 7.4(3.4) | 1.0(0.0) | 15.1(5.2) | 3.4(2.9) |
| TGIF-QA | Automatic/Human | 4 | 9.7(2.3) | 1.5(0.9) | 3.7(2.0) | 1.2(1.4) |
| ActivityNet-QA | Human | 3 | 8.9(2.4) | 1.3(0.7) | 111.5(66.1) | 12.9(20.9) |
| MMBench-Video | Human | 26 | 10.9(4.1) | 8.4(7.7) | 165.4(80.7) | 32.6(33.5) |

MMBench-Video: a long-form, multi-shot VideoQA benchmark with diverse video categories: (a) The dataset covers a broad spectrum of categories, including science, sports, finance, games, news, etc. (b) The dataset includes videos ranging from 30 seconds to 6 minutes in length. (c) The videos in our benchmark display a long-tail distribution in shot numbers, with a maximum of 210 shots.
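
For reference, per-video shot counts like those reported above can be estimated with an off-the-shelf shot-boundary detector. The sketch below uses PySceneDetect's content-based detector; the library choice and threshold are assumptions for illustration and may differ from the tooling actually used to build the benchmark.

```python
# A sketch of estimating per-video shot counts with PySceneDetect
# (pip install "scenedetect[opencv]"). The detector and threshold are
# illustrative assumptions, not necessarily the benchmark's own tooling.
from scenedetect import detect, ContentDetector


def count_shots(video_path: str, threshold: float = 27.0) -> int:
    """Return the number of detected shots (scenes) in a video file."""
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    # detect() returns one (start, end) timecode pair per shot; a video
    # with no detected cuts yields an empty list, i.e. a single shot.
    return max(len(scenes), 1)


if __name__ == "__main__":
    print(count_shots("example_video.mp4"))
```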


Capability Taxonomy of MMBench-Video

A 3-level hierarchical capability taxonomy of MMBench-Video


The top level encompasses two broad capabilities: Perception and Reasoning. Besides the six L-2 capabilities inherited from MMBench, we further introduce three additional L-2 capabilities specific to MMBench-Video: Hallucination, Commonsense Reasoning, and Temporal Reasoning. Hallucination assesses whether a model is prone to generating content that includes misleading or inaccurate information. Commonsense Reasoning evaluates a model's ability to integrate necessary commonsense knowledge into its reasoning processes. Temporal Reasoning examines a model's proficiency in understanding the relationships between events unfolding at different points in a video. This taxonomy comprises a total of 26 leaf capabilities, which collectively address a comprehensive spectrum of cognitive processes involved in video comprehension.
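
As an illustration, the two top-level capabilities and the nine L-2 dimensions used in the leaderboard below can be organized as a simple nested mapping. Only the L-1 and L-2 names listed on this page are used; the 26 leaf capabilities are left as placeholders since they are not enumerated here.

```python
# A sketch of the 3-level capability hierarchy as a nested mapping.
# The top two levels follow the names used on this page; the 26 leaf
# capabilities are omitted and shown only as placeholders.
CAPABILITY_TAXONOMY = {
    "Perception": {
        "Coarse Perception (CP)": ["..."],
        "Single-Instance Fine-grained Perception (FP-S)": ["..."],
        "Cross-Instance Fine-grained Perception (FP-C)": ["..."],
        "Hallucination (HL)": ["..."],
    },
    "Reasoning": {
        "Logical Reasoning (LR)": ["..."],
        "Attribute Reasoning (AR)": ["..."],
        "Relation Reasoning (RR)": ["..."],
        "Commonsense Reasoning (CSR)": ["..."],
        "Temporal Reasoning (TR)": ["..."],
    },
}
```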


MMBench-Video Leaderboard

Perception: CP (coarse perception), FP-S (single-instance fine-grained perception), FP-C (cross-instance fine-grained perception), HL (hallucination)

Reasoning: LR (logical reasoning), AR (attribute reasoning), RR (relation reasoning), CSR (commonsense reasoning), TR (temporal reasoning)

The best results are highlighted in bold and underlined.
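
Assuming each leaderboard entry is the mean judge score over the questions belonging to that dimension, and the overall score is the mean over all questions, a minimal aggregation sketch is shown below. The flat (question, dimension, score) record layout and this weighting are assumptions for illustration, not the exact bookkeeping behind the reported numbers.

```python
# A sketch of aggregating per-question judge scores into per-dimension means
# and an overall mean. The record layout and weighting are assumptions.
import pandas as pd

records = pd.DataFrame(
    [
        # question_id, L-2 dimension, judge score for one model (toy values)
        ("q1", "CP", 2), ("q2", "FP-S", 1), ("q3", "TR", 3), ("q4", "HL", 0),
    ],
    columns=["question_id", "dimension", "score"],
)

per_dimension = records.groupby("dimension")["score"].mean()
overall = records["score"].mean()  # mean over all questions, not over dimensions

print(per_dimension)
print(f"Overall: {overall:.2f}")
```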

| Model | CP | FP-S | FP-C | HL | Perception Mean | LR | AR | RR | CSR | TR | Reasoning Mean | Overall Mean | Model Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-[8f] 🥇 | 1.82 | 1.59 | 1.43 | 1.95 | 1.63 | 1.33 | 1.89 | 1.60 | 1.60 | 1.44 | 1.57 | 1.62 | Proprietary LVLM for Images |
| GPT-4v-[8f] 🥈 | 1.68 | 1.45 | 1.43 | 1.79 | 1.51 | 1.14 | 1.81 | 1.70 | 1.59 | 1.39 | 1.52 | 1.53 | Proprietary LVLM for Images |
| Gemini-Pro-v1.0-[8f] 🥉 | 1.72 | 1.50 | 1.28 | 0.79 | 1.49 | 1.02 | 1.66 | 1.58 | 1.59 | 1.40 | 1.45 | 1.49 | Proprietary LVLM for Images |
| Gemini-Pro-v1.5-[8f] | 1.51 | 1.30 | 0.98 | 2.03 | 1.32 | 1.06 | 1.62 | 1.36 | 1.25 | 0.94 | 1.22 | 1.30 | Proprietary LVLM for Images |
| InternVL-Chat-v1.5-[8f] | 1.51 | 1.22 | 1.01 | 1.21 | 1.25 | 0.88 | 1.40 | 1.48 | 1.28 | 1.09 | 1.22 | 1.26 | Open-Source LVLM for Images |
| Claude-3v-Opus-[4f] | 1.37 | 1.11 | 1.00 | 1.56 | 1.16 | 1.12 | 1.35 | 1.36 | 1.17 | 1.05 | 1.20 | 1.19 | Proprietary LVLM for Images |
| mPLUG-Owl2-[8f] | 1.34 | 1.18 | 0.99 | 0.27 | 1.15 | 0.63 | 1.33 | 1.30 | 1.03 | 1.11 | 1.11 | 1.15 | Open-Source LVLM for Images |
| VideoStreaming-[64f+] | 1.38 | 1.13 | 0.80 | 0.32 | 1.13 | 0.77 | 1.27 | 1.11 | 1.01 | 1.10 | 1.09 | 1.12 | Open-Source Video LLM |
| idefics2-8B-[8f] | 1.23 | 1.07 | 0.89 | 0.77 | 1.06 | 0.77 | 1.27 | 1.41 | 1.11 | 1.14 | 1.16 | 1.10 | Open-Source LVLM for Images |
| ShareGPT4Video-8B-[16f*] | 1.20 | 1.05 | 1.00 | 0.32 | 1.04 | 0.89 | 1.06 | 1.19 | 1.01 | 0.99 | 1.03 | 1.05 | Open-Source Video LLM |
| Video-LLaVA-[8f] | 1.14 | 1.08 | 0.88 | 0.50 | 1.04 | 0.72 | 1.23 | 1.03 | 0.89 | 0.97 | 0.99 | 1.05 | Open-Source Video LLM |
| PLLaVA-7B-[16f] | 1.08 | 1.06 | 0.86 | 0.52 | 1.02 | 0.64 | 1.25 | 1.17 | 0.98 | 1.01 | 1.03 | 1.03 | Open-Source Video LLM |
| Chat-UniVi-[64f] | 1.07 | 1.00 | 0.93 | 0.39 | 0.98 | 0.59 | 1.18 | 1.14 | 0.75 | 0.98 | 0.97 | 0.99 | Open-Source Video LLM |
| VideoChat2-[16f] | 1.18 | 0.94 | 0.98 | 0.66 | 0.98 | 0.42 | 1.13 | 1.24 | 0.86 | 0.94 | 0.95 | 0.99 | Open-Source Video LLM |
| Video-ChatGPT-[100f] | 0.91 | 0.94 | 0.81 | 0.39 | 0.90 | 0.70 | 1.15 | 1.12 | 0.84 | 0.94 | 0.97 | 0.93 | Open-Source Video LLM |
| Qwen-8B-[8f] | 0.44 | 0.62 | 0.33 | 0.15 | 0.53 | 0.45 | 0.59 | 0.50 | 0.36 | 0.37 | 0.45 | 0.52 | Open-Source LVLM for Images |

📃 BibTeX


@article{fang2024mmbenchvideo,
  title={MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding},
  author={Xinyu Fang and Kangrui Mao and Haodong Duan and Xiangyu Zhao and Yining Li and Dahua Lin and Kai Chen},
  journal={arXiv preprint arXiv:2406.14515},
  year={2024}
}