TVBench: Redesigning Video-Language Evaluation

¹University of Santiago de Compostela   ²University of Amsterdam   ³University of Technology Nuremberg

Overview

Large language models have demonstrated impressive performance when integrated with vision models, even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been proposed. In our paper, we show that the most widely used video-language benchmarks, such as MVBench, can be solved with little temporal reasoning, and we propose a new benchmark, TVBench, as an effective evaluation tool for current video-language models:

  • MVBench contains unlikely candidate answers easily dismissed from a single frame.
    → We provide only temporally challenging candidate answers, requiring models to leverage temporal information to answer correctly.
  • MVBench contains QA pairs with obvious solutions caused by biased LLM-based question generation.
    → To address this, we generate questions using text templates, preventing grammatical issues that could be exploited by text-only models.
  • MVBench contains QA pairs that can be solved by solely relying on prior world knowledge.
    → We design questions that can be answered purely from the video content, without relying on prior world knowledge.

In addition, we find that open-ended question-answering benchmarks for video understanding suffer from similar issues, and that automatic evaluation with LLMs is unreliable, making them an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, most recent state-of-the-art video-language models perform close to random chance on TVBench, with only Gemini-Pro and Tarsier clearly surpassing this baseline.

[Figure]

Does Time Matter?

[Figure]

To effectively evaluate a model's temporal understanding, video benchmarks must define tasks that cannot be solved with spatial information alone. In video multiple-choice question answering (MCQA), questions should not be answerable using spatial details from a single random frame or from shuffled frames. If temporal understanding is not required, the benchmark only assesses spatial information, which we define as spatial bias.

We analyze this spatial bias in MVBench using various state-of-the-art image and video-language models across four tasks: i) scene transition, ii) fine-grained pose, iii) episodic reasoning, and iv) fine-grained action. Image-language models perform unexpectedly well on these tasks, often matching video-language models despite the expectation of temporal reasoning. For example, in the Fine-grained Action task, the image-based GPT-4o model scores 49%, slightly better than the state-of-the-art video model Tarsier at 48.5%.

Shuffling videos has minimal impact on video-language models, further indicating a lack of temporal dependency. Across all 20 MVBench tasks, GPT-4o achieves an average accuracy of 47.8%, 20.5% above the random baseline, highlighting this spatial bias. Both VideoChat2 and Tarsier exhibit high consistency between answers on original and shuffled videos, with VideoChat2 providing the same answer 91% of the time and Tarsier 82%. These findings suggest that spatial bias affects the entire MVBench dataset.
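
To make this consistency measure concrete, the sketch below computes how often a model returns the same answer for the original and a temporally shuffled version of each video. It is a minimal illustration, not the paper's evaluation code: the model interface and the sample format (frames, question, candidates) are assumptions.

import random

def answer_consistency(model, samples, seed=0):
    """Fraction of questions for which the model gives the same answer on the
    original and on a temporally shuffled version of the video.
    Assumes each sample is a dict with 'frames', 'question' and 'candidates',
    and that model(frames, question, candidates) returns one of the candidates."""
    rng = random.Random(seed)
    same = 0
    for s in samples:
        shuffled = s["frames"][:]
        rng.shuffle(shuffled)  # destroys temporal order but keeps spatial content
        a_orig = model(s["frames"], s["question"], s["candidates"])
        a_shuf = model(shuffled, s["question"], s["candidates"])
        same += int(a_orig == a_shuf)
    return same / len(samples)

A consistency close to 1 means the model's answers barely depend on frame order, which is exactly the spatial bias described above.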

[Figure]

Does Vision Matter?

[Figure]

Video benchmarks must also ensure that questions cannot be answered purely through common-sense reasoning. Modern LLMs possess strong reasoning abilities that allow them to exploit the question and candidate answer sets of MCQA video-language benchmarks, introducing substantial textual bias: models can answer questions without actually leveraging the video content.

We evaluate this textual bias in MVBench by testing text-only LLMs. The results show that LLMs can easily eliminate implausible candidates, often matching the performance of video-language models. For instance, Llama 3 achieves 44.5% on the Action Count task, closely trailing Tarsier at 46.5%. Across all 20 tasks, Llama 3 averages 38.1%, significantly outperforming the random baseline of 27.3%. We identify three key sources of this bias:

  • Bias from LLM-based QA generation: Many tasks in MVBench use QA pairs generated by LLMs like ChatGPT, leading to unrealistic candidates or poorly phrased questions that can be solved without visual input. For example, in the Action Antonym task, the incorrect candidates are often unrealistic or irrelevant, allowing text-only models to easily identify the correct answer.
  • Bias from unbalanced QA sets: Unbalanced candidate sets skew the evaluation; in the Action Count task, for example, '3' is the correct answer for 45% of the questions. Text-only models such as GPT-4o capitalize on this by over-predicting common answers, yielding results close to those of video models (see the sketch after this list).
  • Overreliance on world knowledge: LLMs can leverage prior world knowledge to guess answers without visual reasoning, even when the questions are well-designed.
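
The unbalanced-candidate issue can be quantified with a simple majority-answer baseline: if a single answer dominates the ground truth, a text-only model that always predicts it already scores far above chance. The sketch below is illustrative; the field names are assumptions rather than MVBench's actual data format.

from collections import Counter

def majority_answer_baseline(qa_pairs):
    """Accuracy of always predicting the most frequent ground-truth answer.
    Assumes qa_pairs is a list of dicts with an 'answer' field."""
    counts = Counter(qa["answer"] for qa in qa_pairs)
    answer, freq = counts.most_common(1)[0]
    return answer, freq / len(qa_pairs)

# If '3' is the correct answer for 45% of Action Count questions, a model that
# always outputs '3' already reaches 0.45 accuracy without looking at any video.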

[Figure]

Open-ended QA to the rescue?

[Figure]

Unlike multiple-choice question answering (MCQA), open-ended question answering removes the reliance on predefined candidates, potentially solving the issues mentioned above. However, it introduces new challenges. Following Maaz et al. (2023), open-ended answers are typically scored by an LLM such as GPT-3.5, which rates each response based on its alignment with the question and the ground-truth answer; this relies on private APIs, which can be unreliable.
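
The sketch below shows what such an LLM-judge evaluation typically looks like, loosely following the protocol of Maaz et al. (2023). It assumes the official openai Python client and a configured API key; the prompt wording and the JSON score format are illustrative, not the exact protocol used by these benchmarks.

import json
from openai import OpenAI  # assumes the `openai` package and an API key are available

client = OpenAI()

def judge_answer(question, ground_truth, prediction, model="gpt-3.5-turbo"):
    """Ask an LLM judge whether a prediction matches the ground truth and to
    assign a 0-5 score, as in common open-ended VideoQA evaluation protocols."""
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Ground-truth answer: {ground_truth}\n"
        f"Predicted answer: {prediction}\n"
        'Reply with JSON only: {"correct": true or false, "score": 0-5}.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the judge replies with valid JSON; a robust pipeline would add parsing checks.
    return json.loads(response.choices[0].message.content)

Swapping the judge model (for example, GPT-3.5 for Llama-3-70B) changes nothing in this pipeline except the backend, yet, as discussed below, it can shift the reported scores by more than 20 points.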

We analyze the impact of using two different evaluators, GPT-3.5 and Llama-3-70B, on three open-ended datasets. The results vary significantly, with discrepancies of over 20 points that especially benefit text-only and single-image models. Llama 3 often rates incorrect predictions higher, highlighting the risk of shared biases between the prediction and evaluation models.

Moreover, open-ended QA does not fully resolve the problems of MCQA. Text-only models can guess answers from the question alone, achieving strong performance without any visual input. For example, GPT-4o achieves accuracy close to that of video-language models, and shuffling video frames barely affects the results, indicating that temporal understanding is not crucial for these tasks.

In conclusion, open-ended benchmarks remain unreliable due to the use of biased LLMs as evaluators, they suffer from the same spatial and textual biases as MCQA benchmarks, and they depend on costly, non-reproducible private APIs. This makes them unsuitable for evaluating video-language models.

[Figure]

TVBench: Temporal Video-Language Benchmark

[Figure: example TVBench questions with temporally challenging candidates]

We propose TVBench, a new benchmark designed to evaluate temporal understanding in video QA. By adopting a multiple-choice format, TVBench avoids the pitfalls of open-ended VQA evaluation. Its design principles address the problems identified in existing video MCQA benchmarks and can be applied to any video QA benchmark to increase its temporal complexity.

Developing a Temporal Video-Language Benchmark

1. Defining hard answer candidates:

To ensure that temporal information is essential for answering correctly, we design questions with temporally challenging candidates. The figure above illustrates our solution to Problem 1 (spatial bias): hard candidates that cannot be dismissed without temporal understanding. Tasks such as Action Count, Object Shuffle, Action Localization, and Scene Transition require temporal reasoning, ensuring that single-frame models and models given shuffled videos perform at random levels.
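
As a hypothetical illustration of such hard candidates, a Scene Transition question can pair the correct transition with the same transition in reverse order, so that both candidates are consistent with any single frame. The template below is illustrative and not the exact one used in TVBench.

def scene_transition_qa(first_scene, second_scene):
    """Build a QA pair whose distractor is the reversed transition: both
    candidates mention the same two scenes, so a single frame cannot rule
    either of them out; only the temporal order can."""
    return {
        "question": "How does the scene in the video change?",
        "candidates": [
            f"From the {first_scene} to the {second_scene}.",  # correct
            f"From the {second_scene} to the {first_scene}.",  # hard negative
        ],
        "answer": f"From the {first_scene} to the {second_scene}.",
    }

print(scene_transition_qa("kitchen", "garden"))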

2. Reducing overly informative questions:

We use question templates to avoid textual bias and to ensure that the phrasing does not give away the answer. Tasks like Action Count, Object Count, and Moving Direction use a balanced candidate set in which each candidate appears equally often as the correct answer, forcing models to rely on visual understanding rather than answer statistics.
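
A minimal sketch of how such a balanced set can be built for a counting task is shown below: equally many questions are sampled for every ground-truth count, and the full set of counts is always offered as candidates. Function and field names are assumptions for illustration.

import random

def build_action_count_qa(videos_by_count, counts=(2, 3, 4, 5), per_count=100, seed=0):
    """Sample equally many questions per ground-truth count so that no answer
    dominates, and always offer the full set of counts as candidates.
    Assumes videos_by_count[c] holds at least per_count videos whose action
    is repeated c times."""
    rng = random.Random(seed)
    qa = []
    for c in counts:
        for video in rng.sample(videos_by_count[c], per_count):
            qa.append({
                "video": video,
                "question": "How many times was the action repeated?",
                "candidates": [str(k) for k in counts],
                "answer": str(c),
            })
    return qa  # each count now appears as the correct answer exactly per_count times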

3. Omitting questions requiring prior knowledge:

To mitigate Problem 2 (textual bias), we eliminate tasks that rely on world knowledge, such as Episodic Reasoning from MVBench. TVBench focuses solely on visual information, removing any reliance on common knowledge so that the evaluation tests temporal reasoning and video understanding.

[Figure: results on TVBench compared to MVBench]

Evaluating TVBench

The figure above summarizes the results on TVBench and contrasts them with those on MVBench.

Does Time Matter?

On TVBench, single-image models perform at random levels, indicating that a random frame is insufficient for accurate answers. For example, GPT-4o outperforms random chance by only 2.5% on TVBench, compared to a 20.5% improvement on MVBench. Shuffling videos has little effect on video models' performance on MVBench but significantly degrades accuracy on TVBench, showing the importance of temporal context. Reversing videos further worsens performance on TVBench, with models like Tarsier-7B and Tarsier-34B scoring below random levels, illustrating their reliance on correct temporal order.
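
The shuffle and reverse comparisons amount to a simple temporal stress test that evaluates the same model under three frame orders. As before, this is a sketch under assumed interfaces, not the benchmark's evaluation code.

import random

def temporal_stress_test(model, samples, seed=0):
    """Accuracy with the original, shuffled and reversed frame order.
    On a temporally demanding benchmark the three numbers should differ clearly.
    Assumes each sample is a dict with 'frames', 'question', 'candidates' and 'answer'."""
    rng = random.Random(seed)
    correct = {"original": 0, "shuffled": 0, "reversed": 0}
    for s in samples:
        orders = {
            "original": s["frames"],
            "shuffled": rng.sample(s["frames"], len(s["frames"])),
            "reversed": list(reversed(s["frames"])),
        }
        for name, frames in orders.items():
            prediction = model(frames, s["question"], s["candidates"])
            correct[name] += int(prediction == s["answer"])
    return {name: c / len(samples) for name, c in correct.items()}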

Does Vision Matter?

Text-only models perform at random levels on TVBench, indicating that our solutions to Problem 2 (textual bias) are effective. For instance, Llama 3 performs only 1.4% above random chance on TVBench, compared to a 10.8% improvement on MVBench. This demonstrates that LLMs cannot rely on the question and answer candidates or on prior knowledge, making visual information crucial for solving TVBench.

Overview of TVBench

[Figure: TVBench tasks and their source datasets]
Above is a list of all tasks in TVBench and the datasets they are sourced from.

BibTeX

@article{cores2025tvbench,
  author    = {Daniel Cores and Michael Dorkenwald and Manuel Mucientes and Cees G. M. Snoek and Yuki M. Asano},
  title     = {{TVBench}: Redesigning Video-Language Evaluation},
  journal   = {ArXiv},
  year      = {2024}
}