- [2025-10-02]: 🚀 Code released.
Cinematography understanding refers to the ability to recognize not only the visual content of a scene but also the cinematic techniques that shape narrative meaning. This capability is attracting increasing attention, as it enhances multimodal understanding in real-world applications and underpins coherent content creation in film and media. As the most comprehensive benchmark for this task, ShotBench spans a wide range of cinematic concepts and VQA-style evaluations, with ShotVL achieving state-of-the-art results on it. However, our analysis reveals that ambiguous option design in ShotBench and ShotVL’s shortcomings in reasoning consistency and instruction adherence undermine evaluation reliability, limiting fair comparison and hindering future progress. To overcome these issues, we systematically refine ShotBench through consistent option restructuring, conduct the first critical analysis of ShotVL’s reasoning behavior, and introduce an extended evaluation protocol that jointly assesses task accuracy and core model competencies. These efforts lead to RefineShot, a refined and expanded benchmark that enables more reliable assessment and fosters future advances in cinematography understanding.
Our contributions are as follows:
- Benchmark Refinement. We redesign the multiple-choice option sets in ShotBench by enforcing consistent granularity, unified evaluation dimensions, and mutual exclusivity. This yields a coherent and reliable dataset for evaluating cinematography understanding.
- Critical Analysis of State-of-the-Art Baselines. We conduct the first in-depth study of ShotVL, the reported state of the art on ShotBench, and reveal fundamental weaknesses in reasoning, prompt adherence, and output consistency, challenging the validity of its benchmark superiority.
- Expanded Evaluation Protocol. We augment ShotBench with a new protocol that jointly assesses task-specific performance and core model competencies, providing a more balanced and robust framework for fair comparison and future progress in this emerging field. Together, these contributions establish RefineShot, a refined and extended benchmark for cinematography understanding.
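As an illustration of the mutual-exclusivity requirement in the benchmark refinement, a minimal sketch of an option-set validator, assuming a simple dict-based item format (`validate_option_set` and the field names are hypothetical, not part of the released code):

```python
# Hypothetical sketch: validate a multiple-choice option set for
# mutual exclusivity (no duplicate options) and a single valid answer.
# The "options" / "answer" field names are illustrative assumptions.

def validate_option_set(question: dict) -> list[str]:
    """Return a list of problems found in one VQA item."""
    problems = []
    options = [o.strip().lower() for o in question["options"]]

    # Mutual exclusivity: no two options may be identical.
    if len(set(options)) != len(options):
        problems.append("duplicate options")

    # The labeled answer must be one of the options.
    if question["answer"].strip().lower() not in options:
        problems.append("answer not among options")

    return problems

item = {
    "options": ["close-up", "medium shot", "close-up", "wide shot"],
    "answer": "close-up",
}
print(validate_option_set(item))  # -> ['duplicate options']
```

A real refinement pass would also need human judgment for consistent granularity and unified evaluation dimensions, which cannot be checked mechanically.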
```shell
conda env create -n shot
source activate shot
cd shot
bash setup.sh
```

You can use the script below to download and unzip the original ShotBench dataset:

```shell
bash data.sh
```

Then you can use the script below to obtain the RefinedBench. Detailed refinement steps can be found in the paper. Remember to first rename test.tsv to test_origin.tsv, or modify extract.py accordingly.

```shell
python extract.py
```

Use the shell scripts below to evaluate different VLMs. You can change parameters such as MODEL_NAME and CATEGORY to run different experiments:

```shell
bash evaluate_qwen.sh
bash evaluate_shotvl.sh
```

Then you can run calculate.sh to compute the metrics. In the calculate.sh script, you can add the arguments --check_adherence and --check_consistency to compute the two reliability scores proposed in our paper:

```shell
bash calculate.sh
python evaluation/calculate_scores.py --check_adherence
python evaluation/calculate_scores.py --check_consistency
```

- Model Analysis. This figure shows two main defects of ShotVL models: reasoning unfaithfulness, with frequent mismatches between reasoning and answers, and poor instruction adherence, where prompts are ignored in favor of long, repetitive outputs.
- Instruction adherence case. This case illustrates the instruction adherence of different models. Given a demonstration-based prompt, ShotVL fails to follow the instructions and produces disorganized reasoning, whereas Qwen follows the format accurately, outputting each step and the final answer as required.
- Experimental results of models after consistency check. We evaluate all model outputs for consistency between reasoning and final answers, treating mismatched cases as incorrect. The Qwen series shows almost no performance drop, while ShotVL suffers a notable decrease, indicating weaker reasoning faithfulness.
- Performance of different models on the refined benchmark under reasoning and step-by-step prompts. Qwen remains stable across prompts, while ShotVL shows clear performance drops and higher time cost. Time cost is reported in hours:minutes (hh:mm) format.
- Performance and reliability of different models on the refined benchmark. Using our proposed evaluation, we find that Qwen achieves near-perfect reliability, significantly outperforming ShotVL, whose results are notably weaker, particularly in instruction adherence.
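The two reliability scores behind --check_adherence and --check_consistency can be sketched conceptually as follows, assuming answers are option letters A–D (the helper names and regex patterns are illustrative assumptions, not the actual calculate_scores.py logic):

```python
import re

# Hypothetical sketch of the two reliability checks. An output is treated
# as adherent if it contains the required "Answer:" line, and as consistent
# if the last option letter named in the reasoning matches the final answer.

def check_adherence(output: str) -> bool:
    """Did the model follow the required answer format?"""
    return re.search(r"Answer:\s*[A-D]\b", output) is not None

def check_consistency(reasoning: str, final_answer: str) -> bool:
    """Does the last option letter in the reasoning match the final answer?"""
    mentioned = re.findall(r"\b([A-D])\b", reasoning)
    return bool(mentioned) and mentioned[-1] == final_answer

out = "The framing is tight, so the answer is B. Answer: B"
print(check_adherence(out))                           # -> True
print(check_consistency("so the answer is B", "B"))   # -> True
```

In the consistency evaluation described above, outputs whose reasoning and final answer disagree are scored as incorrect, which is what separates the Qwen series from ShotVL in our tables.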
If you find our project useful, please star our repo and cite our paper as follows:
```bibtex
@article{wu2025refineshot,
  title={RefineShot: Rethinking Cinematography Understanding with Foundational Skill Evaluation},
  author={Wu, Hang and Cai, Yujun and Ge, Haonan and Chen, Hongkai and Yang, Ming-Hsuan and Wang, Yiwei},
  journal={arXiv preprint arXiv:2510.02423},
  year={2025}
}
```
Our repository is based on the following projects; we sincerely thank them for their great efforts and excellent work.
- ShotBench: the latest cinematography understanding benchmark.
This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.





