
Adds PerceptionLM and PLM-VideoBench #638


Merged
@Luodian merged 3 commits into EvolvingLMMs-Lab:main on Apr 18, 2025

Conversation

@mmaaz60 (Contributor) commented on Apr 17, 2025

This pull request adds:

  1. Perception Language Model (PLM)
  2. PLM-VideoBench benchmarks

Perception Language Model (PLM) is a state-of-the-art, fully open and reproducible MLLM for transparent research in image and video understanding. It was introduced in "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding".

PLM-VideoBench is a collection of human-annotated resources for evaluating vision-language models, focused on detailed video understanding. It was introduced in the same paper.

PLM-VideoBench includes five evaluation tasks:

  1. FGQA - In this task, a model must answer a multiple-choice question (MCQ) that probes fine-grained activity understanding. Given a question and multiple options that differ in a fine-grained detail (e.g., painting vertically vs. horizontally), the model must select the correct answer. To reduce bias, we follow prior work and report multi-binary accuracy (MBAcc); see the sketch after this list.
  2. SGQA - In this task, a model must answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device. To evaluate performance we use LLM-judge accuracy.
  3. RCap - In this task, the model must generate a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified time interval, the model is required to output a caption that accurately describes the event occurring within that interval.
  4. RTLoc - This task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.
  5. RDCap - In this task, a model must generate a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the model must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible. We report SODA score, which leverages an LLM judge to assess the quality of the generated captions.
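
For readers unfamiliar with MBAcc, here is a minimal sketch of one way the aggregation can work, assuming each MCQ is decomposed into one binary (yes/no) check per answer option and a question counts as correct only when all of its binary checks pass. The decomposition and field names are illustrative assumptions, not the exact PLM-VideoBench scoring code.

```python
from collections import defaultdict

def multi_binary_accuracy(binary_results):
    """Aggregate per-option binary checks into MBAcc.

    `binary_results` holds one (question_id, is_correct) pair per
    (question, option) binary check; a question counts as correct
    only if all of its checks pass. Illustrative sketch only, not
    the exact PLM-VideoBench scoring code.
    """
    per_question = defaultdict(list)
    for question_id, is_correct in binary_results:
        per_question[question_id].append(is_correct)
    if not per_question:
        return 0.0
    num_correct = sum(all(checks) for checks in per_question.values())
    return num_correct / len(per_question)

# "q1" passes both of its binary checks, "q2" fails one -> MBAcc = 0.5
results = [("q1", True), ("q1", True), ("q2", True), ("q2", False)]
print(multi_binary_accuracy(results))  # 0.5
```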

GitHub: https://github.com/facebookresearch/perception_models

Models: PLM weights are available in 1B, 3B, and 8B sizes.

PLM Image and Video Benchmark Results

Please refer to Tables 3 and 4 in the PLM paper for a detailed comparison with other MLLM baselines.

PLM-VideoBench Results

Our implementation successfully reproduces the PLM-VideoBench results reported in the paper. We also provide detailed instructions for evaluating PLM on PLM-VideoBench at lmms_eval/tasks/plm_videobench/README.md.

| Model | Method | FGQA (MBAcc) | SGQA (Acc) | RTLoc (meanR) | Avg. |
|---|---|---|---|---|---|
| PLM-8B | Reported in the paper | 67.7 | 46.2 | 59.1 | 57.6 |
| PLM-8B | Reproduced using lmms-eval | 67.9 | 45.6 | 59.7 | 57.7 |

The PR also updates some task configs to add pre- and post-prompts for PLM, and it adds yerevann/coco-karpathy to lmms-eval.
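
For context, these pre/post prompts are plain strings wrapped around the raw question text when the prompt for a specific model is built. A minimal sketch of the idea, where the key names and prompt text are illustrative assumptions rather than the exact values shipped in this PR:

```python
# Hypothetical per-model prompt override for a task config; key names
# and prompt text are illustrative, not the exact PR config.
plm_prompt_kwargs = {
    "pre_prompt": "",
    "post_prompt": "\nAnswer with the option's letter from the given choices directly.",
}

def build_prompt(question: str, kwargs: dict) -> str:
    """Wrap the raw question text with the model-specific pre/post prompts."""
    return f"{kwargs['pre_prompt']}{question}{kwargs['post_prompt']}"

print(build_prompt("What is the person painting?", plm_prompt_kwargs))
```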

Please refer to perception_models/apps/plm/docs/evaluation.md for instructions on evaluating PLM on multiple image and video benchmarks using lmms-eval.
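
As a concrete starting point, a launch for these tasks would look roughly like the sketch below (shown via subprocess so it is runnable as-is). The model name "plm", the checkpoint identifier, and the "plm_videobench" task name are assumptions to verify against the two READMEs referenced above; the CLI flags themselves (--model, --model_args, --tasks, --batch_size, --log_samples, --output_path) are standard lmms-eval options.

```python
import subprocess

# Hypothetical invocation: verify the model name ("plm"), the
# pretrained checkpoint, and the task name ("plm_videobench")
# against lmms_eval/tasks/plm_videobench/README.md before running.
subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "plm",
        "--model_args", "pretrained=facebook/Perception-LM-8B",
        "--tasks", "plm_videobench",
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```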

The pull request does not break any existing functionality in lmms-eval.

@Luodian (Contributor) commented on Apr 18, 2025

Great work!

@Luodian self-assigned this on Apr 18, 2025
@Luodian merged commit 9596fbd into EvolvingLMMs-Lab:main on Apr 18, 2025 (1 check failed)
@mmaaz60 (Contributor, Author) commented on Apr 18, 2025

Hi @Luodian,

Thank you for reviewing and merging the pull request. I really appreciate it. I wanted to ask if you could update the main README to announce that Perception Language Model and PLM-VideoBench have been added to lmms-eval? Thanks.

@Luodian (Contributor) commented on Apr 19, 2025

> I wanted to ask if you could update the main README to announce that Perception Language Model and PLM-VideoBench have been added to lmms-eval?

Yes, sure. You're welcome to announce it here~

dadwadw233 pushed a commit to dadwadw233/lmms-eval that referenced this pull request Apr 28, 2025
* Implements PLM and PLM-VideoBench from 'PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding'

* Updates docs.

* Removes redundant code.