
Adds PerceptionLM and PLM-VideoBench #638


Merged
@Luodian merged 3 commits into EvolvingLMMs-Lab:main on Apr 18, 2025

Conversation

@mmaaz60 (Contributor) commented on Apr 17, 2025

This pull request adds:

  1. Perception Language Model (PLM)
  2. PLM-VideoBench benchmarks

Perception Language Model (PLM) is a state-of-the-art, fully open and reproducible MLLM for transparent research in image and video understanding. It was introduced in "PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding".

PLM-VideoBench is a collection of human-annotated resources for evaluating vision-language models, focused on detailed video understanding. It was introduced in the same paper.

PLM-VideoBench includes five evaluation tasks:

  1. FGQA - In this task, a model must answer a multiple-choice question (MCQ) that probes fine-grained activity understanding. Given a question and multiple options that differ in a fine-grained detail (e.g., painting vertically vs. horizontally), the model must select the correct answer. To reduce bias, we follow prior work and report multi-binary accuracy (MBAcc); see the sketch after this list.
  2. SGQA - In this task, a model must answer open-ended questions about activities and objects visible in an egocentric video stream recorded by a smart-glasses device. To evaluate performance we use LLM-judge accuracy.
  3. RCap - In this task, the model must generate a detailed description of an event involving a subject of interest in the video. Given a region mask and a specified time interval, the model is required to output a caption that accurately describes the event occurring within that interval.
  4. RTLoc - This task is the inverse of RCap: instead of generating the caption, the model receives it as input and generates the corresponding time interval.
  5. RDCap - In this task, a model must generate a detailed description of all events involving a specific subject of interest (e.g., a person, animal, or object) in a video. Given a video and a region masklet, the model must produce a sequence of (start, end, caption) tuples that cover the entire duration of the video, including periods when the subject is not visible. We report SODA score, which leverages an LLM judge to assess the quality of the generated captions.
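
For readers unfamiliar with MBAcc, here is a minimal sketch of one way the aggregation can work, assuming each MCQ is decomposed into one binary (yes/no) check per answer option and a question counts as correct only when all of its binary checks pass. The decomposition and field names are illustrative assumptions, not the exact PLM-VideoBench scoring code.

```python
from collections import defaultdict

def multi_binary_accuracy(binary_results):
    """Aggregate per-option binary checks into MBAcc.

    `binary_results` holds one (question_id, is_correct) pair per
    (question, option) binary check; a question counts as correct
    only if all of its checks pass. Illustrative sketch only, not
    the exact PLM-VideoBench scoring code.
    """
    per_question = defaultdict(list)
    for question_id, is_correct in binary_results:
        per_question[question_id].append(is_correct)
    if not per_question:
        return 0.0
    num_correct = sum(all(checks) for checks in per_question.values())
    return num_correct / len(per_question)

# "q1" passes both of its binary checks, "q2" fails one -> MBAcc = 0.5
results = [("q1", True), ("q1", True), ("q2", True), ("q2", False)]
print(multi_binary_accuracy(results))  # 0.5
```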

GitHub: https://github.com/facebookresearch/perception_models

Models: PLM weights are available in 1B, 3B, and 8B sizes.

PLM Image and Video Benchmark Results

Please refer to Tables 3 and 4 in the PLM paper for a detailed comparison with other MLLM baselines.

PLM-VideoBench Results

Our implementation successfully reproduces the PLM-VideoBench results reported in the paper. We also provide detailed instructions for evaluating PLM on PLM-VideoBench at lmms_eval/tasks/plm_videobench/README.md.

| Model | Method | FGQA (MBAcc) | SGQA (Acc) | RTLoc (meanR) | Avg. |
|---|---|---|---|---|---|
| PLM-8B | Reported in the paper | 67.7 | 46.2 | 59.1 | 57.6 |
| PLM-8B | Reproduced using lmms-eval | 67.9 | 45.6 | 59.7 | 57.7 |

The PR also updates some task configs to add pre- and post-prompts for PLM, and it adds yerevann/coco-karpathy to lmms-eval.
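
For context, these pre/post prompts are plain strings wrapped around the raw question text when the prompt for a specific model is built. A minimal sketch of the idea, where the key names and prompt text are illustrative assumptions rather than the exact values shipped in this PR:

```python
# Hypothetical per-model prompt override for a task config; key names
# and prompt text are illustrative, not the exact PR config.
plm_prompt_kwargs = {
    "pre_prompt": "",
    "post_prompt": "\nAnswer with the option's letter from the given choices directly.",
}

def build_prompt(question: str, kwargs: dict) -> str:
    """Wrap the raw question text with the model-specific pre/post prompts."""
    return f"{kwargs['pre_prompt']}{question}{kwargs['post_prompt']}"

print(build_prompt("What is the person painting?", plm_prompt_kwargs))
```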

Please refer to perception_models/apps/plm/docs/evaluation.md for instructions on evaluating PLM on multiple image and video benchmarks using lmms-eval.
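
As a concrete starting point, a launch for these tasks would look roughly like the sketch below (shown via subprocess so it is runnable as-is). The model name "plm", the checkpoint identifier, and the "plm_videobench" task name are assumptions to verify against the two READMEs referenced above; the CLI flags themselves (--model, --model_args, --tasks, --batch_size, --log_samples, --output_path) are standard lmms-eval options.

```python
import subprocess

# Hypothetical invocation: verify the model name ("plm"), the
# pretrained checkpoint, and the task name ("plm_videobench")
# against lmms_eval/tasks/plm_videobench/README.md before running.
subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "plm",
        "--model_args", "pretrained=facebook/Perception-LM-8B",
        "--tasks", "plm_videobench",
        "--batch_size", "1",
        "--log_samples",
        "--output_path", "./logs/",
    ],
    check=True,
)
```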

The pull request does not break any existing functionality in lmms-eval.

@Luodian (Contributor) commented on Apr 18, 2025

Great work!

@Luodian self-assigned this on Apr 18, 2025
@Luodian merged commit 9596fbd into EvolvingLMMs-Lab:main on Apr 18, 2025 (1 check failed)
@mmaaz60 (Contributor, Author) commented on Apr 18, 2025

Hi @Luodian,

Thank you for reviewing and merging the pull request. I really appreciate it. I wanted to ask if you could update the main README to announce that Perception Language Model and PLM-VideoBench have been added to lmms-eval? Thanks.

@Luodian (Contributor) commented on Apr 19, 2025

> I wanted to ask if you could update the main README to announce that Perception Language Model and PLM-VideoBench have been added to lmms-eval?

Yes, sure. You're welcome to announce it here~

dadwadw233 pushed a commit to dadwadw233/lmms-eval that referenced this pull request Apr 28, 2025
* Implements PLM and PLM-VideoBench from 'PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding'

* Updates docs.

* Removes redundant code.