@ritwickchaudhry
Contributor

This PR fixes an issue in Qwen2VLImageProcessor where the current implementation does not correctly handle cases when the number of video frames is not divisible by temporal_patch_size.

Problem:

The existing logic always repeats the last frame temporal_patch_size - 1 times. This is correct when temporal_patch_size equals 2, but for larger values it pads the wrong number of frames unless exactly one frame is left over.

Solution:

The fix replaces:

repeats = np.repeat(patches[-1][np.newaxis], temporal_patch_size - 1, axis=0)
with:

repeats = np.repeat(patches[-1][np.newaxis], temporal_patch_size - (patches.shape[0] % temporal_patch_size), axis=0)

This ensures that the correct number of padding frames is added when the frame count is not divisible by temporal_patch_size.
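
The difference between the two expressions can be checked with a small NumPy sketch. The function names, the toy array shape, and the surrounding `if` guard are assumptions made for illustration; only the two `np.repeat` lines come from this PR.

```python
import numpy as np


def pad_frames_old(patches, temporal_patch_size):
    """Old behaviour: always append temporal_patch_size - 1 copies of the last frame."""
    # Assumed guard: pad only when the frame count is not already divisible.
    if patches.shape[0] % temporal_patch_size != 0:
        repeats = np.repeat(patches[-1][np.newaxis], temporal_patch_size - 1, axis=0)
        patches = np.concatenate([patches, repeats], axis=0)
    return patches


def pad_frames_fixed(patches, temporal_patch_size):
    """Fixed behaviour: append only as many copies as are needed to reach a multiple."""
    remainder = patches.shape[0] % temporal_patch_size
    if remainder != 0:
        repeats = np.repeat(
            patches[-1][np.newaxis], temporal_patch_size - remainder, axis=0
        )
        patches = np.concatenate([patches, repeats], axis=0)
    return patches


# Toy input: 6 frames of shape (channels=3, height=4, width=4).
frames = np.zeros((6, 3, 4, 4))

old = pad_frames_old(frames, temporal_patch_size=4)    # 6 + 3 = 9 frames
new = pad_frames_fixed(frames, temporal_patch_size=4)  # 6 + 2 = 8 frames

print(old.shape[0] % 4 == 0)  # False: 9 is not a multiple of 4
print(new.shape[0] % 4 == 0)  # True: 8 is a multiple of 4
```

With temporal_patch_size equal to 2 the two functions agree, since the only possible non-zero remainder is 1.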

Additional Changes:

Added a unit test to verify the padding logic for edge cases where the number of frames is not divisible by the patch size.

Issue Reference:

Fixes #38003

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft May 12, 2025 07:01
@ritwickchaudhry
Contributor Author

@zucchini-nlp Could you please review this PR?

@ritwickchaudhry ritwickchaudhry marked this pull request as ready for review May 12, 2025 07:07
@github-actions github-actions bot requested review from qubvel and ydshieh May 12, 2025 07:07
@ritwickchaudhry ritwickchaudhry force-pushed the fix-qwen2vl-temporal-padding branch from f876f69 to 9bd3e16 Compare May 12, 2025 07:12
Member

@zucchini-nlp zucchini-nlp left a comment

Thanks!


# Check the shape after padding
expected_output_video_shape = (102900, 1176) # Adjusted based on padding
self.assertEqual(tuple(encoded_video.shape), expected_output_video_shape)
Member

ultra nit: using assertListEqual can give more informative error output when tests fail :)
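
A minimal sketch of what that suggestion could look like. The test class name and the small placeholder array are hypothetical and stand in for the real encoded video; `assertListEqual` requires both arguments to be lists, so the shape tuple is converted first.

```python
import unittest

import numpy as np


class PaddingShapeTest(unittest.TestCase):
    """Hypothetical test case illustrating the assertListEqual suggestion."""

    def test_shape_after_padding(self):
        # Placeholder array standing in for the processor output.
        encoded_video = np.zeros((8, 1176))
        expected_output_video_shape = [8, 1176]

        # assertListEqual needs lists on both sides, so convert the shape tuple.
        self.assertListEqual(list(encoded_video.shape), expected_output_video_shape)
```

Saved to a file, this can be run with `python -m unittest`.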

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp merged commit fe918d1 into huggingface:main May 14, 2025
11 checks passed
zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request May 14, 2025
…es is not divisible by temporal_patch_size (huggingface#38076)

Qwen2VL: Fix temporal padding in Qwen2VLImageProcessor when frames are not divisible by temporal_patch_size
@yaogang2060
Contributor

qwen3vl_video_processor has the same problem:

repeats = patches[:, -1:].repeat(1, temporal_patch_size - 1, 1, 1, 1)
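
A rough NumPy analog of how the same correction might carry over to a batch-first layout. The function name and the assumed (batch, frames, channels, height, width) axis order are illustrative guesses based on the line above, not the actual video processor code; `np.tile` is used here as the NumPy counterpart of the tensor `.repeat` call.

```python
import numpy as np


def pad_temporal(patches, temporal_patch_size):
    """Pad the frame axis (axis 1) up to a multiple of temporal_patch_size.

    Assumes patches has shape (batch, frames, channels, height, width).
    """
    # Number of frames missing to reach the next multiple (0 if already aligned).
    pad = -patches.shape[1] % temporal_patch_size
    if pad:
        # Tile the last frame `pad` times along the frame axis.
        repeats = np.tile(patches[:, -1:], (1, pad, 1, 1, 1))
        patches = np.concatenate([patches, repeats], axis=1)
    return patches


video = np.arange(2 * 7 * 3 * 4 * 4, dtype=np.float32).reshape(2, 7, 3, 4, 4)
padded = pad_temporal(video, temporal_patch_size=4)
print(padded.shape)  # (2, 8, 3, 4, 4)
```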

@zucchini-nlp
Member

@yaogang2060 can you submit a PR and tag me pls?

Development

Successfully merging this pull request may close these issues.

Potential bug in Qwen 2/2.5 VL Image Preprocessor