Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model #37960

uminaty · 2025-05-05T13:42:58Z

What does this PR do?

This PR adds support for ALL_ATTENTION_FUNCTIONS to the Pixtral model’s attention mechanism. I added and verified compatibility with sdpa, flash_attention_2, and flex_attention. Since Pixtral also serves as the vision tower in Mistral 3.1, users can now set the entire model to use flash_attention_2.
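For readers unfamiliar with the pattern, here is a minimal sketch of name-based attention dispatch. It is an illustration only: `ATTENTION_REGISTRY`, `eager_attention`, and `run_attention` are hypothetical names, the registry holds a single pure-Python reference kernel, and none of this is the library's actual code.

```python
import math
from typing import Callable, Dict, List

Matrix = List[List[float]]


def eager_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Reference attention on nested lists: softmax(q @ k^T / sqrt(d)) @ v."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        # Scaled dot product of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        # Subtract the max before exponentiating for numerical stability.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the value rows, one output column at a time.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
            for j in range(len(v[0]))
        ])
    return out


# Hypothetical registry standing in for a mapping like ALL_ATTENTION_FUNCTIONS.
ATTENTION_REGISTRY: Dict[str, Callable[[Matrix, Matrix, Matrix], Matrix]] = {
    "eager": eager_attention,
}


def run_attention(implementation: str, q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Look up the attention function by name and call it."""
    try:
        attention_fn = ATTENTION_REGISTRY[implementation]
    except KeyError:
        raise ValueError(f"Unknown attention implementation: {implementation!r}") from None
    return attention_fn(q, k, v)
```

The point of the design is that the model code calls one entry point, and swapping the backend only changes the string used for the lookup.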

I tried to follow the implementation pattern of other models using this interface. For flash_attention_2, I reused position_ids because the existing attention mask shape isn’t supported. Since Pixtral uses sequence packing and already generates position_ids, we leverage prepare_fa2_from_position_ids instead of a mask.
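To make the packing idea concrete: FlashAttention's variable-length kernels take cumulative sequence lengths rather than a dense mask, and those boundaries can be recovered from position ids. The sketch below is not the code of `prepare_fa2_from_position_ids`; `cu_seqlens_from_position_ids` is a hypothetical helper, and it assumes a flat 1D list of positions in which every packed sequence starts at position 0.

```python
from typing import List, Tuple


def cu_seqlens_from_position_ids(position_ids: List[int]) -> Tuple[List[int], int]:
    """Recover packed-sequence boundaries from position ids.

    Assumption (for illustration): position_ids is a flat list and each
    packed sequence begins with position 0.
    """
    if not position_ids or position_ids[0] != 0:
        raise ValueError("packed position_ids must be non-empty and start at 0")
    # Every 0 marks the first token of a new sequence in the pack.
    starts = [i for i, pos in enumerate(position_ids) if pos == 0]
    # Cumulative boundaries: each sequence start, then the total token count.
    cu_seqlens = starts + [len(position_ids)]
    # Longest individual sequence, taken from consecutive boundaries.
    max_seqlen = max(end - start for start, end in zip(cu_seqlens, cu_seqlens[1:]))
    return cu_seqlens, max_seqlen
```

For example, three packed sequences of lengths 3, 2, and 4 yield boundaries `[0, 3, 5, 9]` and a maximum length of 4.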

I tested these changes in both training and inference: losses match very closely, and we observe a 10–25% throughput improvement depending on the setup.

Who can review?

github-actions · 2025-05-05T13:43:21Z

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

Disallow using sdpa and output_attentions
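The commit title above points at a common constraint: fused kernels such as PyTorch's `scaled_dot_product_attention` return only the attention output, not the attention weights, so `output_attentions` cannot be honored on that path. A minimal sketch of such a guard follows; `check_attention_config` is a hypothetical name, and raising an error (rather than falling back to eager) is an assumption about the intended behavior.

```python
def check_attention_config(attn_implementation: str, output_attentions: bool) -> None:
    """Reject the combination of sdpa with output_attentions.

    Hypothetical guard: the sdpa path has no attention weights to return,
    so asking for them is treated as a configuration error here.
    """
    if output_attentions and attn_implementation == "sdpa":
        raise ValueError(
            "output_attentions=True is not supported with attn_implementation='sdpa'; "
            "use 'eager' to get attention weights."
        )
```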

qubvel

@uminaty thanks for the PR! Please use library-defined functions as much as possible 🤗 Thank you!

src/transformers/models/pixtral/modeling_pixtral.py

…n from mistral

uminaty · 2025-05-05T21:16:21Z

Thanks @qubvel for the review 🙏! I made the changes you suggested, let me know if anything else is needed.

ArthurZucker

Great addition, thanks! We don't really need the position ids (they should be kwargs imo!)

src/transformers/models/pixtral/modeling_pixtral.py
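One way to read this suggestion is that the layer should not name `position_ids` in its signature and should simply forward extra keyword arguments to whichever backend is selected. The sketch below illustrates that reading only; `mask_backend`, `packed_backend`, and `attention_layer` are hypothetical stand-ins, not functions from the PR.

```python
from typing import Any, Callable, Dict, Optional


def mask_backend(hidden: Any, attention_mask: Optional[Any] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical backend that only needs the mask; extra kwargs are ignored."""
    return {"backend": "mask", "saw_position_ids": False}


def packed_backend(hidden: Any, attention_mask: Optional[Any] = None, **kwargs: Any) -> Dict[str, Any]:
    """Hypothetical backend that reads position_ids out of kwargs when present."""
    position_ids = kwargs.get("position_ids")
    return {"backend": "packed", "saw_position_ids": position_ids is not None}


def attention_layer(
    hidden: Any,
    backend: Callable[..., Dict[str, Any]],
    attention_mask: Optional[Any] = None,
    **kwargs: Any,
) -> Dict[str, Any]:
    """The layer never names position_ids; it forwards kwargs unchanged."""
    return backend(hidden, attention_mask=attention_mask, **kwargs)
```

With this shape, a backend that needs packing information picks it up from `kwargs`, and backends that do not need it are unaffected by the extra argument.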

zucchini-nlp

Oh cool, I also had a PR for attention in VLMs in #37576 😄

src/transformers/models/pixtral/modeling_pixtral.py

HuggingFaceDocBuilderDev · 2025-05-06T13:04:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

uminaty · 2025-05-06T14:19:13Z

Thanks everyone for your reviews! Let me know if anything else is needed before merging 😊

ArthurZucker · 2025-05-08T10:13:17Z

Thanks for the contrib!

…ace#37960)

* Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model
* Fix invalid operand type
* Allow image_sizes to be optional in forward pass to fit tests

  Disallow using sdpa and output_attentions
* Disallow using sdpa with output_attentions
* Delete useless comments, use eager attention from smolvlm, use pattern from mistral
* add _supports_attention_backend
* use kwargs instead of position_ids

---------

Co-authored-by: aurelien.lac <[email protected]>

Add ALL_ATTENTION_FUNCTIONS compatibility for Pixtral model

c67c173

github-actions bot marked this pull request as draft May 5, 2025 13:43

uminaty marked this pull request as ready for review May 5, 2025 13:43

github-actions bot requested review from ArthurZucker and zucchini-nlp May 5, 2025 13:44

Fix invalid operand type

54b71c2

uminaty force-pushed the pixtral-all-attn branch from 7c43f75 to 54b71c2 Compare May 5, 2025 13:57

aurelien.lac added 2 commits May 5, 2025 18:51

Allow image_sizes to be optional in forward pass to fit tests

0d7a1b7

Disallow using sdpa and output_attentions

Disallow using sdpa with output_attentions

71827ac

uminaty force-pushed the pixtral-all-attn branch from 20f777d to 71827ac Compare May 5, 2025 16:52

qubvel reviewed May 5, 2025

Delete useless comments, use eager attention from smolvlm, use pattern from mistral

9e78cee

ArthurZucker approved these changes May 6, 2025

src/transformers/models/pixtral/modeling_pixtral.py (outdated, resolved)

uminaty force-pushed the pixtral-all-attn branch from 9e78cee to 50cc674 Compare May 6, 2025 09:30

zucchini-nlp reviewed May 6, 2025

src/transformers/models/pixtral/modeling_pixtral.py (resolved)

src/transformers/models/pixtral/modeling_pixtral.py (outdated, resolved)

add _supports_attention_backend

9503c77

uminaty force-pushed the pixtral-all-attn branch from 50cc674 to 9503c77 Compare May 6, 2025 10:32

use kwargs instead of position_ids

7457672

ArthurZucker merged commit f6664ee into huggingface:main May 8, 2025
14 checks passed

uminaty deleted the pixtral-all-attn branch May 8, 2025 20:47

uminaty mentioned this pull request May 15, 2025

Hotfix: Flash Attention 2 support in Pixtral #38146

Merged
