Fix handling of attention-bias in MHA fusion #2332
Signed-off-by: Ganesan Ramalingam <[email protected]>
Pull Request Overview
This PR enhances attention-bias (mask) handling in the MHA fusion by enforcing the mask-shape requirements of ORT's contrib ops and expanding 2D masks for broadcasting.
- Adds shape checks to ensure masks are 2D, or 4D with the first two dimensions broadcastable against (B, H)
- Tracks when mask broadcasting is needed via `_use_mask_broadcast`
- Inserts an `Expand` in `rewrite()` to reshape 2D masks to 4D for `MultiHeadAttention` (see the sketch after this list)
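As a rough, hypothetical illustration of the shape constraint described above (the helper name and the handling of symbolic dimensions below are assumptions, not the actual rewriter code), a check along these lines accepts 2D masks and 4D masks whose first two dimensions are either 1 or equal to the batch/head sizes:

```python
from typing import Sequence, Union

Dim = Union[int, str]  # a dimension may be a concrete int or a symbolic name


def mask_shape_is_supported(mask_shape: Sequence[Dim], batch: Dim, num_heads: Dim) -> bool:
    """Hypothetical check mirroring the constraint described above.

    ORT's contrib MHA/Attention expect an attention bias of shape
    (1 or B, 1 or H, S, St): broadcasting is allowed only in the first two
    dimensions. A 2D mask (S, St) can still be fused, but needs an Expand
    to 4D first. The last two dimensions are assumed to be (S, St) and are
    not validated in this sketch.
    """
    if len(mask_shape) == 2:
        return True  # (S, St): fusable after expanding to 4D
    if len(mask_shape) == 4:
        d0, d1 = mask_shape[0], mask_shape[1]
        # Each of the first two dims must be 1, or match B / H exactly
        # (a symbolic dim is accepted only if it is the same symbol).
        return (d0 == 1 or d0 == batch) and (d1 == 1 or d1 == num_heads)
    return False


# Examples:
assert mask_shape_is_supported((8, 8), batch=2, num_heads=12)            # 2D mask
assert mask_shape_is_supported((1, 1, 8, 8), batch=2, num_heads=12)      # fully broadcast
assert mask_shape_is_supported(("B", 12, 8, 8), batch="B", num_heads=12)
assert not mask_shape_is_supported((2, 3, 8, 8), batch=2, num_heads=12)  # head dim mismatch
```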
Comments suppressed due to low confidence (1)
onnxscript/rewriter/ort_fusions/mha.py:285
- [nitpick] The name `mask_dim_2` is ambiguous; consider renaming it to something more descriptive like `mask_seq_len_dim` or `mask_S_or_1` to clarify that this binding holds the S-or-1 dimension.

  mask_dim_2 = bindings.get("S_or_1")
In models generated from PyTorch, masks may have shapes that are broadcastable to (B, H, S, St): e.g., a 2D mask of shape (S, St), or even shape (1, 1, 1, St) in one example.
ONNX's opset 23 Attention op allows masks of this shape. However, ORT's contrib ops (MHA, Attention) allow a mask of shape (1 or B, 1 or H, S, St); that is, they support broadcasting only for the first two dimensions. (Even that is not supported by some earlier versions of ORT, which we do not consider here.)
So, while fusing into MHA, we should expand the mask to ensure it satisfies the constraints of MHA/Attention.
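A minimal sketch of the shape transformation involved, using numpy (whose broadcasting rules ONNX's Expand follows) and assuming the fusion targets the broadcast-compatible 4D form (1, 1, S, St) for a 2D mask:

```python
import numpy as np

# A 2D attention bias of shape (S, St), as often produced from PyTorch models.
S, St = 4, 6
mask_2d = np.zeros((S, St), dtype=np.float32)

# ONNX's Expand op uses numpy-style broadcasting, so expanding against a
# 4D target yields a mask whose first two dimensions are 1 (the only
# dimensions ORT's contrib MHA/Attention will broadcast).
mask_4d = np.broadcast_to(mask_2d, (1, 1, S, St))

print(mask_2d.shape)  # (4, 6)
print(mask_4d.shape)  # (1, 1, 4, 6)
```

Whether the rewriter builds the Expand with this exact target shape or another broadcast-compatible one is an implementation detail of the fusion; the sketch only shows the 2D-to-4D shape change that the ORT ops require.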