Commit 5a8b9e6
Fix cross attention in MHA (#2337)
Fix an apparent bug in the handling of cross-attention in MHA fusion (to be verified). MHA fusion starts from an input graph in which attention is applied to 4D query/key/value and rewrites it into an MHA op that takes 3D query/key/value. In the cross-attention case (with no rotary embedding), the fusion converts only the query to 3D and leaves key and value as 4D, which appears to be wrong. This PR adds the necessary 4D => 3D conversion for key/value before the MHA op.

Note: this is a quick fix for the case that actually shows up; other combinations may be worth checking separately.

---------

Signed-off-by: Ganesan Ramalingam <[email protected]>
Co-authored-by: Justin Chu <[email protected]>
1 parent 77fba51 commit 5a8b9e6

File tree

  • onnxscript/rewriter/ort_fusions/mha.py

1 file changed: +7 -0 lines


onnxscript/rewriter/ort_fusions/mha.py

Lines changed: 7 additions & 0 deletions
@@ -349,6 +349,13 @@ def rewrite(
                 )
             else:
                 key_BSD_emb = key
+        elif self._is_cross_attention:
+            query_BSD_emb = query_BSD
+            # Must convert key/value from 4D to 3D for use in MHA
+            key = op.Transpose(key, perm=[0, 2, 1, 3])
+            key_BSD_emb = op.Reshape(key, op.Constant(value_ints=[0, 0, -1]))
+            value = op.Transpose(value, perm=[0, 2, 1, 3])
+            value = op.Reshape(value, op.Constant(value_ints=[0, 0, -1]))
         else:
             query_BSD_emb = query_BSD
             key_BSD_emb = key
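
For context, the following numpy sketch illustrates the layout change the added lines perform; the tensor names and sizes are illustrative assumptions, not taken from the repository. Transpose with perm=[0, 2, 1, 3] turns a 4D [batch, num_heads, seq_len, head_dim] key/value into [batch, seq_len, num_heads, head_dim], and Reshape with shape [0, 0, -1] (where 0 keeps a dimension and -1 is inferred) flattens the last two axes into the 3D [batch, seq_len, hidden] layout that the MHA op consumes.

import numpy as np

# Illustrative sizes (assumed, not from the commit)
batch, num_heads, seq_len, head_dim = 2, 4, 8, 16

key_4d = np.random.rand(batch, num_heads, seq_len, head_dim).astype(np.float32)

# Equivalent of op.Transpose(key, perm=[0, 2, 1, 3]):
# [batch, num_heads, seq_len, head_dim] -> [batch, seq_len, num_heads, head_dim]
key_bshd = key_4d.transpose(0, 2, 1, 3)

# Equivalent of op.Reshape(key, [0, 0, -1]): keep batch and seq_len, merge the rest
# -> [batch, seq_len, num_heads * head_dim]
key_3d = key_bshd.reshape(batch, seq_len, -1)

assert key_3d.shape == (batch, seq_len, num_heads * head_dim)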
