Skip to content

AMDGPU: Update pattern matching from "x&(-1>>(32-y))" to "bfe x, 0, y" #116115

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 14, 2024

Conversation

changpeng
Copy link
Contributor

@changpeng changpeng commented Nov 13, 2024

It is not correct to lower "x&(-1>>(32-y))" to "bfe x, 0, y". When y equals 32, "-1" is not shifted, so x&(-1>>(32-32) is still x, but "bfe x, 0, 32" is 0. However, if we know y is at most of 5 bits (< 32), we can still do the pattern matching.

  It is not correct to lower x&(-1>>(32-y) to "bfe x, 0, y". When y
equals to 32, "-1" is not shifted, so x&(-1>>(32-32) is still x, but
"bfe x, 0, 32" is 0.
@llvmbot
Copy link
Member

llvmbot commented Nov 13, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Changpeng Fang (changpeng)

Changes

It is not correct to lower x&(-1>>(32-y) to "bfe x, 0, y". When y equals 32, "-1" is not shifted, so x&(-1>>(32-32) is still x, but "bfe x, 0, 32" is 0.


Full diff: https://github.com/llvm/llvm-project/pull/116115.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (-6)
  • (modified) llvm/test/CodeGen/AMDGPU/extract-lowbits.ll (+30-10)
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 5f4cca0645b0ef..2831d0339df6b9 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -3550,12 +3550,6 @@ def : AMDGPUPat <
   (V_BFE_U32_e64 $src, (i32 0), $width)
 >;
 
-// x & (-1 >> (bitwidth - y))
-def : AMDGPUPat <
-  (DivergentBinFrag<and> i32:$src, (srl_oneuse -1, (sub 32, i32:$width))),
-  (V_BFE_U32_e64 $src, (i32 0), $width)
->;
-
 // SHA-256 Ma patterns
 
 // ((x & z) | (y & (x | z))) -> BFI (XOR x, y), z, y
diff --git a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
index 7f1f7133d69919..cc2ef0c48f1152 100644
--- a/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
+++ b/llvm/test/CodeGen/AMDGPU/extract-lowbits.ll
@@ -99,11 +99,21 @@ define i32 @bzhi32_b4_commutative(i32 %val, i32 %numlowbits) nounwind {
 ; ---------------------------------------------------------------------------- ;
 
 define i32 @bzhi32_c0(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_c0:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT:    s_setpc_b64 s[30:31]
+; SI-LABEL: bzhi32_c0:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SI-NEXT:    v_sub_i32_e32 v1, vcc, 32, v1
+; SI-NEXT:    v_lshr_b32_e32 v1, -1, v1
+; SI-NEXT:    v_and_b32_e32 v0, v1, v0
+; SI-NEXT:    s_setpc_b64 s[30:31]
+;
+; VI-LABEL: bzhi32_c0:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    v_sub_u32_e32 v1, vcc, 32, v1
+; VI-NEXT:    v_lshrrev_b32_e64 v1, v1, -1
+; VI-NEXT:    v_and_b32_e32 v0, v1, v0
+; VI-NEXT:    s_setpc_b64 s[30:31]
   %numhighbits = sub i32 32, %numlowbits
   %mask = lshr i32 -1, %numhighbits
   %masked = and i32 %mask, %val
@@ -134,11 +144,21 @@ define i32 @bzhi32_c1_indexzext(i32 %val, i8 %numlowbits) nounwind {
 }
 
 define i32 @bzhi32_c4_commutative(i32 %val, i32 %numlowbits) nounwind {
-; GCN-LABEL: bzhi32_c4_commutative:
-; GCN:       ; %bb.0:
-; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GCN-NEXT:    v_bfe_u32 v0, v0, 0, v1
-; GCN-NEXT:    s_setpc_b64 s[30:31]
+; SI-LABEL: bzhi32_c4_commutative:
+; SI:       ; %bb.0:
+; SI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; SI-NEXT:    v_sub_i32_e32 v1, vcc, 32, v1
+; SI-NEXT:    v_lshr_b32_e32 v1, -1, v1
+; SI-NEXT:    v_and_b32_e32 v0, v0, v1
+; SI-NEXT:    s_setpc_b64 s[30:31]
+;
+; VI-LABEL: bzhi32_c4_commutative:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    v_sub_u32_e32 v1, vcc, 32, v1
+; VI-NEXT:    v_lshrrev_b32_e64 v1, v1, -1
+; VI-NEXT:    v_and_b32_e32 v0, v0, v1
+; VI-NEXT:    s_setpc_b64 s[30:31]
   %numhighbits = sub i32 32, %numlowbits
   %mask = lshr i32 -1, %numhighbits
   %masked = and i32 %val, %mask ; swapped order

  It is fine to match the pattern if we know y has at most five
active bits (< 32).
@changpeng changpeng changed the title AMDGPU: Disable pattern matching from x&(-1>>(32-y) to "bfe x, 0, y" AMDGPU: Update pattern matching from x&(-1>>(32-y) to "bfe x, 0, y" Nov 14, 2024
@jayfoad
Copy link
Contributor

jayfoad commented Nov 14, 2024

Mismatched parens in both summary and description: x&(-1>>(32-y)

@changpeng changpeng changed the title AMDGPU: Update pattern matching from x&(-1>>(32-y) to "bfe x, 0, y" AMDGPU: Update pattern matching from "x&(-1>>(32-y))" to "bfe x, 0, y" Nov 14, 2024
@changpeng
Copy link
Contributor Author

Mismatched parens in both summary and description: x&(-1>>(32-y)
Thanks!

@rovka
Copy link
Collaborator

rovka commented Nov 14, 2024

Hi, I like your patch but I have a couple of questions:

  • Where does it say that v_bfe_u32 x, 0, 32 is 0? The description in the RDNA3 ISA says D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S2.u[4 : 0].u) - 1)), which is unfortunately not very specific about edge cases. I would interpret it as large values cause the 1 to be shifted out, then when we subtract 1 we get 0xFFF...F, so when we do & with it we essentially get x. If the hardware actually returns 0, we should probably talk to folks to make that clear in the ISA.
  • While you're at it, what do v_bfe_i32 and s_bfe_* do and do we have proper tests for them?

@jayfoad
Copy link
Contributor

jayfoad commented Nov 14, 2024

The description in the RDNA3 ISA says D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S2.u[4 : 0].u) - 1))

This says that it only looks at bits 4:0 of S2, so 32 would be treated the same as 0.

@rovka
Copy link
Collaborator

rovka commented Nov 14, 2024

The description in the RDNA3 ISA says D0.u = ((S0.u >> S1.u[4 : 0].u) & 32'U((1 << S2.u[4 : 0].u) - 1))

This says that it only looks at bits 4:0 of S2, so 32 would be treated the same as 0.

Right! Thanks.

@changpeng
Copy link
Contributor Author

changpeng commented Nov 14, 2024 via email

@changpeng changpeng merged commit e3e7c75 into llvm:main Nov 14, 2024
6 of 8 checks passed
@changpeng changpeng deleted the bfe branch November 14, 2024 20:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants