Skip to content

[NVPTX] Improve lowering of v4i8 #67866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Oct 9, 2023
Merged

[NVPTX] Improve lowering of v4i8 #67866

merged 13 commits into from
Oct 9, 2023

Conversation

Artem-B
Copy link
Member

@Artem-B Artem-B commented Sep 29, 2023

Make it a legal type and plumb through lowering of relevant instructions.

Make it a legal type and plumb through lowering of relevant instructions.
Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually trying out the patch in triton it causes some invalid ptx to get emitted (see comment)

Verified that NVPTX tests pass with ptxas being able to compiler PTX produced by
llc tests.
To make things work consisstently for v4i8, we need to implement other vector
ops.
@github-actions
Copy link

github-actions bot commented Oct 5, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@Artem-B Artem-B requested a review from ThomasRaoux October 5, 2023 22:44
@Artem-B
Copy link
Member Author

Artem-B commented Oct 5, 2023

I still need to test it on live code, though we do not have much code that would end up using v4i8.

The generated PTX checked in llvm/test/CodeGen/NVPTX/i8x4-instructions.ll could use extra scrutiny.

@ThomasRaoux
Copy link
Contributor

I ran the patch on our triton kernels and I don't see any functional problems left.

Copy link
Contributor

@ThomasRaoux ThomasRaoux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it required quite a lot of cases to be handled :(
Thanks for doing this, it solves some of the problems triton had with latest LLVM. Changes look good to me.

@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

@Artem-B Artem-B requested a review from d0k October 6, 2023 17:43
@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

Yup, there is a problem:

Successfully custom legalized node
 ... replacing: t10: v4i8 = BUILD_VECTOR Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>
     with:      t17: v4i8 = bitcast Constant:i32<-128>

@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

I see one suspicious failure in tensorflow tests. I suspect I've messed something up in v4i8 comparison.

Yup, there is a problem:

Successfully custom legalized node
 ... replacing: t10: v4i8 = BUILD_VECTOR Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>, Constant:i16<-128>
     with:      t17: v4i8 = bitcast Constant:i32<-128>

Resolved by 9821e90

@Artem-B
Copy link
Member Author

Artem-B commented Oct 6, 2023

Found another issue. We merge four independent byte loads with align 1 into a 32-bit load, which fails at runtime on misaligned pointers.

%t0 = type { [17 x i8] }

@shared_storage = linkonce_odr local_unnamed_addr addrspace(3) global %t0 undef, align 1

define <4 x i8> @in_v4i8(<4 x i8> %x, <4 x i8> %y) nounwind {
  %v = load <4 x i8>, ptr getelementptr inbounds (i8, ptr addrspacecast (ptr addrspace(3) @shared_storage to ptr), i64 9), align 1
  ret <4 x i8> %v
}
        mov.u64         %rd1, shared_storage;
        cvta.shared.u64         %rd2, %rd1;
        ld.u32  %r1, [%rd2+9];
        st.param.b32    [func_retval0+0], %r1;
        ret;

@Artem-B
Copy link
Member Author

Artem-B commented Oct 9, 2023

clang-format failure on GitHub is weird -- it just silently exits with an error.
I ran the same command locally and fixed one place it was not happy about.

The buildkite failure somewhere in RISC-V appears to be unrelated.

@abadams
Copy link

abadams commented Oct 15, 2023

I believe this may be causing failures for u/srem. See #69124

Edit: Also causing failures when you sign-extend the result of a <4 x i8> comparison.

justinfargnoli added a commit that referenced this pull request Oct 10, 2024
In [[NVPTX] Improve lowering of
v4i8](cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
([source](#67866 (comment)))

> Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
ericastor pushed a commit to ericastor/llvm-project that referenced this pull request Oct 10, 2024
In [[NVPTX] Improve lowering of
v4i8](llvm@cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
([source](llvm#67866 (comment)))

> Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
In [[NVPTX] Improve lowering of
v4i8](llvm@cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
([source](llvm#67866 (comment)))

> Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
justinfargnoli added a commit that referenced this pull request Oct 31, 2024
Fix
[failure](#110766 (comment))
identified by @akuegel.

---

In [[NVPTX] Improve lowering of
v4i8](cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
(#67866 (comment))

Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
smallp-o-p pushed a commit to smallp-o-p/llvm-project that referenced this pull request Nov 3, 2024
Fix
[failure](llvm#110766 (comment))
identified by @akuegel.

---

In [[NVPTX] Improve lowering of
v4i8](llvm@cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
(llvm#67866 (comment))

Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
Fix
[failure](llvm#110766 (comment))
identified by @akuegel.

---

In [[NVPTX] Improve lowering of
v4i8](llvm@cbafb6f)
@Artem-B add the ability to lower ISD::BUILD_VECTOR with bfi PTX
instructions. @Artem-B did this because:
(llvm#67866 (comment))

Under the hood byte extraction/insertion ends up as BFI/BFE
instructions, so we may as well do that in PTX, too.
https://godbolt.org/z/Tb3zWbj9b

However, the example that @Artem-B linked was targeting sm_52. On modern
architectures, ptxas uses prmt.b32.
[Example](https://godbolt.org/z/Ye4W1n84o).

Thus, remove uses of NVPTXISD::BFI in favor of NVPTXISD::PRMT.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants