[NVPTX] Preserve v16i8 vector loads when legalizing #67322

Closed

Conversation

pasaulais
Contributor

@pasaulais pasaulais commented Sep 25, 2023

The Load and Store Vectorizer pass combines sequential loads into vector loads, e.g. 16 i8 loads into a v16i8 vector load. Such loads are not legal in PTX, but v4i32 loads (the same vector width) are. This resulted in assembly like the following when every lane in the vector is extracted and extended:

	ld.v4.u8 	{%rs1, %rs2, %rs3, %rs4}, [%rd2+12];
	ld.v4.u8 	{%rs5, %rs6, %rs7, %rs8}, [%rd2+8];
	ld.v4.u8 	{%rs9, %rs10, %rs11, %rs12}, [%rd2+4];
	ld.v4.u8 	{%rs13, %rs14, %rs15, %rs16}, [%rd2];

These changes add a new DAG combine for v16i8 loads, which replaces the original load with a v4i32 load and a bitcast.

This results in the following code:

	ld.v4.u32 	{%r1, %r2, %r3, %r4}, [%rd2];

v16i8 loads are lowered into LoadV4 operations with i32 results instead of letting ReplaceLoadVector split them into smaller loads during legalization. This is done at dag-combine1 time, so that vector operations with i8 elements can be optimized away instead of being needlessly split during legalization.
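
For illustration, here is a minimal sketch of what such a target DAG combine can look like. This is not the actual patch (which builds an NVPTXISD::LoadV4 node directly rather than a generic v4i32 load); the function name is hypothetical and checks for alignment, address space and users are omitted.

// Hypothetical sketch (would live in NVPTXISelLowering.cpp): rewrite a
// v16i8 load as a v4i32 load plus a bitcast during DAG combining, so that
// later combines see four 32-bit lanes instead of sixteen i8 elements.
static SDValue combineV16I8Load(SDNode *N,
                                TargetLowering::DAGCombinerInfo &DCI) {
  auto *LD = cast<LoadSDNode>(N); // called for ISD::LOAD nodes only
  if (LD->getValueType(0) != MVT::v16i8 || !LD->isSimple() ||
      LD->getExtensionType() != ISD::NON_EXTLOAD)
    return SDValue();

  SelectionDAG &DAG = DCI.DAG;
  SDLoc DL(LD);
  // Load the same 16 bytes as a single v4i32, which selects to ld.v4.u32.
  SDValue WideLoad =
      DAG.getLoad(MVT::v4i32, DL, LD->getChain(), LD->getBasePtr(),
                  LD->getPointerInfo(), LD->getOriginalAlign());
  // Reinterpret the result as v16i8 and replace both the value and the chain.
  SDValue Cast = DAG.getBitcast(MVT::v16i8, WideLoad);
  DCI.CombineTo(N, Cast, WideLoad.getValue(1));
  return SDValue(N, 0);
}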

@github-actions

github-actions bot commented Sep 25, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from e3702e7 to 012e4cc on September 25, 2023 at 12:52
// elements can be optimised away instead of being needlessly split during
// legalization, which involves storing to the stack and loading it back.
EVT VT = N->getValueType(0);
if (VT != MVT::v16i8)
Member

Do we want to generalize it to v8i8, too?

Contributor Author

I'm not sure if that would work as well as v16i8, since I don't think there is a ld.v2.b32 instruction we could use. It would mean having to create two NVPTXISD::LoadV* nodes here and duplicating some code from ReplaceLoadVector.

By the way, I also tried making this change in ReplaceLoadVector instead of adding a DAG combine for LOAD nodes. I backtracked because it was creating stack operations. I didn't check again after your recent commit was merged, but maybe that works better now.

Member

I don't think there is a ld.v2.b32 instruction we could use

V2 ld/st variants do exist:

def _v2_avar : NVPTXInst<

The code is easily parametrizable by NumElts.
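
To illustrate the "parametrizable by NumElts" point: a hypothetical helper (not from the patch) could map the original i8 vector type to the widened 32-bit vector type, so the same combine would cover both the v4 and v2 load variants.

// Illustrative only: pick the widened 32-bit load type for an i8 vector.
static std::optional<MVT> getWidenedLoadVT(EVT VT) {
  if (VT == MVT::v16i8)
    return MVT(MVT::v4i32); // selected as ld.v4.u32
  if (VT == MVT::v8i8)
    return MVT(MVT::v2i32); // selected as ld.v2.u32
  return std::nullopt;      // other types: leave to the default legalization
}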

@Artem-B
Member

Artem-B commented Sep 26, 2023

On second thought, we may be papering over the real problem: LLVM currently ends up generating rather slow code whenever we need to do anything with v4i8. If we improve that, then special-casing the lowering of v16i8/v8i8 etc. will become unnecessary.

I've started working on this; I should have an idea within a day or two whether the general improvement in v4i8 lowering would be sufficient to address this particular scenario, too.

@pasaulais
Contributor Author

Yes, having to keep adding DAG combines to stop the legalizer from creating less efficient code is not ideal. Could we make v4i8 a legal type? That would probably mean more changes to the backend overall (and marking a whole lot of operations as Expand), but it might be cleaner and easier to optimize.
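
As a rough illustration of what making v4i8 legal involves in an LLVM target (a sketch under assumptions, not the patch that eventually became #67866): the type gets a register class, and most operations on it are marked Expand or Custom in the NVPTXTargetLowering constructor, for example:

// Sketch only: carry v4i8 in a 32-bit register and expand the operations
// that have no direct PTX equivalent.
addRegisterClass(MVT::v4i8, &NVPTX::Int32RegsRegClass);
for (unsigned Op : {ISD::ADD, ISD::SUB, ISD::MUL, ISD::AND, ISD::OR, ISD::XOR,
                    ISD::SHL, ISD::SRA, ISD::SRL})
  setOperationAction(Op, MVT::v4i8, Expand);
// Some operations would likely need Custom lowering to byte extract/insert
// sequences on the underlying 32-bit value.
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i8, Custom);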

@Artem-B
Member

Artem-B commented Sep 26, 2023

Could we make v4i8 a legal type?

That's roughly the idea.

@pasaulais
Contributor Author

Is it worth separating out the i8 extraction part from these changes and only keeping the v16i8 load part for now? I have not updated this PR with your suggestions yet, as it may need reworking anyway after your v4i8 work.

@Artem-B
Member

Artem-B commented Sep 28, 2023

whether the general improvement in v4i8 lowering would be sufficient to address this particular scenario, too.

Well, that brought less benefit than I hoped for. Making v4i8 a legal type helps to avoid issues in other areas but does not help much with this particular case. Your changes are still useful and needed.

Is it worth separating out the i8 extraction part from these changes and only keeping the v16i8 load part for now? I have not updated this PR with your suggestions yet, as it may need reworking anyway after your v4i8 work.

Considering that these improvements are independent, it may indeed be a good idea to split them into separate patches.

@pasaulais pasaulais changed the title from "[NVPTX] Optimize v16i8 reductions" to "[NVPTX] Preserve v16i8 vector loads when legalizing" on Sep 29, 2023
@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from 012e4cc to 99659f6 on September 29, 2023 at 20:23
@pasaulais
Contributor Author

Thanks. I have integrated your suggestions into the first commit (preserving v16i8 loads) and removed the second commit from this PR. I'll create a new PR for the i8 extraction DAG combine when this one is merged.

@Artem-B
Member

Artem-B commented Sep 29, 2023

Here's my PR for making v4i8 legal: #67866

It does improve some cases where we were previously stuck with ld/st.v4.i8 (see unfold-masked-merge-vector-variablemask.ll in the PR above), but, AFAICT, does not help with your test case combine_v16i8.

Update: My PR does help with v16i8, just not in your example, for some reason. In the test case @out_v16i8 in unfold-masked-merge-vector-variablemask.ll we now vectorize loading of the v16i8 arguments into:

define <16 x i8> @out_v16i8(<16 x i8> %x, <16 x i8> %y, <16 x i8> %mask)
; CHECK-NEXT:    ld.param.v4.u32 {%r1, %r2, %r3, %r4}, [out_v16i8_param_1];
; CHECK-NEXT:    ld.param.v4.u32 {%r9, %r10, %r11, %r12}, [out_v16i8_param_2];
; CHECK-NEXT:    ld.param.v4.u32 {%r17, %r18, %r19, %r20}, [out_v16i8_param_0];

@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch 3 times, most recently from 488fac3 to 3349110 on September 30, 2023 at 09:46
This is done by lowering v16i8 loads into LoadV4 operations with i32
results instead of letting ReplaceLoadVector split it into smaller
loads during legalization. This is done at dag-combine1 time, so that
vector operations with i8 elements can be optimised away instead of
being needlessly split during legalization, which involves storing to
the stack and loading it back.
@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from 3349110 to a52cf9a on October 11, 2023 at 15:29
@pasaulais
Contributor Author

Sorry for the delay @Artem-B, I was unexpectedly OoO last week. I have addressed your comments and all the checks are passing after rebasing; could you merge this commit? I don't have commit access to this repo.

@pasaulais pasaulais requested a review from Artem-B October 18, 2023 10:07

@pasaulais
Contributor Author

I didn't know about the v2 variants, thanks. This is something I'd rather revisit in a future PR, as this is quite an old patch now and I still need to create a PR for the second part of the original changes once this one is merged.

@ldrumm
Contributor

ldrumm commented Oct 19, 2023

To github.com:llvm/llvm-project.git                                                                                                                            
   906d3ff054b0..0b80288e9e0b  main -> main 

@ldrumm ldrumm closed this Oct 19, 2023
@bondhugula
Contributor

bondhugula commented Nov 23, 2023

@pasaulais @Artem-B These are all really useful changes for v16i8, thanks! While experimenting, I found a related scenario that has been missed: while current trunk ensures v4.u32 is used for 16 x i8 loads, this isn't the case for 16 x i8 stores that don't have a load feeding into them. The simplest test case is below; llc generates four st.u32 here, while in theory st.v4.b32 or st.v4.u32 could have been used. I can contribute a patch to make this efficient as well, using the same approach, if that makes sense: store v16 x i8 -> bitcast v16i8 to v4 x (v4 x i8) + store as v4 x i32 (a rough sketch of this follows the generated PTX below).

; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s

; CHECK-LABEL: foo3
define void @foo3(ptr %a, ptr %b, <16 x i8> %v) {
; CHECK: st.v4.b32
  store <16 x i8> %v, ptr %b
  ret void
}
//
// Generated by LLVM NVPTX Back-End
//

.version 7.0
.target sm_80
.address_size 32

	// .globl	foo3                    // -- Begin function foo3
                                        // @foo3
.visible .func foo3(
	.param .b32 foo3_param_0,
	.param .b32 foo3_param_1,
	.param .align 16 .b8 foo3_param_2[16]
)
{
	.reg .b32 	%r<6>;

// %bb.0:
	ld.param.u32 	%r1, [foo3_param_1];
	ld.param.v4.u32 	{%r2, %r3, %r4, %r5}, [foo3_param_2];
	st.u32 	[%r1+12], %r5;
	st.u32 	[%r1+8], %r4;
	st.u32 	[%r1+4], %r3;
	st.u32 	[%r1], %r2;
	ret;
                                        // -- End function
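
A rough sketch of the store-side combine described above, with a hypothetical function name and simplified checks (the actual patch became #73646): bitcast the v16i8 value to v4i32 and emit a single v4i32 store, mirroring the load combine.

// Hypothetical sketch: rewrite a v16i8 store as a bitcast to v4i32 plus a
// v4i32 store, so it can be selected as st.v4.u32.
static SDValue combineV16I8Store(StoreSDNode *ST, SelectionDAG &DAG) {
  SDValue Val = ST->getValue();
  if (Val.getValueType() != MVT::v16i8 || !ST->isSimple() ||
      ST->isTruncatingStore())
    return SDValue();

  SDLoc DL(ST);
  // Reinterpret the 16 bytes as four 32-bit lanes and emit one wide store.
  SDValue Cast = DAG.getBitcast(MVT::v4i32, Val);
  return DAG.getStore(ST->getChain(), DL, Cast, ST->getBasePtr(),
                      ST->getPointerInfo(), ST->getOriginalAlign());
}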

On a related note, the following patch is a much smaller unit test case for this PR:

--- a/llvm/test/CodeGen/NVPTX/vector-stores.ll
+++ b/llvm/test/CodeGen/NVPTX/vector-stores.ll
@@ -29,3 +29,11 @@ define void @foo4(<4 x i32> %val, ptr %ptr) {
   ret void
 }
 
+; CHECK-LABEL: @v16i8
+define void @v16i8(ptr %a, ptr %b) {
+; CHECK: ld.v4.u32
+; CHECK: st.v4.u32
+  %v = load <16 x i8>, ptr %a
+  store <16 x i8> %v, ptr %b
+  ret void
+}

@Artem-B
Member

Artem-B commented Nov 24, 2023

SGTM. Bonus points for covering v8i8, too.

@bondhugula
Contributor

SGTM. Bonus points for covering v8i8, too.

Here it is: #73646

@pasaulais
Contributor Author

Thanks @bondhugula! It's great to have this for stores too, and a much more concise reproducer.

@dakersnar
Contributor

Hi @Artem-B, I'm looking into filling out some of the related missing handling, specifically for load v8i8, store v8i8, and store v16i8. Digging through the history, it looks like all of these were in development at some point but fell off.

I'd like to drive this effort, but I want to try to arrive at a cohesive design first. Can I pick your brain on some things? Specifically, it seems that we have two separate mechanisms for lowering sub-32-bit vectors by "widening" the element types to 32-bit, and I want to try to understand why that is the case and whether we can converge on one of them to handle all cases.

  1. In the custom handling in NVPTXISelLowering::ReplaceLoadVector (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L6322), v8i16 is replaced with v4(2xi16) and eventually lowered as v4.b32 with some custom logic in NVPTXDAGToDAGISel::tryLoadVector. As far as I can tell this originates here with just v8f16, but now covers v8 vectors of i16, f16, and bf16.

  2. Alternatively, load v16xi8 is handled by this PR as a DagCombine, and future work was in development for load v8i8, store v8i8, and store v16i8 by extending the DagCombine mechanism, see [NVPTX] Preserve v16i8 vector loads when legalizing #67322 (comment) and [NVPTX] Lower 16xi8 and 8xi8 stores efficiently #73646 (review).

Questions/assumptions:

  1. My assumption is that this PR's design of using a DagCombine was chosen because, as you all discuss in the comments, v4i8 was not a legal type at the time of its creation, so the type legalizer handling wouldn't have been able to extend the pattern of v4(2xi16) to do something similar for v8(4xi8). Is this correct, or was it an even simpler case of "no one thought to do it that way"?
  2. Now that v4i8 is a legal type, is there any reason why v8i16 and v16i8 should not be handled by the same mechanism?
  3. If there is not, which mechanism would we prefer to converge on? In other words, would it make more sense to extend the DagCombine to handle v8[i/f/bf]16, or extend the type legalizer to handle v[8,16]i8?
  4. The PTX output of these two methods differs in a key way: the type legalizer mechanism ends up with vectors of b32s, while the DagCombine ends up with vectors of u32s. I assume we would rather have everything canonicalized to one or the other, correct? Should that be b32 if possible? Or do we not really care?

cc @justinfargnoli, @AlexMaclean, @akrolik for viz.

@Artem-B
Member

Artem-B commented Dec 4, 2024

If you have a specific use case where it's worth the trouble, file an issue with a reproducer on Compiler Explorer and we can take a look at what can be done about it.

would it make more sense to extend the DagCombine to handle v8[i/f/bf]16, or extend the type legalizer to handle v[8,16]i8?

I'm afraid there may be no clear-cut answer. NVPTX is a rather odd target that breaks LLVM's assumptions. We have no vector instructions to speak of, nor do we have any vector register types (v2f16/v2bf16 hardly count). We carry 8-bit values in 16-bit registers. On the other hand, we have vector loads/stores, but they can't fully handle v8i8, v8i16 or v16i8 without extra glue code to convert those to b32/b64 and back. The conversions often kill any benefit of vectorizing the wide load/store, so it may not be worth it in the general case, where the rest of the code uses the loaded data in an element-wise fashion.

The PTX output of these two methods differs in a key way: the type legalizer mechanism ends up with vectors of b32s, while the DagCombine ends up with vectors of u32s.

b32/u32 makes no difference for 32-bit loads/stores, so either way is fine.

A clear demonstration of a valid use case, and a patch demonstrating improved code generation would be welcome, and we can discuss the details there. Without actually implementing a prototype and seeing how clean/messy it is, it's hard to tell ahead of time what would work best here.

@dakersnar
Contributor

Sounds good. I opened an issue here: #118851; we can continue the discussion there 👍.

One note I'll comment on here:

The conversions often kill any benefit of vectorizing the wide load/store, so it may not be worth it in the general case, where the rest of the code uses the loaded data in an element-wise fashion.

My understanding is that chunks of load/store v4i8 already get lowered to u32, so the conversion/extraction is happening regardless; but currently v8i8 gets lowered to two u32s instead of a v2.u32, which seems like a clear loss. See the repro in my issue.
