[NVPTX] Preserve v16i8 vector loads when legalizing #67322

Closed

Conversation

pasaulais
Contributor

@pasaulais pasaulais commented Sep 25, 2023

The Load and Store Vectorizer pass combines sequential loads into vector loads, e.g. 16 i8 loads into a v16i8 vector load. Such loads are not legal in PTX, but v4i32 loads (the same vector width) are. This resulted in assembly like the following when every lane in the vector is extracted and extended:

	ld.v4.u8 	{%rs1, %rs2, %rs3, %rs4}, [%rd2+12];
	ld.v4.u8 	{%rs5, %rs6, %rs7, %rs8}, [%rd2+8];
	ld.v4.u8 	{%rs9, %rs10, %rs11, %rs12}, [%rd2+4];
	ld.v4.u8 	{%rs13, %rs14, %rs15, %rs16}, [%rd2];

These changes add a new DAG combine for v16i8 loads, which replaces the original load with a v4i32 load and a bitcast.

This results in the following code:

	ld.v4.u32 	{%r1, %r2, %r3, %r4}, [%rd2];

v16i8 loads are lowered into LoadV4 operations with i32 results instead of letting ReplaceLoadVector split them into smaller loads during legalization. This is done at dag-combine1 time, so that vector operations with i8 elements can be optimized away instead of being needlessly split during legalization.
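
For illustration, here is a minimal sketch of what such a target DAG combine can look like. This is not the actual patch (which builds an NVPTXISD::LoadV4 node directly rather than a generic v4i32 load); the function name is hypothetical and checks for alignment, address space and users are omitted.

// Hypothetical sketch (would live in NVPTXISelLowering.cpp): rewrite a
// v16i8 load as a v4i32 load plus a bitcast during DAG combining, so that
// later combines see four 32-bit lanes instead of sixteen i8 elements.
static SDValue combineV16I8Load(SDNode *N,
                                TargetLowering::DAGCombinerInfo &DCI) {
  auto *LD = cast<LoadSDNode>(N); // called for ISD::LOAD nodes only
  if (LD->getValueType(0) != MVT::v16i8 || !LD->isSimple() ||
      LD->getExtensionType() != ISD::NON_EXTLOAD)
    return SDValue();

  SelectionDAG &DAG = DCI.DAG;
  SDLoc DL(LD);
  // Load the same 16 bytes as a single v4i32, which selects to ld.v4.u32.
  SDValue WideLoad =
      DAG.getLoad(MVT::v4i32, DL, LD->getChain(), LD->getBasePtr(),
                  LD->getPointerInfo(), LD->getOriginalAlign());
  // Reinterpret the result as v16i8 and replace both the value and the chain.
  SDValue Cast = DAG.getBitcast(MVT::v16i8, WideLoad);
  DCI.CombineTo(N, Cast, WideLoad.getValue(1));
  return SDValue(N, 0);
}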

@github-actions

github-actions bot commented Sep 25, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from e3702e7 to 012e4cc on September 25, 2023 at 12:52
// elements can be optimised away instead of being needlessly split during
// legalization, which involves storing to the stack and loading it back.
EVT VT = N->getValueType(0);
if (VT != MVT::v16i8)
Member

Do we want to generalize it to v8i8, too?

Contributor Author

I'm not sure if that would work as well as v16i8, since I don't think there is a ld.v2.b32 instruction we could use. It would mean having to create two NVPTXISD::LoadV* nodes here and duplicating some code from ReplaceLoadVector.

By the way, I also tried making this change in ReplaceLoadVector instead of adding a DAG combine for LOAD nodes. I backtracked because it was creating stack operations. I didn't check again after your recent commit was merged, but maybe that works better now.

Member

I don't think there is a ld.v2.b32 instruction we could use

V2 ld/st variants do exist:

def _v2_avar : NVPTXInst<

The code is easily parametrizable by NumElts.
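
To illustrate the "parametrizable by NumElts" point: a hypothetical helper (not from the patch) could map the original i8 vector type to the widened 32-bit vector type, so the same combine would cover both the v4 and v2 load variants.

// Illustrative only: pick the widened 32-bit load type for an i8 vector.
static std::optional<MVT> getWidenedLoadVT(EVT VT) {
  if (VT == MVT::v16i8)
    return MVT(MVT::v4i32); // selected as ld.v4.u32
  if (VT == MVT::v8i8)
    return MVT(MVT::v2i32); // selected as ld.v2.u32
  return std::nullopt;      // other types: leave to the default legalization
}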

@Artem-B
Member

Artem-B commented Sep 26, 2023

On second thought, we may be papering over the real problem: LLVM currently ends up generating rather slow code whenever we need to do anything with v4i8. If we improve that, then special-casing the lowering of v16i8/v8i8 etc. will become unnecessary.

I've started working on this; I should have an idea within a day or two whether the general improvement in v4i8 lowering would be sufficient to address this particular scenario, too.

@pasaulais
Contributor Author

Yes, having to keep adding DAG combines to stop the legalizer from creating less efficient code is not ideal. Could we make v4i8 a legal type? That would probably mean more changes to the backend overall (and marking a whole lot of operations as Expand), but it might be cleaner and easier to optimize.
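
As a rough illustration of what making v4i8 legal involves in an LLVM target (a sketch under assumptions, not the patch that eventually became #67866): the type gets a register class, and most operations on it are marked Expand or Custom in the NVPTXTargetLowering constructor, for example:

// Sketch only: carry v4i8 in a 32-bit register and expand the operations
// that have no direct PTX equivalent.
addRegisterClass(MVT::v4i8, &NVPTX::Int32RegsRegClass);
for (unsigned Op : {ISD::ADD, ISD::SUB, ISD::MUL, ISD::AND, ISD::OR, ISD::XOR,
                    ISD::SHL, ISD::SRA, ISD::SRL})
  setOperationAction(Op, MVT::v4i8, Expand);
// Some operations would likely need Custom lowering to byte extract/insert
// sequences on the underlying 32-bit value.
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i8, Custom);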

@Artem-B
Member

Artem-B commented Sep 26, 2023

Could we make v4i8 a legal type?

That's roughly the idea.

@pasaulais
Contributor Author

Is it worth separating out the i8 extraction part from these changes and only keeping the v16i8 load part for now? I have not updated this PR with your suggestions yet, as it may need reworking anyway after your v4i8 work.

@Artem-B
Member

Artem-B commented Sep 28, 2023

whether the general improvement in v4i8 lowering would be sufficient to address this particular scenario, too.

Well, that brought less benefit than I hoped for. Making v4i8 a legal type helps to avoid issues in other areas but does not help much with this particular case. Your changes are still useful and needed.

Is it worth separating out the i8 extraction part from these changes and only keeping the v16i8 load part for now? I have not updated this PR with your suggestions yet, as it may need reworking anyway after your v4i8 work.

Considering that these improvements are independent, it may indeed be a good idea to split them into separate patches.

@pasaulais pasaulais changed the title from "[NVPTX] Optimize v16i8 reductions" to "[NVPTX] Preserve v16i8 vector loads when legalizing" on Sep 29, 2023
@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from 012e4cc to 99659f6 on September 29, 2023 at 20:23
@pasaulais
Contributor Author

Thanks. I have integrated your suggestions into the first commit (preserving v16i8 loads) and removed the second commit from this PR. I'll create a new PR for the i8 extraction DAG combine when this one is merged.

@Artem-B
Member

Artem-B commented Sep 29, 2023

Here's my PR for making v4i8 legal: #67866

It does improve some cases where we were previously stuck with ld/st.v4.i8 (see unfold-masked-merge-vector-variablemask.ll in the PR above), but, AFAICT, does not help with your test case combine_v16i8.

Update: My PR does help with v16i8, just not in your example, for some reason. In the test case @out_v16i8 in unfold-masked-merge-vector-variablemask.ll we now vectorize loading of the v16i8 arguments into:

define <16 x i8> @out_v16i8(<16 x i8> %x, <16 x i8> %y, <16 x i8> %mask)
; CHECK-NEXT:    ld.param.v4.u32 {%r1, %r2, %r3, %r4}, [out_v16i8_param_1];
; CHECK-NEXT:    ld.param.v4.u32 {%r9, %r10, %r11, %r12}, [out_v16i8_param_2];
; CHECK-NEXT:    ld.param.v4.u32 {%r17, %r18, %r19, %r20}, [out_v16i8_param_0];

@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch 3 times, most recently from 488fac3 to 3349110 on September 30, 2023 at 09:46
This is done by lowering v16i8 loads into LoadV4 operations with i32
results instead of letting ReplaceLoadVector split it into smaller
loads during legalization. This is done at dag-combine1 time, so that
vector operations with i8 elements can be optimised away instead of
being needlessly split during legalization, which involves storing to
the stack and loading it back.
@pasaulais pasaulais force-pushed the nvptx-opt-v16i8-reduction branch from 3349110 to a52cf9a on October 11, 2023 at 15:29
@pasaulais
Contributor Author

Sorry for the delay @Artem-B, I was unexpectedly OoO last week. I have addressed your comments and all the checks are passing after rebasing; could you merge this commit? I don't have commit access to this repo.

@pasaulais pasaulais requested a review from Artem-B October 18, 2023 10:07

@pasaulais
Contributor Author

I didn't know about the v2 variants, thanks. This is something I'd rather revisit in a future PR, as this is quite an old patch now and I still need to create a PR for the second part of the original changes once this one is merged.

@ldrumm
Contributor

ldrumm commented Oct 19, 2023

To github.com:llvm/llvm-project.git                                                                                                                            
   906d3ff054b0..0b80288e9e0b  main -> main 

@ldrumm ldrumm closed this Oct 19, 2023
@bondhugula
Contributor

bondhugula commented Nov 23, 2023

@pasaulais @Artem-B These are all really useful changes for v16i8, thanks! While experimenting, I found a related scenario that has been missed: while current trunk ensures v4.u32 is used for 16 x i8 loads, this isn't the case for 16 x i8 stores that don't have a load feeding into them. The simplest test case is below; llc generates four st.u32 here, while in theory st.v4.b32 or st.v4.u32 could have been used. I can contribute a patch to make this efficient as well, using the same approach, if that makes sense: store v16 x i8 -> bitcast v16i8 to v4 x (v4 x i8) + store as v4 x i32 (a rough sketch of this follows the generated PTX below).

; RUN: llc < %s -march=nvptx -mcpu=sm_20 | FileCheck %s

; CHECK-LABEL: foo3
define void @foo3(ptr %a, ptr %b, <16 x i8> %v) {
; CHECK: st.v4.b32
  store <16 x i8> %v, ptr %b
  ret void
}
//
// Generated by LLVM NVPTX Back-End
//

.version 7.0
.target sm_80
.address_size 32

	// .globl	foo3                    // -- Begin function foo3
                                        // @foo3
.visible .func foo3(
	.param .b32 foo3_param_0,
	.param .b32 foo3_param_1,
	.param .align 16 .b8 foo3_param_2[16]
)
{
	.reg .b32 	%r<6>;

// %bb.0:
	ld.param.u32 	%r1, [foo3_param_1];
	ld.param.v4.u32 	{%r2, %r3, %r4, %r5}, [foo3_param_2];
	st.u32 	[%r1+12], %r5;
	st.u32 	[%r1+8], %r4;
	st.u32 	[%r1+4], %r3;
	st.u32 	[%r1], %r2;
	ret;
                                        // -- End function
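
A rough sketch of the store-side combine described above, with a hypothetical function name and simplified checks (the actual patch became #73646): bitcast the v16i8 value to v4i32 and emit a single v4i32 store, mirroring the load combine.

// Hypothetical sketch: rewrite a v16i8 store as a bitcast to v4i32 plus a
// v4i32 store, so it can be selected as st.v4.u32.
static SDValue combineV16I8Store(StoreSDNode *ST, SelectionDAG &DAG) {
  SDValue Val = ST->getValue();
  if (Val.getValueType() != MVT::v16i8 || !ST->isSimple() ||
      ST->isTruncatingStore())
    return SDValue();

  SDLoc DL(ST);
  // Reinterpret the 16 bytes as four 32-bit lanes and emit one wide store.
  SDValue Cast = DAG.getBitcast(MVT::v4i32, Val);
  return DAG.getStore(ST->getChain(), DL, Cast, ST->getBasePtr(),
                      ST->getPointerInfo(), ST->getOriginalAlign());
}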

On a related note, the following patch is a much smaller unit test case for this PR:

--- a/llvm/test/CodeGen/NVPTX/vector-stores.ll
+++ b/llvm/test/CodeGen/NVPTX/vector-stores.ll
@@ -29,3 +29,11 @@ define void @foo4(<4 x i32> %val, ptr %ptr) {
   ret void
 }
 
+; CHECK-LABEL: @v16i8
+define void @v16i8(ptr %a, ptr %b) {
+; CHECK: ld.v4.u32
+; CHECK: st.v4.u32
+  %v = load <16 x i8>, ptr %a
+  store <16 x i8> %v, ptr %b
+  ret void
+}

@Artem-B
Member

Artem-B commented Nov 24, 2023

SGTM. Bonus points for covering v8i8, too.

@bondhugula
Contributor

SGTM. Bonus points for covering v8i8, too.

Here it is: #73646

@pasaulais
Contributor Author

Thanks @bondhugula! It's great to have this for stores too, and a much more concise reproducer.

@dakersnar
Contributor

Hi @Artem-B, I'm looking into filling out some of the related missing handling, specifically for load v8i8, store v8i8, and store v16i8. Digging through the history, it looks like all of these were in development at some point but fell off.

I'd like to drive this effort, but I want to try to arrive at a cohesive design first. Can I pick your brain on some things? Specifically, it seems that we have two separate mechanisms for lowering sub-32-bit vectors by "widening" the element types to 32-bit, and I want to try to understand why that is the case and whether we can converge on one of them to handle all cases.

  1. In the custom handling in NVPTXISelLowering::ReplaceLoadVector (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp#L6322), v8i16 is replaced with v4(2xi16) and eventually lowered as v4.b32 with some custom logic in NVPTXDAGToDAGISel::tryLoadVector. As far as I can tell this originates here with just v8f16, but now covers v8 vectors of i16, f16, and bf16.

  2. Alternatively, load v16xi8 is handled by this PR as a DagCombine, and future work was in development for load v8i8, store v8i8, and store v16i8 by extending the DagCombine mechanism, see [NVPTX] Preserve v16i8 vector loads when legalizing #67322 (comment) and [NVPTX] Lower 16xi8 and 8xi8 stores efficiently #73646 (review).

Questions/assumptions:

  1. My assumption is that this PR's design of using a DagCombine was chosen because, as you all discuss in the comments, v4i8 was not a legal type at the time of its creation, so the type legalizer handling wouldn't have been able to extend the pattern of v4(2xi16) to do something similar for v8(4xi8). Is this correct, or was it an even simpler case of "no one thought to do it that way"?
  2. Now that v4i8 is a legal type, is there any reason why v8i16 and v16i8 should not be handled by the same mechanism?
  3. If there is not, which mechanism would we prefer to converge on? In other words, would it make more sense to extend the DagCombine to handle v8[i/f/bf]16, or extend the type legalizer to handle v[8,16]i8?
  4. The PTX output of these two methods differs in a key way: the type legalizer mechanism ends up with vectors of b32s, while the DagCombine ends up with vectors of u32s. I assume we would rather have everything canonicalized to one or the other, correct? Should that be b32 if possible? Or do we not really care?

cc @justinfargnoli, @AlexMaclean, @akrolik for viz.

@Artem-B
Member

Artem-B commented Dec 4, 2024

If you have a specific use case where it's worth the trouble, file an issue with a reproducer on Compiler Explorer and we can take a look at what can be done about it.

would it make more sense to extend the DagCombine to handle v8[i/f/bf]16, or extend the type legalizer to handle v[8,16]i8?

I'm afraid there may be no clear-cut answer. NVPTX is a rather odd target that breaks LLVM's assumptions. We have no vector instructions to speak of, nor do we have any vector register types (v2f16/v2bf16 hardly count). We carry 8-bit values in 16-bit registers. On the other hand, we have vector loads/stores, but they can't fully handle v8i8, v8i16 or v16i8 without extra glue code to convert those to b32/b64 and back. The conversions often kill any benefit of vectorizing the wide load/store, so it may not be worth it in the general case, where the rest of the code uses the loaded data in an element-wise fashion.

The PTX output of these two methods differs in a key way: the type legalizer mechanism ends up with vectors of b32s, while the DagCombine ends up with vectors of u32s.

b32/u32 makes no difference for 32-bit loads/stores, so either way is fine.

A clear demonstration of a valid use case, and a patch demonstrating improved code generation would be welcome, and we can discuss the details there. Without actually implementing a prototype and seeing how clean/messy it is, it's hard to tell ahead of time what would work best here.

@dakersnar
Contributor

Sounds good. I opened an issue here: #118851; we can continue the discussion there 👍.

One note I'll comment on here:

The conversions often kill any benefit of vectorizing the wide load/store, so it may not be worth it in the general case, where the rest of the code uses the loaded data in an element-wise fashion.

My understanding is that chunks of load/store v4i8 already get lowered to u32, so the conversion/extraction is happening regardless; but currently v8i8 gets lowered to two u32s instead of a v2.u32, which seems like a clear loss. See the repro in my issue.
