[NVPTX] Preserve v16i8 vector loads when legalizing #67322
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
Force-pushed from e3702e7 to 012e4cc
// elements can be optimised away instead of being needlessly split during
// legalization, which involves storing to the stack and loading it back.
EVT VT = N->getValueType(0);
if (VT != MVT::v16i8)
Do we want to generalize it to v8i8, too?
I'm not sure if that would work as well as v16i8, since I don't think there is a ld.v2.b32 instruction we could use. It would mean having to create two NVPTXISD::LoadV* nodes here and duplicating some code from ReplaceLoadVector.
By the way, I have also tried to do this change in ReplaceLoadVector instead of adding a DAG combine for LOAD nodes. I backtracked as this was creating stack operations. I didn't check again after your recent commit was merged, but maybe that works better now.
I don't think there is a ld.v2.b32 instruction we could use
V2 ld/st variants do exist:
def _v2_avar : NVPTXInst<
The code is easily parametrizable by NumElts.
On second thought, we may be papering over the real problem: LLVM right now ends up generating rather slow code when we need to do anything with these vectors. I've started working on this; I should have an idea within a day or two whether the general improvement in v4i8 lowering would be sufficient to address this particular scenario, too.
Yes, having to keep adding DAG combines to avoid the legalizer creating less efficient code is not ideal. Could we make v4i8 a legal type instead?
That's roughly the idea.
Is it worth separating these changes out into separate patches?
Well, that brought less benefit than I hoped for. Making v4i8 a legal type helps to avoid issues in other areas but does not help much with this particular case. Your changes are still useful and needed.
Considering that these improvements are independent, it may indeed be a good idea to split them into separate patches.
Force-pushed from 012e4cc to 99659f6
Thanks. I have integrated your suggestions into the first commit (preserving v16i8 vector loads).
Here's my PR for making v4i8 legal: #67866. It does improve some cases where we were previously stuck with ld/st.v4.i8 (see unfold-masked-merge-vector-variablemask.ll in the PR above), but, AFAICT, it does not help with your test case.
Update: my PR does help with v8i16, just not in your example, for some reason. In the test case…
Force-pushed from 488fac3 to 3349110
This is done by lowering v16i8 loads into LoadV4 operations with i32 results instead of letting ReplaceLoadVector split them into smaller loads during legalization. This is done at dag-combine1 time, so that vector operations with i8 elements can be optimised away instead of being needlessly split during legalization, which involves storing to the stack and loading it back.
Force-pushed from 3349110 to a52cf9a
Sorry for the delay @Artem-B, I was unexpectedly OoO last week. I have addressed your comments and all the checks are passing after rebasing; could you merge this commit? I don't have commit access to this repo.
I didn't know about v2 variants, thanks. This is something I'd rather revisit in a future PR as this is quite an old patch now and I still need to create a PR for the second part of the original changes once this is merged.
@pasaulais @Artem-B These are all really useful changes for v16i8 - thanks! While experimenting, I found a related scenario that has been missed. While the current trunk ensures ld.v4.u32 is used for 16 x i8 loads, this isn't the case for 16 x i8 stores that don't have a load feeding into them. The simplest test case is below. llc generates four st.u32 here, while in theory st.v4.b32 or st.v4.u32 could have been used. I can contribute a patch to make this efficient as well, with this same approach, if that makes sense.
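A minimal sketch of the kind of test case being described, assuming a plain <16 x i8> store whose value is not produced by a load (the function name and exact shape are illustrative, not the original snippet):

```llvm
; Hypothetical reproducer (illustrative, not the original snippet): a 16 x i8
; store with no vector load feeding it, e.g. compiled with
; llc -mtriple=nvptx64-nvidia-cuda.
define void @store_v16i8(ptr %out, <16 x i8> %v) {
entry:
  store <16 x i8> %v, ptr %out, align 16
  ret void
}
```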
On a related note, the following patch is a much smaller unit test case for this PR:
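A sketch of what such a load-side unit test might look like, assuming each lane of a <16 x i8> load is extracted and widened (illustrative IR, not the actual patch contents):

```llvm
; Illustrative only (not the patch contents): a v16i8 load whose lanes are
; extracted and zero-extended, the pattern this PR keeps as a single vector load.
define i32 @load_v16i8(ptr %in) {
entry:
  %v  = load <16 x i8>, ptr %in, align 16
  %e0 = extractelement <16 x i8> %v, i32 0
  %e1 = extractelement <16 x i8> %v, i32 1
  %z0 = zext i8 %e0 to i32
  %z1 = zext i8 %e1 to i32
  %s  = add i32 %z0, %z1
  ret i32 %s
}
```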
SGTM. Bonus points for covering v8i8, too.
Here it is: #73646
Thanks @bondhugula! It's great to have this for stores too, and a much more concise reproducer.
Hi @Artem-B, I'm looking into filling out some of the related missing handling, specifically for the remaining sub-32-bit vector cases. I'd like to drive this effort, but I want to try to arrive at a cohesive design first. Can I pick your brain on some things? Specifically, it seems that we have two separate mechanisms for lowering sub-32-bit vectors by "widening" the element types to 32-bit, and I want to try to understand why that is the case and whether we can converge on one of them to handle all cases.
Questions/assumptions: …
cc @justinfargnoli, @AlexMaclean, @akrolik for viz.
If you have a specific use case where it's worth the trouble, file an issue with a reproducer on Compiler Explorer and we can take a look at what can be done about it.
I'm afraid there may be no clear-cut answer. NVPTX is a rather odd target which breaks LLVM's assumptions. We have no vector instructions to speak of, nor do we have any vector register types (v2f16/v2bf16 hardly count). We carry 8-bit values in 16-bit registers. On the other hand, we have vector loads/stores... but they can't fully handle v8i8, v8i16 or v16i8 without extra code glue to convert those to b32/b64 and back. Conversions often kill any benefit of vectorizing the wide load/store, so it may not be worth it in the general case, where the rest of the code would use the loaded data in an element-wise fashion.
A clear demonstration of a valid use case and a patch demonstrating improved code generation would be welcome, and we can discuss the details there. Without actually implementing a prototype and seeing how clean/messy it is, it's hard to tell ahead of time what would work best here.
Sounds good. I opened an issue here: #118851; we can continue the discussion there 👍. One note I'll comment on here:
My understanding is that chunks of …
The Load and Store Vectorizer pass combines sequential loads into vector loads, e.g. 16 i8 loads into a v16i8 vector load. Such loads are not legal PTX, but v4i32 (same vector width) is. This resulted in inefficient assembly when every lane in the vector is extracted and extended, because legalization split the vector load into smaller loads. These changes add a new DAG combine operation for v16i8 loads, which replaces the original load with a v4i32 load and a bitcast, so the load can be emitted as a single ld.v4.u32.
v16i8 loads are lowered into LoadV4 operations with i32 results instead of letting ReplaceLoadVector split them into smaller loads during legalization. This is done at dag-combine1 time, so that vector operations with i8 elements can be optimized away instead of being needlessly split during legalization.
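As a rough mental model only (an assumption for illustration; the actual change is a SelectionDAG combine at dag-combine1 time, not an IR-level rewrite), the replacement is conceptually equivalent to loading <4 x i32> and bitcasting back to <16 x i8>:

```llvm
; Conceptual before/after of the combine, shown as IR for illustration;
; the real transformation happens on the SelectionDAG.
;
;   before: %v = load <16 x i8>, ptr %p, align 16
;
define <16 x i8> @conceptual_rewrite(ptr %p) {
  ; after: load four i32 lanes (maps to ld.v4.u32) and bitcast back to v16i8
  %w = load <4 x i32>, ptr %p, align 16
  %v = bitcast <4 x i32> %w to <16 x i8>
  ret <16 x i8> %v
}
```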