[NVPTX] Lower 16xi8 and 8xi8 stores efficiently #73646
Nice. Legalizer assuming that stack loads/stores are cheap is indeed a rather bad misoptimization for NVPTX.
Note that this comment might be out of date, as it looks copied from PerformLOADCombine, which was written before stack optimizations were done.
Lower 16xi8 vector stores in NVPTX ISel efficiently using st.v4.b32 instead of multiple st.v4.u8, along the lines of vector loads and 8xf16. Similarly, lower 8xi8 stores using st.v2.u32.
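To illustrate what the packed store carries, here is a minimal host-side sketch of packing i8 lanes into 32-bit words. It assumes lane 0 lands in the low byte of the first word (little-endian lane order); `packI8Lanes` is a hypothetical helper written for this illustration, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LLVM): pack `n` i8 lanes into n/4 32-bit
// words, with lane 0 in the low byte of word 0 (assumed little-endian order).
// 16 lanes yield the four words of a v4.b32 store; 8 lanes yield two words.
std::vector<uint32_t> packI8Lanes(const uint8_t *lanes, std::size_t n) {
  assert(n % 4 == 0 && "lane count must be a multiple of 4");
  std::vector<uint32_t> words(n / 4, 0u);
  for (std::size_t i = 0; i < n; ++i) {
    // Lane i goes into word i/4 at byte position i%4.
    words[i / 4] |= static_cast<uint32_t>(lanes[i]) << (8 * (i % 4));
  }
  return words;
}
```

With lanes 0 through 15, the first word would be 0x03020100 and the fourth 0x0F0E0D0C, so one four-word store can replace several byte-wise vector stores.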
This seems to have injected failures into Halide codegen; we are now getting runtime errors of the form
; CHECK-LABEL: .visible .func v8i8_store
define void @v8i8_store(ptr %a, <8 x i8> %v) {
; CHECK: st.v2.u32
  store <8 x i8> %v, ptr %a
This is only correct if the pointer is aligned to a 4-byte boundary (IIUC), but AFAIK nothing in the IR to this point promises that alignment.
You're right. Loads/stores that use larger types must be aligned appropriately.
We do use allowsMemoryAccessForAlignment in other places.
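The alignment constraint discussed here can be sketched as a standalone decision function. This is an illustration only: `chooseStoreLowering` and `StoreLowering` are hypothetical names, and the rule that the known alignment must cover the full access size is a conservative assumption of this sketch (the thread itself only mentions a 4-byte boundary). It does not reproduce the actual allowsMemoryAccessForAlignment check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch (not an LLVM type).
enum class StoreLowering { PackedV2B32, PackedV4B32, PerByte };

// Hypothetical helper: pick a lowering for a store of `numI8Lanes` i8 lanes
// given the alignment (in bytes) that the IR guarantees for the pointer.
StoreLowering chooseStoreLowering(unsigned numI8Lanes, uint64_t knownAlignBytes) {
  assert(knownAlignBytes != 0 &&
         (knownAlignBytes & (knownAlignBytes - 1)) == 0 &&
         "alignment must be a power of two");
  // Assumption: the packed store needs alignment equal to the whole access
  // (one byte per lane). Anything weaker falls back to byte-wise stores.
  const uint64_t accessBytes = numI8Lanes;
  if (knownAlignBytes < accessBytes)
    return StoreLowering::PerByte;
  if (numI8Lanes == 16)
    return StoreLowering::PackedV4B32;
  if (numI8Lanes == 8)
    return StoreLowering::PackedV2B32;
  return StoreLowering::PerByte;
}
```

Under these assumptions, an 8xi8 store with only 4-byte known alignment keeps the byte-wise path, which is the kind of case the Halide failures point to.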
In that case, we should revert it if a fix-forward is not imminent (this is breaking all of Halide's Cuda tests).
This reverts commit 173fcf7. Needs to constrain the optimization to properly aligned loads/stores only. llvm#73646 (comment)
…4518) This reverts commit 173fcf7. We need to constrain the optimization to properly aligned loads/stores only. #73646 (comment)
pasaulais
left a comment
LGTM once the alignment issue is addressed