Missing load regrouping optimization when pointer is modified #14
Hello Axel. Could you please provide a reproducer? With this snippet of code we are seeing 8 reads in both instances.
Ok, I've just sent you a test by mail.
Hi Axel. I am able to reproduce the issue that you have reported. I can see the loads are not being merged. I will notify you when we have a fix.
Hello Axel. For an update on this issue, our team has come up with a couple of different proposed solutions to handle this case. Basically it boils down to the following example: `%add = add nsw i32 %mul10, %mul` .... The tricky part is that the offset is embedded behind an add and a sext. Traditionally we use Scalar Evolution to handle cases like these, but here SCEV cannot bring out the offset. One of our proposed solutions was to do a transformation that brings the constant int (1) closer to the gep, but that required some costly i64 promotions. We are still working on finding a viable solution that does not have a potential performance impact.
I see, so it all comes down to the fact that the pointers are 64 bits and the offsets 32 bits. Unfortunately there is a more general issue about mixing the two.
I suspect that in the former case, position has to be promoted to int64 and then multiplied by 4, whereas in the latter case 4*position is int32, which enables optimizations in loops (replacing multiplications with counters and additions). In the case of this bug report though, even with 64-bit pointers vs 32-bit ints, the loads should be merged.
Hello Axel, please see commit id f4c49be. This should fix the issue and merge the loads that are off of the same base.
I confirm this is fixed. Thanks!
I have the following code:
Ideally, unrolling should cause the send operations to merge into two RGBA send operations.
Unfortunately, with the above code this doesn't happen: the unrolled code has 8 send operations.
However if data is loaded with the following line:
float data = src[src_offset + dz * items_img + position+i];
Then the optimization occurs and performance is much greater.
Expected behaviour: both versions of the code should trigger the optimization.
I can send code if requested, but I guess this issue should be reproducible with a small kernel and you may want to write such a kernel for your regression tests anyway.
I am using release 18.26.10987.