
Missing load regrouping optimization when pointer is modified #14

Closed
axeldavy opened this issue Jul 3, 2018 · 7 comments

@axeldavy

axeldavy commented Jul 3, 2018

I have the following code

__kernel void compute_nearest_neighboors_naive(__global float * restrict dst,
                                               __global const SRC_TYPE * restrict src,
                                               int w,
                                               int h,
                                               int items_row,
                                               int items_img,
                                               int src_offset,
                                               int dst_offset)
{
    ...
    __global const SRC_TYPE * restrict src_shifted = src + src_offset + dz * items_img;
    ...
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        float data = src_shifted[position+i];
        ... // something with data
    }
    ...
}

Ideally, unrolling should allow the eight loads to be merged into two RGBA send operations.
Unfortunately, with the above code this doesn't happen: the unrolled code issues 8 separate send operations.

However, if the data is loaded with the following line:
float data = src[src_offset + dz * items_img + position+i];

Then the optimization occurs and performance is much better.

Expected behaviour: both versions of the code should trigger the optimization.

I can send code if requested, but I guess this issue should be reproducible with a small kernel and you may want to write such a kernel for your regression tests anyway.
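
For reference, a stripped-down kernel along these lines (just a sketch with made-up names, SRC_TYPE taken to be float; not the exact test I mailed) should show the same behaviour:

__kernel void repro(__global float * restrict dst,
                    __global const float * restrict src,
                    int src_offset,
                    int items_img)
{
    int gid = get_global_id(0);
    int dz = get_global_id(1);
    int position = gid * 8;
    __global const float * restrict src_shifted = src + src_offset + dz * items_img;

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        /* Not merged: the unrolled code keeps 8 separate send operations. */
        float data = src_shifted[position + i];
        /* Merged into two RGBA sends if written instead as:
           float data = src[src_offset + dz * items_img + position + i]; */
        acc += data;
    }
    dst[dz * (int)get_global_size(0) + gid] = acc;
}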

I am using release 18.26.10987.

@paigeale
Contributor

paigeale commented Jul 3, 2018

Hello Axel. Could you please provide a reproducer? With this snippet of code we are seeing 8 reads in both instances.

@axeldavy
Author

axeldavy commented Jul 3, 2018

Ok, I've just sent you a test by mail.

@paigeale
Contributor

Hi Axel. I am able to reproduce the issue that you have reported. I can see the loads are not being merged. I will notify you when we have a fix.

@paigeale
Contributor

Hello Axel. As an update on this issue, our team has come up with a couple of proposed solutions to handle this case. Basically it boils down to the following example.

%add = add nsw i32 %mul10, %mul
....
%idxprom12 = sext i32 %add to i64
%add.ptr.sum = add nsw i64 %idx.ext, %idxprom12
%arrayidx13 = getelementptr inbounds i32, i32 addrspace(1)* %src, i64 %add.ptr.sum
%0 = load i32, i32 addrspace(1)* %arrayidx13, align 4, !tbaa !121

....
....
....
%add11.1 = add nsw i32 %add, 1
%idxprom12.1 = sext i32 %add11.1 to i64
%add.ptr.sum.1 = add nsw i64 %idx.ext, %idxprom12.1
%arrayidx13.1 = getelementptr inbounds i32, i32 addrspace(1)* %src, i64 %add.ptr.sum.1
%6 = load i32, i32 addrspace(1)* %arrayidx13.1, align 4, !tbaa !121

The tricky part is that the offset is embedded behind an add and a sext. Traditionally we use Scalar Evolution to handle cases like these, but here SCEV cannot bring out the offset. One of our proposed solutions was a transformation that brings the constant int (1) closer to the gep, but that required some costly i64 promotions. We are still working on finding a viable solution that does not have a potential performance impact.
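
For what it's worth, here is a small illustration (a sketch of the underlying difficulty, not of our actual fix) of why the sext is a barrier: sign extension does not distribute over a 32-bit add that might wrap, so rewriting sext(x + 1) as sext(x) + 1 is only legal when the add is known not to overflow, which is what the nsw flags above guarantee.

__kernel void sext_fold_example(__global long *out)
{
    /* Sketch only: compare a wrapping 32-bit add followed by sign
       extension with the corresponding 64-bit add. */
    int x = INT_MAX;                       /* 0x7fffffff */
    out[0] = (long)as_int((uint)x + 1u);   /* 32-bit add wraps first: -2147483648 */
    out[1] = (long)x + 1;                  /* 64-bit add:              2147483648 */
    /* out[0] != out[1], which is why the constant can only be hoisted past
       the sext when the 32-bit add carries a no-overflow (nsw) guarantee. */
}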

@axeldavy
Author

I see, so it all comes down to the fact that the pointers are 64 bits and the offsets are 32 bits.

Unfortunately, there is a more general issue with mixing the two.
On all three vendors, the first of the following two snippets is slower in a loop than the second:

__global float *src;
...
a = src[position];

__global float *src;
...
a = *((__global float *)((__global uchar *)src + 4*position));

I suspect that in the former case position has to be promoted to int64 and then multiplied by 4, whereas in the second case 4*position stays in int32, which enables loop optimizations (replacing the multiplication with a counter and additions).
I'm surprised there aren't more warnings about this in OpenCL guides.
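
As a hypothetical sketch of the loop pattern I mean (not the actual kernel), the byte-offset form lets the compiler keep a single 32-bit counter instead of redoing the 64-bit index arithmetic on every access:

__kernel void sum_row(__global const float *src, __global float *dst, int w)
{
    int row = get_global_id(0);
    float acc = 0.0f;

    /* Indexed form would be acc += src[row * w + i], i.e. a sign extension
       to 64 bits plus a scaling per access. Byte-offset form: one 32-bit
       offset, bumped by 4 each iteration. */
    int offset = 4 * (row * w);
    for (int i = 0; i < w; i++) {
        acc += *((__global const float *)((__global const uchar *)src + offset));
        offset += 4;
    }
    dst[row] = acc;
}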

In the case of this bug report though, even with 64-bit pointers and 32-bit ints, the loads should still be merged.

@paigeale
Contributor

Hello Axel, please see commit f4c49be. This should fix the issue and merge the loads that are off the same base.

@axeldavy
Author

I confirm this is fixed. Thanks!
