
Missing load regrouping optimization when pointer is modified #14

Closed
axeldavy opened this issue Jul 3, 2018 · 7 comments

@axeldavy

axeldavy commented Jul 3, 2018

I have the following code

__kernel void compute_nearest_neighboors_naive(__global float * restrict dst,
                                               __global const SRC_TYPE * restrict src,
                                               int w,
                                               int h,
                                               int items_row,
                                               int items_img,
                                               int src_offset,
                                               int dst_offset)
{
    ...
    __global const SRC_TYPE * restrict src_shifted = src + src_offset + dz * items_img;
    ...
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        float data = src_shifted[position+i];
        ... // something with data
    }
    ...
}

Ideally, unrolling should allow the eight loads to be merged into two RGBA send operations.
Unfortunately, with the above code this doesn't happen: the unrolled code issues 8 separate send operations.

However, if the data is loaded with the following line:
float data = src[src_offset + dz * items_img + position+i];

Then the optimization occurs and performance is much better.

Expected behaviour: both versions of the code should trigger the optimization.

I can send code if requested, but I guess this issue should be reproducible with a small kernel and you may want to write such a kernel for your regression tests anyway.
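
For reference, a stripped-down kernel along these lines (just a sketch with made-up names, SRC_TYPE taken to be float; not the exact test I mailed) should show the same behaviour:

__kernel void repro(__global float * restrict dst,
                    __global const float * restrict src,
                    int src_offset,
                    int items_img)
{
    int gid = get_global_id(0);
    int dz = get_global_id(1);
    int position = gid * 8;
    __global const float * restrict src_shifted = src + src_offset + dz * items_img;

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < 8; i++) {
        /* Not merged: the unrolled code keeps 8 separate send operations. */
        float data = src_shifted[position + i];
        /* Merged into two RGBA sends if written instead as:
           float data = src[src_offset + dz * items_img + position + i]; */
        acc += data;
    }
    dst[dz * (int)get_global_size(0) + gid] = acc;
}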

I am using release 18.26.10987.

@paigeale
Contributor

paigeale commented Jul 3, 2018

Hello Axel. Could you please provide a reproducer? With this snippet of code we are seeing 8 reads in both instances.

@axeldavy
Author

axeldavy commented Jul 3, 2018

Ok, I've just sent you a test by mail.

@paigeale
Contributor

Hi Axel. I am able to reproduce the issue that you have reported. I can see the loads are not being merged. I will notify you when we have a fix.

@paigeale
Contributor

Hello Axel. As an update on this issue, our team has come up with a couple of proposed solutions to handle this case. Basically it boils down to the following example.

%add = add nsw i32 %mul10, %mul
....
%idxprom12 = sext i32 %add to i64
%add.ptr.sum = add nsw i64 %idx.ext, %idxprom12
%arrayidx13 = getelementptr inbounds i32, i32 addrspace(1)* %src, i64 %add.ptr.sum
%0 = load i32, i32 addrspace(1)* %arrayidx13, align 4, !tbaa !121

....
....
....
%add11.1 = add nsw i32 %add, 1
%idxprom12.1 = sext i32 %add11.1 to i64
%add.ptr.sum.1 = add nsw i64 %idx.ext, %idxprom12.1
%arrayidx13.1 = getelementptr inbounds i32, i32 addrspace(1)* %src, i64 %add.ptr.sum.1
%6 = load i32, i32 addrspace(1)* %arrayidx13.1, align 4, !tbaa !121

The tricky part is that the offset is embedded behind an add and a sext. Traditionally we use Scalar Evolution to handle cases like these, but here SCEV cannot bring out the offset. One of our proposed solutions was a transformation that brings the constant int (1) closer to the gep, but that required some costly i64 promotions. We are still working on finding a viable solution that does not have a potential performance impact.
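
For what it's worth, here is a small illustration (a sketch of the underlying difficulty, not of our actual fix) of why the sext is a barrier: sign extension does not distribute over a 32-bit add that might wrap, so rewriting sext(x + 1) as sext(x) + 1 is only legal when the add is known not to overflow, which is what the nsw flags above guarantee.

__kernel void sext_fold_example(__global long *out)
{
    /* Sketch only: compare a wrapping 32-bit add followed by sign
       extension with the corresponding 64-bit add. */
    int x = INT_MAX;                       /* 0x7fffffff */
    out[0] = (long)as_int((uint)x + 1u);   /* 32-bit add wraps first: -2147483648 */
    out[1] = (long)x + 1;                  /* 64-bit add:              2147483648 */
    /* out[0] != out[1], which is why the constant can only be hoisted past
       the sext when the 32-bit add carries a no-overflow (nsw) guarantee. */
}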

@axeldavy
Author

I see, so it all comes down to the fact that the pointers are 64 bits and the offsets are 32 bits.

Unfortunately, there is a more general issue with mixing the two.
On all three vendors, the first of the following two snippets is slower in a loop than the second:

__global float *src;
...
a = src[position];

__global float *src;
...
a = *((__global float *)((__global uchar *)src + 4*position));

I suspect that in the former case position has to be promoted to int64 and then multiplied by 4, whereas in the second case 4*position stays in int32, which enables loop optimizations (replacing the multiplication with a counter and additions).
I'm surprised there aren't more warnings about this in OpenCL guides.
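
As a hypothetical sketch of the loop pattern I mean (not the actual kernel), the byte-offset form lets the compiler keep a single 32-bit counter instead of redoing the 64-bit index arithmetic on every access:

__kernel void sum_row(__global const float *src, __global float *dst, int w)
{
    int row = get_global_id(0);
    float acc = 0.0f;

    /* Indexed form would be acc += src[row * w + i], i.e. a sign extension
       to 64 bits plus a scaling per access. Byte-offset form: one 32-bit
       offset, bumped by 4 each iteration. */
    int offset = 4 * (row * w);
    for (int i = 0; i < w; i++) {
        acc += *((__global const float *)((__global const uchar *)src + offset));
        offset += 4;
    }
    dst[row] = acc;
}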

In the case of this bug report though, even with 64-bit pointers and 32-bit ints, the loads should still be merged.

@paigeale
Contributor

Hello Axel, please see commit f4c49be. This should fix the issue and merge the loads that are off the same base.

@axeldavy
Author

I confirm this is fixed. Thanks!
