Some performance issues with format conversion #13

axeldavy · 2018-06-25T07:36:21Z

I was told on #intel-gfx that there was some interest in having compiler inefficiencies reported here.

I have the following pattern with issues:

SRC_TYPE4 data = vload4(0, ptr_to_mydata);
if (something) {
    some work
}
some work
table_of_float4[k] = convert_float4(data);
some work with table_of_float4[k]

This pattern is in a for loop. k is fixed. All accesses to the __private table_of_float4 have indexes known in advance because all loops over it are unrolled, which allows it to be kept in registers.

If SRC_TYPE4 is uchar4, the conversion to float is moved before "if (something)", so the latency is not well hidden and more registers are consumed. My program is 20% slower than with uint4.

If SRC_TYPE is float4 (no conversion needed), the compiler decides to store table_of_float4 in memory instead of registers, and the program is more than 10 times slower.

I used VTune to analyse the compiler output. I use the Neo driver on Skylake.

The code is intended to be published in the future, but I don't want to publish it here publicly yet. I can send it by mail to developers if requested.

ThomasRaoux · 2018-06-25T08:13:28Z

Yes, performance and functional bug reports are always welcome. I can't think of any reason why the memory isn't promoted to registers if the loop is unrolled.
If you could send me the different versions of the code (email in my profile), it should be easy to figure out what is going on.

axeldavy · 2018-06-25T11:12:45Z

I've sent you sample code by mail.

paigeale · 2018-06-27T00:37:26Z

Hello axeldavy,

I worked with Thomas today and did some analysis on your user kernel. It appears that in the case where "SRC_TYPE is float4" there is a large amount of register pressure, which causes us to spill. When we have register spills, we recompile with loop unrolling turned off (even with pragma unroll). That is why you are still seeing the stores to memory and the loop not unrolled.

So in summary, there is nothing particular about this pattern that is turning off the loop unrolling, it has to do with the overall register pressure of the program.

If you would like to ensure that the loops do get unrolled then the best method for this case would be to force the simdsize using the "intel_reqd_sub_group_size" attribute at the top of your kernel. Let me know if you have any questions or would like some assistance in setting this up.
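
For reference, a minimal sketch of how the attribute is applied, assuming the device exposes the cl_intel_required_subgroup_size extension. The kernel name, its arguments, the table length of 4 and the sub-group size of 8 are placeholders chosen for illustration and are not taken from the reported code. A smaller sub-group size generally leaves more register space per work-item, which is why it can help avoid spills:

```c
// Hypothetical kernel: only the attribute line is the point of this sketch.
// The sub-group size of 8 is an assumed value; the sizes a device supports
// can be queried with CL_DEVICE_SUB_GROUP_SIZES_INTEL.
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void convert_rows(__global const uchar *src, __global float4 *dst)
{
    const size_t gid = get_global_id(0);

    // __private table, indexed only from loops that are meant to be unrolled.
    float4 table_of_float4[4];

    #pragma unroll
    for (int k = 0; k < 4; k++) {
        // vload4 reads 4 uchar values starting at src + 4 * (gid * 4 + k).
        uchar4 data = vload4(gid * 4 + k, src);
        table_of_float4[k] = convert_float4(data);
    }

    float4 acc = (float4)(0.0f);
    #pragma unroll
    for (int k = 0; k < 4; k++)
        acc += table_of_float4[k];

    dst[gid] = acc;
}
```

Whether the loops are then unrolled and the table kept in registers still has to be checked in the generated code, for example with VTune as in the original report.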

Thanks,
Alex Paige

axeldavy · 2018-06-27T06:17:48Z

Sorry, this was a typo in my message, and I meant "SRC_TYPE4 is float4" (thus "SRC_TYPE is float"). The size of the data is thus exactly the same as when SRC_TYPE is uint, which doesn't trigger the spilling.

paigeale · 2018-06-29T00:55:47Z

I am still seeing spilling in the uint case, but less than in the float case. What I realized, though, is that we were not honoring your "#pragma unroll". I have a fix coming in soon for this. Overall, though, unrolling the loop will still increase register pressure and cause spills. Feel free to contact me via email (see profile) for additional support or questions.

axeldavy · 2018-07-01T11:18:08Z

The bug report was made with the Intel compute-runtime release 18.24.10921. I retested with the recently released 18.25.10965. It seems that this new version fixes several issues; in particular, the spilling of table_of_float4 no longer happens if SRC_TYPE4 is float4.
Instead, mov operations are inserted just after the memory load to move the loaded registers into other ones, similar to the pattern described for when SRC_TYPE4 is uchar4.

paigeale · 2018-07-20T17:32:10Z

Hello Axel. If the new runtime release fixed your issues, can this be closed?

axeldavy · 2018-07-20T17:44:16Z

The part of the bug about format conversion occurring just after the memory load instead of just before the data is needed is not solved, to my knowledge. I don't remember whether any of the code I sent reproduces this issue. I will send new code to highlight the issue.

paigeale · 2018-08-27T18:44:19Z

Hello Axel. I am not sure I have received a new reproducer for this issue. Could you please send it if this is still an issue?

axeldavy · 2018-08-27T20:58:33Z

I sent it to your colleague, I will forward it to you.

EDIT: no, I had it mixed up with the other bug report. I did send you the new reproducer a few weeks ago. I am sending it again.

iwwu · 2019-01-09T19:15:23Z

Hi Axel, it has been a while since the issue was last updated. Can you comment on the issue with the latest code?

iwwu · 2019-01-31T19:44:27Z

Haven't heard anything recently regarding this issue. Please re-open if you still need help.

iwwu closed this as completed Jan 31, 2019

VPG-SWE-Github pushed a commit that referenced this issue Nov 2, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

799072b

…on) (#13) Co-authored-by: Dmitry Ryabtsev <[email protected]>

VPG-SWE-Github pushed a commit that referenced this issue Nov 3, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

3dccbad

…on) (#13) Co-authored-by: Dmitry Ryabtsev <[email protected]>

VPG-SWE-Github pushed a commit that referenced this issue Nov 3, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

62f8e86

…on) (#13)

VPG-SWE-Github pushed a commit that referenced this issue Nov 3, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

072db34

…on) (#13)

VPG-SWE-Github pushed a commit that referenced this issue Nov 3, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

a468754

…on) (#13)

VPG-SWE-Github pushed a commit that referenced this issue Nov 5, 2020

update calling conv only for imported from BiF funcs in VC (2nd editi…

c33d529

…on) (#13)

haonanya mentioned this issue Jan 25, 2022

Compiler crashing on my OpenCL kernel #207

Closed
