Some performance issues with format conversion #13

Closed
axeldavy opened this issue Jun 25, 2018 · 12 comments
Comments

@axeldavy

I was told on #intel-gfx there was some interest reporting compiler inefficiencies here.

I have the following pattern with issues:

SRC_TYPE4 data = vload4(0, ptr_to_mydata);
if (something) {
    /* some work */
}
/* some work */
table_of_float4[k] = convert_float4(data);
/* some work with table_of_float4[k] */

This pattern is inside a for loop, and k is fixed. All accesses to the __private table_of_float4 have indexes known in advance because every loop over it is unrolled, which should allow the table to be kept in registers.

If SRC_TYPE4 is uchar4, the conversion to float is moved before "if (something)", thus the latency is not well hidden and more registers are consumed. My program performance is 20% slower than with uint4.

If SRC_TYPE is float4 (no conversion needed), the compiler decides to store table_of_float4 in memory instead of registers, and the program is more than 10 times slower.

I used VTune to analyse the compiler output. I use the Neo driver on Skylake.

The code is intended to be published in the future, but I don't want to publish it here publicly yet. I can send it by email to developers on request.
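
For readers without access to the private reproducer, one possible reading of the pattern described above is sketched below in OpenCL C. This is not the reporter's code: the function name `sum_converted`, the table size `N`, the `use_bias` flag, and the arithmetic standing in for "some work" are all hypothetical, and SRC_TYPE4 is assumed to be uchar4. The `#ifndef __OPENCL_VERSION__` section only provides plain-C stand-ins for OpenCL built-ins so the sketch can also be checked with a host compiler.

```c
/* Sketch only: NOT the reporter's kernel. All names and arithmetic are hypothetical. */
#ifndef __OPENCL_VERSION__
/* Plain-C stand-ins for OpenCL C built-ins (host-side checking only). */
#include <stddef.h>
typedef unsigned char uchar;
typedef struct { uchar x, y, z, w; } uchar4;
typedef struct { float x, y, z, w; } float4;
#define __global
static uchar4 vload4(size_t i, const uchar *p) {
    uchar4 v = { p[4 * i], p[4 * i + 1], p[4 * i + 2], p[4 * i + 3] };
    return v;
}
static float4 convert_float4(uchar4 v) {
    float4 f = { (float)v.x, (float)v.y, (float)v.z, (float)v.w };
    return f;
}
#endif

#define N 4 /* hypothetical table size */

float sum_converted(__global const uchar *src, int use_bias)
{
    float4 table_of_float4[N]; /* private array, intended to stay in registers */
    float acc = 0.0f;

    #pragma unroll
    for (int k = 0; k < N; k++) {
        uchar4 data = vload4(k, src); /* memory load issued early */

        if (use_bias) {
            acc += 1.0f;              /* "some work" that does not need data */
        }
        acc *= 0.5f;                  /* more work that does not need data */

        /* The conversion is only needed here, after the independent work. */
        table_of_float4[k] = convert_float4(data);

        /* "some work with table_of_float4[k]" */
        acc += table_of_float4[k].x + table_of_float4[k].y
             + table_of_float4[k].z + table_of_float4[k].w;
    }
    return acc;
}
```

In this sketch the source order keeps the conversion next to its first use, so the work between the load and the conversion can overlap with the load latency. The report is that the compiler hoists the conversion to just after the load when the source type is uchar4.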

@ThomasRaoux
Contributor

Yes, performance and functional bug reports are always welcome. I can't think of any reason why the memory isn't promoted to registers if the loop is unrolled.
If you could send me the different versions of the code (email in my profile) it should be easy to figure out what is going on.

@axeldavy
Author

I've sent you some sample code by email.

@paigeale
Contributor

Hello axeldavy,

I worked with Thomas today and did some analysis on your user kernel. It appears that in the case where "SRC_TYPE is float4" there is a large amount of register pressure, which causes us to spill. When we have register spills, we recompile with loop unrolling turned off (even with pragma unroll). That is why you are still seeing the stores to memory and the loop not unrolled.

So in summary, there is nothing particular about this pattern that is turning off the loop unrolling, it has to do with the overall register pressure of the program.

If you would like to ensure that the loops do get unrolled then the best method for this case would be to force the simdsize using the "intel_reqd_sub_group_size" attribute at the top of your kernel. Let me know if you have any questions or would like some assistance in setting this up.

Thanks,
Alex Paige
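
As a rough illustration of the suggestion above, the attribute from the cl_intel_required_subgroup_size extension is placed in front of the kernel definition. The kernel name `convert_to_float`, its body, and the value 8 are assumptions made for this sketch; the thread does not say which size to use, and the sizes a device accepts have to be queried from the runtime. A smaller sub-group size generally leaves more register space per work-item, which is why it may reduce spilling. The `#ifndef __OPENCL_VERSION__` section is only a plain-C stand-in for host-side checking.

```c
/* Sketch only: hypothetical kernel showing where the attribute goes. */
#ifndef __OPENCL_VERSION__
/* Plain-C stand-ins for OpenCL C built-ins (host-side checking only). */
#include <stddef.h>
typedef unsigned char uchar;
#define __kernel
#define __global
#define intel_reqd_sub_group_size(n) /* expands to nothing on the host */
static size_t host_gid;              /* pretend work-item id */
static size_t get_global_id(unsigned dim) { (void)dim; return host_gid; }
#endif

/* Ask the compiler to build this kernel for a sub-group (SIMD) size of 8.
 * The value 8 is an assumption; pick one the device reports as supported. */
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void convert_to_float(__global const uchar *src, __global float *dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = (float)src[gid]; /* one element per work-item */
}
```

Whether this avoids the spill for a given kernel depends on its overall register pressure, so the compiler output should be checked again after the change.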

@axeldavy
Author

Sorry, this was a typo in my message: I meant "SRC_TYPE4 is float4" (thus "SRC_TYPE is float"). The size of the data is thus exactly the same as when SRC_TYPE is uint, which doesn't trigger the spilling.

@paigeale
Contributor

So I am still seeing the spilling in the uint case, but it is less than in the float case. What I realized, though, is that we were not honoring your "#pragma unroll". I have a fix coming in soon for this. Overall, though, unrolling the loop is still going to increase register pressure and cause spills. Feel free to contact me via email (see profile) for additional support or questions.

@axeldavy
Author

axeldavy commented Jul 1, 2018

The bug report was made with the Intel compute-runtime release 18.24.10921. I retested with the recently released 18.25.10965. This new version seems to fix several issues; in particular, the spilling of table_of_float4 no longer happens if SRC_TYPE4 is float4.
Instead, mov operations are inserted just after the memory load to move the loaded registers into other ones, similar to the pattern described for when SRC_TYPE4 is uchar4.

@paigeale
Contributor

Hello Axel. If the new runtime release fixed your issues, can this be closed?

@axeldavy
Author

The part of the bug about the format conversion occurring just after the memory load, instead of just before the data is needed, is not solved to my knowledge. I don't remember whether any of the code I sent reproduces this issue. I will send new code to highlight it.

@paigeale
Contributor

Hello Axel. I am not sure I have received a new reproducer for this issue. Could you please send it if this is still an issue?

@axeldavy
Author

axeldavy commented Aug 27, 2018

I sent it to your colleague; I will forward it to you.

EDIT: no, I was confusing this with the other bug report. I did send you the new reproducer a few weeks ago. I am sending it again.

@iwwu
Contributor

iwwu commented Jan 9, 2019

Hi Axel, it has been a while since the issue was last updated. Can you comment on the issue with the latest code?

@iwwu
Contributor

iwwu commented Jan 31, 2019

Haven't heard anything recently regarding this issue. Please re-open if you still need help.


4 participants