Some performance issues with format conversion #13
Yes, performance and functional bug reports are always welcome. I can't think of any reason why the memory isn't promoted to registers if the loop is unrolled.
I've sent you a code sample by mail.
Hello axeldavy, I worked with Thomas today and did some analysis on your user kernel. It appears that in the case where "SRC_TYPE is float4" there is a large amount of register pressure, which causes us to spill. When we have register spills, we recompile with loop unrolling turned off (even with pragma unroll). That is why you are still seeing the stores to memory and the loop not unrolled. In summary, nothing particular about this pattern turns off the loop unrolling; it comes from the overall register pressure of the program. If you would like to ensure that the loops do get unrolled, the best method in this case would be to force the SIMD size using the "intel_reqd_sub_group_size" attribute at the top of your kernel. Let me know if you have any questions or would like some assistance in setting this up. Thanks.
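For readers following along, the attribute mentioned above could be applied roughly as follows. This is only a sketch: the kernel name, its arguments, TABLE_SIZE and the choice of a sub-group size of 8 are assumptions made for illustration, not taken from the reporter's code. It also assumes the device exposes the cl_intel_required_subgroup_size extension.

```c
// Sketch only: kernel name, arguments and TABLE_SIZE are hypothetical.
#define TABLE_SIZE 4

// Ask the compiler to build this kernel with a sub-group (SIMD) size of 8.
__attribute__((intel_reqd_sub_group_size(8)))
__kernel void sum_table(__global const float *src, __global float *dst)
{
    const size_t gid = get_global_id(0);

    // Private array; every index is a compile-time constant once the
    // loops below are unrolled, so it is a candidate for registers.
    float4 table_of_float4[TABLE_SIZE];

    #pragma unroll
    for (int k = 0; k < TABLE_SIZE; k++)
        table_of_float4[k] = vload4(gid * TABLE_SIZE + k, src);

    float4 acc = (float4)(0.0f);
    #pragma unroll
    for (int k = 0; k < TABLE_SIZE; k++)
        acc += table_of_float4[k];

    vstore4(acc, gid, dst);
}
```

A narrower SIMD width generally leaves more registers available per work-item, which is presumably why it is suggested here as a way to avoid the spill-triggered recompile. Whether 8 is the right value for a given kernel would have to be measured.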
Sorry, this was a typo in my message, and I meant "SRC_TYPE4 is float4" (thus "SRC_TYPE is float"). The size of the data is thus exactly the same as when SRC_TYPE is uint, which doesn't trigger the spilling.
So I am still seeing the spilling in the uint case, but it is less than in the float case. What I realized, though, is that we were not honoring your "#pragma unroll". I have a fix coming in soon for this. Overall, though, unrolling the loop will still increase register pressure and cause spills. Feel free to contact me via email (see profile) for additional support or questions.
The bug report was made with the Intel compute-runtime release 18.24.10921. I retested with the recently released 18.25.10965. It seems that this new version fixes several issues; in particular, the spilling of table_of_float4 doesn't happen anymore if SRC_TYPE4 is float4.
Hello Axel. If the new runtime release fixed your issues, can this be closed?
The part of the bug about format conversion occurring just after the memory load, instead of just before the data is needed, is not solved to my knowledge. I don't remember whether any of the code I sent reproduces this issue. I will send new code to highlight it.
Hello Axel. I am not sure I have received a new reproducer for this issue. Could you please send it if this is still an issue?
I sent it to your colleague; I will forward it to you. EDIT: no, I was confusing this with the other bug report. I did send you the new reproducer a few weeks ago. I am sending it again.
Hi Axel, it has been a while since the issue was last updated. Can you comment on the issue with the latest code?
Haven't heard anything recently regarding this issue. Please re-open if you still need help.
I was told on #intel-gfx there was some interest reporting compiler inefficiencies here.
I have the following pattern with issues:
SRC_TYPE4 data = vload4(0, ptr_to_mydata);
if (something) {
some work
}
some work
table_of_float4[k] = convert_float4(data);
some work with table_of_float4[k]
This pattern is in a for loop. k is fixed. All accesses to the __private table_of_float4 have their indexes known in advance because all loops over it are unrolled, which allows it to be stored in registers.
If SRC_TYPE4 is uchar4, the conversion to float is moved before "if (something)", so the latency is not well hidden and more registers are consumed. My program is 20% slower than with uint4.
If SRC_TYPE is float4 (no conversion needed), the compiler decides to store table_of_float4 in memory instead of registers, and the program is more than 10 times slower.
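To make the pattern concrete, here is a minimal sketch of how it could look as a complete kernel. The kernel name, arguments, TABLE_SIZE, K, the macro used to derive SRC_TYPE4 from SRC_TYPE, and the arithmetic standing in for "some work" are all assumptions for illustration; this is not the actual code.

```c
// Sketch only: everything except SRC_TYPE, SRC_TYPE4, table_of_float4,
// vload4 and convert_float4 is hypothetical.
// Build with e.g. -DSRC_TYPE=uchar, -DSRC_TYPE=uint or -DSRC_TYPE=float.
#ifndef SRC_TYPE
#define SRC_TYPE uchar
#endif
#define CAT_(a, b) a##b
#define CAT(a, b) CAT_(a, b)
#define SRC_TYPE4 CAT(SRC_TYPE, 4)   // uchar -> uchar4, float -> float4, ...

#define TABLE_SIZE 4
#define K 0                          // fixed index into the table

__kernel void accumulate(__global const SRC_TYPE *src,
                         __global float *dst,
                         const int n_iter,
                         const int something)
{
    const size_t gid = get_global_id(0);
    float4 table_of_float4[TABLE_SIZE];

    #pragma unroll
    for (int k = 0; k < TABLE_SIZE; k++)
        table_of_float4[k] = (float4)(0.0f);

    float4 acc = (float4)(0.0f);
    for (int i = 0; i < n_iter; i++) {
        // Load early so the memory latency can overlap with the work below.
        SRC_TYPE4 data = vload4(gid * n_iter + i, src);

        float4 scale = (float4)(1.0f);
        if (something)
            scale = (float4)(0.5f);  // stands in for "some work"

        // Convert late, right before the value is needed.
        table_of_float4[K] = convert_float4(data);
        acc += table_of_float4[K] * scale;
    }

    vstore4(acc, gid, dst);
}
```

In this form the intent is that the load is issued early and the conversion happens only at the point of use; the report is that the compiler hoists the conversion up next to the load instead.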
I used VTune to analyse the compiler output. I use the Neo driver on Skylake.
The code is intended to be published in the future, but I don't want to publish it here publicly yet. I can send it by mail to developers if requested.