Over generation of 'mov' instruction for some kernels #18

Closed
axeldavy opened this issue Jul 16, 2018 · 9 comments

Comments

@axeldavy

Hi,

For some kernels, many unnecessary mov operations are generated (usually after an if conditional). These movs could be avoided by writing directly to the correct registers in the if conditional, or an equivalent approach.
I noticed that specifying '-cl-std=CL2.0' helps work around the issue in some cases, but not in all.
I've sent an example by mail to Alexander Paige, but I am opening this bug report to keep track of the issue.

@paigeale
Contributor

Thanks Axel for posting this. I will have someone on our team get to this shortly.

@iwwu
Contributor

iwwu commented Aug 8, 2018

Hi Axel,

I have managed to generate the assembly file from your code.cl file.

  • My .asm file visually matches your .csv file. I specifically compared blocks 1 to 17, and they match your description. Blocks 14, 15, and 16 consist mostly of movs.
  • I also confirmed that -cl-std=CL2.0 doesn't alter the assembly output. You mentioned that CL2.0 worked around the issue for some kernels, but it does not for this sample code.

I'll continue the investigation and keep you posted.

@iwwu
Contributor

iwwu commented Aug 15, 2018

Comparing the 'mov' instructions with the 'phi' operations in the LLVM file does not conclusively show that excessive 'mov' instructions are generated. As an experiment, I disabled loop unrolling in IGC, and the number of 'mov's dropped significantly, as expected.

@axeldavy
Author

If I compare the csv I sent you and the code, I can guess which block corresponds to what.

In pseudo code:
1: loop on dx
2: if{}
3: loop on dy
4: if{} with 19 unavoidable mov operations
5: barrier
6: if{}
7: barrier
8: if {if {}}
9: end of loop dy
10: end of loop dx

On the generated csv:
1: block 9 (for the init) then 10
2: blocks 11, 12, 13
3: block 14 (for the init) then 15
4: end of block 15, blocks 16, 17, 18, 19
5: middle of block 19
6: end of block 19, block 20 and block 21
7: block 22
8: blocks 23, 24, 25 and 26
9: blocks 27 28 and 29
10: blocks 30 31 and 32

I counted:
18 movs: block 19
19 movs: blocks 10, 11, 12, 15, 16, 28, 31
20 movs: block 17
21 movs: block 14

Thus the 19 items that are shifted inside a table (the 19 unavoidable movs) are somehow moved into many different registers in all these blocks. Only the movs in block 17 seem necessary; all the others seem avoidable.
If you disable loop unrolling, the table will no longer be stored in registers (it is declared private), which removes the movs, but for performance it should be stored in registers.
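
As a rough illustration of the pattern described above, here is a minimal sketch in plain C (not OpenCL C, so it can be compiled standalone). It is not the actual kernel, which was shared by mail; the names `shift_window` and `scan_column`, the threshold condition, and the final sum are all hypothetical. It only shows a small private table whose entries are shifted inside an if within the inner loop, which is the part that should turn into roughly one register move per slot once the loop is unrolled.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 19 /* the thread mentions 19 items shifted inside a table */

/* Shift the table left by one slot and append `next` at the end.
 * If the table lives in registers, each assignment is one register move. */
static void shift_window(float window[WINDOW], float next)
{
    for (size_t i = 0; i + 1 < WINDOW; ++i)
        window[i] = window[i + 1];
    window[WINDOW - 1] = next;
}

/* Hypothetical stand-in for the "loop on dy" with its conditional shift.
 * Returns the sum of the table so the result is observable. */
static float scan_column(const float *samples, size_t count, float threshold)
{
    float window[WINDOW] = {0}; /* private table, ideally kept in registers */
    float sum = 0.0f;

    assert(samples != NULL || count == 0);
    for (size_t dy = 0; dy < count; ++dy) {
        if (samples[dy] > threshold) /* the "if{}" of step 4 */
            shift_window(window, samples[dy]);
    }
    for (size_t i = 0; i < WINDOW; ++i)
        sum += window[i];
    return sum;
}
```

Calling `scan_column` on the samples 1, 2, 3 with a threshold of 1.5 shifts in only the last two values, so the returned sum is 5.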

@iwwu
Contributor

iwwu commented Aug 15, 2018

Hi Axel, can you share sample code for which CL2.0 worked around the issue?

@axeldavy
Author

I am no longer able to reproduce the issues I had when removing '-cl-std=CL2.0'. I used to have some code that would over-generate movs without the flag.

Either my code no longer reproduces the issue because of the modifications I have made since, or the issue was fixed in the driver. I will bisect to find out which.

@axeldavy
Author

I have managed to reproduce the issue by adapting an older version of my code. For some reason my more recent code does not trigger the issue. I will send the code to paigeale.

@paigeale
Contributor

Hello Axel. Thank you for the simple reproducer. I have found the extra movs you reported and identified their source. During our DeSSA pass we evaluate phi instructions using congruence classes to determine whether we can coalesce the operands of the phi. In the case you sent me, we see a lot of interference when trying to combine the phi operands, which is why these extra movs appear in the asm. I am working on a way to improve our DeSSA pass so that it constructs these congruence classes better. Thank you for your patience.

@paigeale
Contributor

Hello Axel. After investigating our DeSSA coalescing algorithm further, there is not much we can do in this case to improve the overall mov count. Phi elimination is NP-complete, which means we cannot guarantee the best outcome for every individual program, but our algorithm works well in most cases. What is unique in your case is the phi looping: the phi chosen as the lead node of the congruence class ends up interfering with many different phis, so we end up isolating each of them, which creates these additional movs. We cannot clean up the additional movs after the fact because we do not do any global coalescing, due to the structure of our compiler (there is no intermediate representation between LLVM and virtual asm). I would advise changing the kernel code at this point in time. If you need any recommendations on what to change, feel free to contact me via email. Thank you again for posting this issue, we look forward to hearing more from you!
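
To make the interference argument above a little more concrete, here is a toy model in plain C. It is only a sketch under strong assumptions and is not IGC's DeSSA pass: live ranges are reduced to hypothetical half-open integer intervals, a single congruence class is built greedily around one lead value, and every operand that overlaps a value already in the class is counted as isolated, meaning it would need its own mov. The names `LiveRange`, `interferes`, and `count_isolated` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLASS 32 /* arbitrary capacity for this toy model */

/* Hypothetical live range: half-open interval [start, end) of program points. */
typedef struct {
    int start;
    int end;
} LiveRange;

/* Two values interfere when their live ranges overlap. */
static bool interferes(LiveRange a, LiveRange b)
{
    return a.start < b.end && b.start < a.end;
}

/* Greedily grow one congruence class around `lead`.
 * An operand joins the class only if it interferes with no current member;
 * otherwise it is isolated, which costs one mov. Returns the mov count. */
static size_t count_isolated(LiveRange lead, const LiveRange *operands, size_t n)
{
    LiveRange members[MAX_CLASS];
    size_t member_count = 0;
    size_t movs = 0;

    assert(n < MAX_CLASS);
    members[member_count++] = lead;
    for (size_t i = 0; i < n; ++i) {
        bool clash = false;
        for (size_t m = 0; m < member_count && !clash; ++m)
            clash = interferes(operands[i], members[m]);
        if (clash)
            ++movs; /* operand stays outside the class */
        else
            members[member_count++] = operands[i];
    }
    return movs;
}
```

In this model, a lead with range [0, 10) and operands [10, 12), [5, 8), [12, 14) isolates only the second operand. Stretching the lead to [0, 100) makes it overlap all three, so every operand is isolated, which mirrors the situation where a long-lived lead node interferes with many phis.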


3 participants