[SM90] Change register allocation for TileN=208 to avoid spills #2219

Merged (1 commit) on Apr 21, 2025

tridao (Contributor) commented Apr 4, 2025

With the usual register allocation (producer 40, consumer 232), compiling a GEMM with tile shape 256 x 208 (cooperative) or 128 x 208 (pingpong) shows heavy register spilling (e.g. ~3000 bytes of spill). For this case we can change the register allocation to producer 24, consumer 240, which avoids the spills.
Cc @thakkarV @hwu36
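
Editor's note: the arithmetic behind the two allocations can be sketched as below. This is an illustration under stated assumptions, not the code in the PR: it assumes 3 warp groups of 128 threads (1 producer, 2 consumers), 64K 32-bit registers per SM, and the `setmaxnreg` rule that counts are multiples of 8 in [24, 256]. The names `fits_budget` and `pick_reg_alloc` are hypothetical, and the exact condition used in CUTLASS may differ from a plain `tile_n == 208` check.

```cpp
// Assumptions (not taken from the PR): 3 warp groups of 128 threads each
// (1 producer + 2 consumers) and 64K 32-bit registers per SM.
constexpr int kRegsPerSM = 64 * 1024;
constexpr int kThreadsPerWarpGroup = 128;
constexpr int kWarpGroups = 3;

// Per-thread register count available at launch, rounded down to a multiple of 8.
// 65536 / 384 = 170, rounded down to 168.
constexpr int launch_regs_per_thread() {
  return (kRegsPerSM / (kThreadsPerWarpGroup * kWarpGroups)) / 8 * 8;
}

// setmaxnreg accepts register counts in [24, 256] that are multiples of 8.
constexpr bool valid_reg_count(int r) {
  return r >= 24 && r <= 256 && r % 8 == 0;
}

// One producer warp group plus two consumer warp groups must not exceed
// the per-thread budget summed over the three warp groups (3 * 168 = 504).
constexpr bool fits_budget(int producer, int consumer) {
  return valid_reg_count(producer) && valid_reg_count(consumer) &&
         producer + 2 * consumer <= kWarpGroups * launch_regs_per_thread();
}

struct RegAlloc {
  int producer;
  int consumer;
};

// Hypothetical helper mirroring the rule described in this PR:
// TileN = 208 gets (24, 240), everything else keeps (40, 232).
constexpr RegAlloc pick_reg_alloc(int tile_n) {
  return tile_n == 208 ? RegAlloc{24, 240} : RegAlloc{40, 232};
}

static_assert(fits_budget(40, 232), "40 + 2 * 232 = 504 fits the budget");
static_assert(fits_budget(24, 240), "24 + 2 * 240 = 504 fits the budget");
static_assert(!fits_budget(40, 240), "40 + 2 * 240 = 520 exceeds the budget");
```

Under these assumptions both allocations use exactly the same total (504), so the change only moves 16 registers per thread from the producer warp group to give each consumer thread 8 more; it does not increase the kernel's overall register footprint.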

thakkarV (Collaborator) commented Apr 4, 2025

This is awesome! Thank you :D

thakkarV (Collaborator) commented Apr 4, 2025

@hwu36 @Junkai-Wu

thakkarV (Collaborator) commented Apr 4, 2025

Were you able to A/B test perf results, by any chance?

tridao (Contributor, Author) commented Apr 4, 2025

Perf-wise, nothing changes for the usual tile shapes since we keep the register allocation the same for those. For tile shape 256 x 208 it's obviously a lot faster without the massive spill (5-10x). In some niche cases where 256 x 208 fits the problem shape just right and there's minimal wave quantization, 256 x 208 can be a good choice.
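
Editor's note: a rough way to see when a tile shape "fits just right" is to count output tiles and waves. The sketch below is an illustration only. It assumes one CTA tile per SM per wave and 132 SMs, and the problem size 3072 x 2288 is made up for the example; none of these numbers come from the PR, and `wave_stats` is a hypothetical helper.

```cpp
// Integer ceiling division for positive operands.
constexpr long ceil_div(long a, long b) { return (a + b - 1) / b; }

struct WaveStats {
  long tiles;         // number of output tiles covering the M x N problem
  long waves;         // number of waves needed to run all tiles
  double efficiency;  // fraction of SM slots doing work across all waves
};

// Assumption: each SM runs one CTA tile per wave.
constexpr WaveStats wave_stats(long m, long n, long tile_m, long tile_n,
                               long num_sms) {
  return WaveStats{
      ceil_div(m, tile_m) * ceil_div(n, tile_n),
      ceil_div(ceil_div(m, tile_m) * ceil_div(n, tile_n), num_sms),
      static_cast<double>(ceil_div(m, tile_m) * ceil_div(n, tile_n)) /
          static_cast<double>(
              ceil_div(ceil_div(m, tile_m) * ceil_div(n, tile_n), num_sms) *
              num_sms)};
}

// Hypothetical problem: M = 3072, N = 2288 on an assumed 132 SMs.
// 256 x 208 tiles: 12 * 11 = 132 tiles, exactly one full wave.
static_assert(wave_stats(3072, 2288, 256, 208, 132).tiles == 132, "tiles");
static_assert(wave_stats(3072, 2288, 256, 208, 132).waves == 1, "waves");
// 256 x 128 tiles: 12 * 18 = 216 tiles, two waves with the second partly idle.
static_assert(wave_stats(3072, 2288, 256, 128, 132).tiles == 216, "tiles");
static_assert(wave_stats(3072, 2288, 256, 128, 132).waves == 2, "waves");
```

In this made-up case the 256 x 208 tiling fills a single wave completely, while 256 x 128 needs a second wave that leaves 48 of the 264 SM slots idle. This model ignores per-tile runtime and padding waste, so it is only a first-order check.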

@hwu36 hwu36 merged commit ade6376 into NVIDIA:main Apr 21, 2025
@tridao tridao deleted the tridao/regalloc-208 branch June 8, 2025 15:56
andralex pushed a commit to andralex/cutlass that referenced this pull request Jun 14, 2025
…IA#2219)

3 participants