unroll 2 loops, int64_t -> int, 309 µs #4
Merged
This PR has three very simple performance optimizations:
- `#pragma unroll`. Unrolling loops (especially the inner ones) is generally faster, but the compiler tends to be conservative with it because unrolling increases register pressure and therefore reduces occupancy. This conservatism ensures that even in the worst case the performance will not be terrible, but in my experience it is better to explicitly tell the compiler to unroll more loops. Refactor the loops as I did for two of them and add more `#pragma unroll` directives. Start with the inner loops and work your way outwards while monitoring performance and register pressure (see the first sketch after this list).
- `int64_t` to `int`. CUDA has 32-bit registers, so data types of that size are the fastest to work with. The maximum values of the loop variables are much smaller than the maximum value of `int`, so there is no benefit to using `int64_t`; it only makes the code slower (see the second sketch after this list).
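As an illustration of the first point, here is a minimal sketch (a hypothetical kernel, not the actual `flash_attn_ext_f16` code): refactor the loop so its trip count is a compile-time constant, then tell the compiler to unroll it fully.

```cuda
// Hypothetical example: `ne` (elements per thread) is a template
// parameter, so the trip count is known at compile time and
// #pragma unroll can unroll the loop completely.
template <int ne>
static __global__ void scale_f32(const float * x, float * dst, const float scale) {
    const int i0 = (blockIdx.x*blockDim.x + threadIdx.x)*ne;

#pragma unroll // fixed trip count -> full unroll
    for (int j = 0; j < ne; ++j) {
        dst[i0 + j] = scale*x[i0 + j];
    }
}
```

Register pressure can be monitored by compiling with `nvcc -Xptxas -v`, which prints the number of registers each kernel uses.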
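For the second point, a minimal before/after sketch (again a hypothetical kernel, not the PR's actual diff), assuming one thread per row:

```cuda
#include <cstdint>

// before: each int64_t index occupies two 32-bit registers and needs
// extra instructions for the 64-bit arithmetic
static __global__ void row_sum_i64(const float * x, float * dst, const int64_t ncols) {
    const int64_t row = blockIdx.x;
    float sum = 0.0f;
    for (int64_t col = 0; col < ncols; ++col) {
        sum += x[row*ncols + col];
    }
    dst[row] = sum;
}

// after: the bounds fit comfortably in 32 bits, so plain int maps
// directly onto the hardware's 32-bit registers
static __global__ void row_sum_i32(const float * x, float * dst, const int ncols) {
    const int row = blockIdx.x;
    float sum = 0.0f;
    for (int col = 0; col < ncols; ++col) {
        sum += x[row*ncols + col];
    }
    dst[row] = sum;
}
```

The only change between the two versions is the index type.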
On my system (1x RTX 3090) the runtime for the `flash_attn_ext_f16` kernel has decreased from 383 µs to 309 µs (1.23x faster).