[WIP] FP8 scaledMM with DeepSeek-style dequantization #453
TODO
Some background -
Recently, an FP8 ScaledMM was added to cutlass, but it doesn't currently satisfy DeepSeek's requirements for B-matrix dequantization/scaling. It shares its implementation with the cutlass mixed-dtype GEMM.
The original plan was to combine the source code of the two implementations using compile-time-evaluated conditionals, but due to some IGC bugs they are kept separate for now.
Both of those implementations are currently quite slow due to an IGC bug. Since I reused/copy-pasted the A-scaling code from there, the scaled MM in this PR is also slow at the moment.
A lot of code in this PR is duplicated and will be refactored later.
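For reference, here is a minimal NumPy sketch of the dequantization math this PR targets, assuming the DeepSeek-V3-style recipe: A carries one scale per token per 128-wide K group (`a_scale: [M, K/128]`) and B carries one scale per 128x128 block (`b_scale: [K/128, N/128]`). The function name `scaled_mm_ref` and the exact scale shapes are illustrative assumptions, not this PR's API:

```python
import numpy as np

BLK = 128  # group/block width assumed from the DeepSeek-style recipe

def scaled_mm_ref(a_fp8, b_fp8, a_scale, b_scale):
    """Reference (unoptimized) blockwise-scaled matmul.

    a_fp8:   [M, K] dequantized-to-float FP8 values of A
    b_fp8:   [K, N] dequantized-to-float FP8 values of B
    a_scale: [M, K // BLK] one scale per token per 128-wide K group
    b_scale: [K // BLK, N // BLK] one scale per 128x128 block of B
    """
    M, K = a_fp8.shape
    _, N = b_fp8.shape
    out = np.zeros((M, N), dtype=np.float32)
    # Accumulate one 128-deep K slice at a time; within each slice both
    # the A and B scales are constant, so they can be applied per slice.
    for kb in range(K // BLK):
        a_blk = a_fp8[:, kb * BLK:(kb + 1) * BLK].astype(np.float32)
        b_blk = b_fp8[kb * BLK:(kb + 1) * BLK, :].astype(np.float32)
        partial = a_blk @ b_blk  # [M, N] partial product for this K slice
        for nb in range(N // BLK):
            cols = slice(nb * BLK, (nb + 1) * BLK)
            # a_scale broadcasts per row; b_scale is one scalar per block.
            out[:, cols] += partial[:, cols] * a_scale[:, kb:kb + 1] * b_scale[kb, nb]
    return out
```

The key difference from the existing cutlass FP8 ScaledMM is the `b_scale[kb, nb]` term: B is scaled per 128x128 block rather than with a single per-tensor (or per-column) scale, which is why the existing kernel can't be reused as-is.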