use block size 128 and 256-bit loading #3289
Conversation
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
Code Review
This pull request updates the MXFP8 quantization kernels to process either 8 or 16 elements per thread, depending on the CUDA version. Key changes include the introduction of the MxFp8OutT type alias and an overload of fp32_vec_to_e4m3 that handles both uint64_t and uint4 output formats. Additionally, the block size for invokeMxFP8Quantization is now fixed at 128. There were no review comments to assess, so I have no further feedback.
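To illustrate the two output widths the review mentions: 8 quantized e4m3 bytes fit in a uint64_t (64 bits), while 16 fit in a uint4 (128 bits), which is why fp32_vec_to_e4m3 needs an overload for each. Below is a hedged, host-side C++ sketch of the idea; the encoder is a simplified software reference for OCP FP8 E4M3 (bias 7, max finite 448, no infinities), not FlashInfer's actual device intrinsics, and it rounds half away from zero rather than round-to-nearest-even.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Simplified software reference for fp32 -> e4m3 (OCP FP8: 1 sign, 4 exp, 3 mant,
// bias 7, max finite 448, 0x7F = NaN). Illustrative only; real kernels use
// hardware conversion instructions and round-to-nearest-even.
uint8_t fp32_to_e4m3(float x) {
    uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    float a = std::fabs(x);
    if (std::isnan(a)) return sign | 0x7F;
    if (a > 448.0f) return sign | 0x7E;          // saturate to max finite 448
    if (a == 0.0f) return sign;
    int e;
    float m = std::frexp(a, &e);                 // a = m * 2^e, m in [0.5, 1)
    int exp = e - 1;                             // exponent with mantissa in [1, 2)
    if (exp < -6) {                              // subnormal range: units of 2^-9
        int mant = (int)std::lround(a * 512.0f);
        return sign | (uint8_t)mant;             // mant == 8 rolls into min normal
    }
    int mant = (int)std::lround((m * 2.0f - 1.0f) * 8.0f);  // 3-bit mantissa
    if (mant == 8) { mant = 0; exp += 1; }                  // rounding carried over
    return sign | (uint8_t)(((exp + 7) << 3) | mant);
}

// 8 elements per thread: pack 8 e4m3 bytes into one 64-bit store,
// mirroring the uint64_t overload described in the review.
uint64_t pack8_e4m3(const float* v) {
    uint64_t out = 0;
    for (int i = 0; i < 8; ++i)
        out |= (uint64_t)fp32_to_e4m3(v[i]) << (8 * i);
    return out;
}

// 16 elements per thread: two 64-bit halves, i.e. one 128-bit (uint4-sized) store.
struct Packed128 { uint64_t lo, hi; };
Packed128 pack16_e4m3(const float* v) {
    return Packed128{pack8_e4m3(v), pack8_e4m3(v + 8)};
}
```

For example, 1.0f encodes as 0x38 (exponent field 7, mantissa 0), so packing eight ones yields 0x3838383838383838.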
📌 Description
Apply the optimizations from the (per-token) FP4 quantization kernel to the MXFP8 quantization kernel: fix the block size at 128 and use 256-bit loads.
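The launch arithmetic behind these two optimizations can be sketched as follows. This is a hypothetical host-side helper, not FlashInfer's actual API: a 256-bit load moves 32 bytes per instruction, so for a 16-bit input type (half/bfloat16) one load covers 16 elements, versus 8 elements with narrower loads; combined with the fixed block size of 128, each block then covers 2048 or 1024 elements.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative launch-geometry helpers (names are assumptions, not FlashInfer's).
constexpr int kBlockSize = 128;  // the PR fixes the thread-block size at 128

// With 256-bit loads, a thread reads 32 bytes of 16-bit input at once,
// i.e. 16 elements; otherwise it processes 8 elements per thread.
constexpr int elemsPerThread(bool use256BitLoad) {
    return use256BitLoad ? 16 : 8;
}

// Grid size: ceil-divide the element count by the per-block coverage.
constexpr int64_t numBlocks(int64_t numElems, bool use256BitLoad) {
    const int64_t perBlock =
        (int64_t)kBlockSize * elemsPerThread(use256BitLoad);
    return (numElems + perBlock - 1) / perBlock;
}
```

With 4096 elements, 256-bit loads halve the grid from 4 blocks to 2, since each block's coverage doubles from 1024 to 2048 elements.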
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed pre-commit by running pip install pre-commit (or used your preferred method).
- I have installed the hooks with pre-commit install.
- I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- All tests are passing (unittest, etc.).

Reviewer Notes