fix fp8 kv cache dequantize kernels #3896

mxz297 · 2025-03-28T21:56:26Z

Summary:
Fix fp8 kv cache dequantization kernel and enable unit test on AMD.

The kernel assigns one warp per head, and each thread dequantizes 4 elements of both K and V. The head dim is always 128, so on NV this works because one warp has 32 threads (4 * 32 = 128).

On AMD, each wavefront (warp) has 64 threads, so threads 32 ~ 63 all do out-of-bound memory accesses.
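To make the arithmetic concrete, here is a minimal host-side Python sketch of the lane-to-element mapping described above. It only models the indexing, not the actual kernel, and the names `HEAD_DIM`, `ELEMS_PER_THREAD`, and `out_of_bounds_lanes` are hypothetical, chosen for illustration.

```python
HEAD_DIM = 128          # head dim stated in the summary
ELEMS_PER_THREAD = 4    # elements each thread dequantizes (per K and per V)


def out_of_bounds_lanes(warp_size):
    """Return the lanes whose 4-element slice extends past HEAD_DIM."""
    return [
        lane
        for lane in range(warp_size)
        if lane * ELEMS_PER_THREAD + ELEMS_PER_THREAD > HEAD_DIM
    ]


nv_oob = out_of_bounds_lanes(32)   # 32-thread warp
amd_oob = out_of_bounds_lanes(64)  # 64-thread wavefront

print(len(nv_oob), len(amd_oob))  # prints "0 32"
```

With a 32-thread warp every lane stays inside the 128-element head; with a 64-thread wavefront, lanes 32 through 63 start at offset 128 or beyond.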

This diff simply masks those threads so they do nothing. That leaves half of each wavefront idle, so the perf is not ideal, but from E2E testing it does not seem to matter. If we need to optimize the perf for AMD, we can let threads 0 ~ 31 dequantize 4 elements for K and threads 32 ~ 63 dequantize 4 elements for V.
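The two options can be compared with a small Python model of the work assignment. This is a sketch under assumptions, not the kernel code: `masked_plan`, `split_plan`, and `covered` are hypothetical helpers, and `split_plan` assumes a wavefront of exactly 64 threads.

```python
HEAD_DIM = 128
ELEMS_PER_THREAD = 4
ACTIVE_LANES = HEAD_DIM // ELEMS_PER_THREAD  # 32 lanes cover one head


def masked_plan(warp_size):
    """Approach in this diff: lanes >= 32 do nothing, the rest handle K and V."""
    plan = {}
    for lane in range(warp_size):
        if lane >= ACTIVE_LANES:
            continue  # masked-off lane, no memory access
        start = lane * ELEMS_PER_THREAD
        plan[lane] = [("K", start), ("V", start)]
    return plan


def split_plan(warp_size=64):
    """Possible AMD optimization: lanes 0-31 take K, lanes 32-63 take V."""
    plan = {}
    for lane in range(warp_size):
        tensor = "K" if lane < ACTIVE_LANES else "V"
        start = (lane % ACTIVE_LANES) * ELEMS_PER_THREAD
        plan[lane] = [(tensor, start)]
    return plan


def covered(plan):
    """Set of (tensor, element index) pairs touched by a plan."""
    return {
        (tensor, start + i)
        for work in plan.values()
        for tensor, start in work
        for i in range(ELEMS_PER_THREAD)
    }


full_head = {(t, i) for t in ("K", "V") for i in range(HEAD_DIM)}
print(covered(masked_plan(64)) == full_head, covered(split_plan()) == full_head)
```

Both plans touch every K and V element exactly within bounds; the masked plan uses 32 lanes doing two slices each, while the split plan uses all 64 lanes doing one slice each.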

Differential Revision: D72062745

facebook-github-bot · 2025-03-28T21:56:40Z

This pull request was exported from Phabricator. Differential Revision: D72062745

netlify · 2025-03-28T21:56:49Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`8556aa5`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67e71b0d5fa2a90008d70819
😎 Deploy Preview	https://deploy-preview-3896--pytorch-fbgemm-docs.netlify.app

facebook-github-bot · 2025-03-30T06:38:36Z

This pull request has been merged in a303797.

facebook-github-bot · 2025-04-02T04:29:53Z

This pull request has been reverted by 47635cf.

X-link: pytorch#3896
Pull Request resolved: facebookresearch/FBGEMM#987
Reviewed By: Aya-ZIbra
Differential Revision: D72062745
fbshipit-source-id: 1b813057586054a13df4e9088be00b08f912bc57

facebook-github-bot added the cla signed label Mar 28, 2025

facebook-github-bot added the fb-exported label Mar 28, 2025

facebook-github-bot closed this in a303797 Mar 30, 2025

facebook-github-bot added the Merged label Mar 30, 2025

facebook-github-bot added the Reverted label Apr 2, 2025

q10 added category:fix feature:genai labels Apr 4, 2025
