First version of FP8 scaled_mm. #428
Conversation
@cfgfung, I'll defer to the cutlass team for review, and will add some comments later.
But maybe you can change the state of this PR to draft and rebase it later.
The corresponding integration code with IPEX is failing accuracy, BTW.
You may wish to follow up with the IPEX team and check the root cause of the failing accuracy. This code passes the verify() unit test. I have granted the rights to you, so you should be able to commit/change the code.
 *
 **************************************************************************************************/
/*
 * This implements the scaled_mm for W8A8 GEMM (FP8 weights and activations) using FP16 compute as a workaround,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is scaled_mm? Can you express that in CUTLASS terms?
Thanks for the review.
This means matrix multiplication with scaling factors.
(scaleA.*TensorA) @ (scaleB.*TensorB)
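For concreteness, a minimal reference sketch of that definition (illustrative names, not code from this PR; element and scale types are assumed to be convertible to float):

```cpp
#include <cstddef>

// Reference semantics of scaled_mm: D = (scale_a .* A) @ (scale_b .* B).
// Row-major layouts; scale_a/scale_b have the same shape as A/B.
template <typename T, typename ScaleT>
void scaled_mm_reference(const T* A, const ScaleT* scale_a,
                         const T* B, const ScaleT* scale_b,
                         float* D, std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.f;
      for (std::size_t k = 0; k < K; ++k) {
        float a = float(A[m * K + k]) * float(scale_a[m * K + k]);
        float b = float(B[k * N + n]) * float(scale_b[k * N + n]);
        acc += a * b;
      }
      D[m * N + n] = acc;
    }
  }
}
```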
@cfgfung, please add info on the granularity of scales in the code as well as the PR description, since this scaling is specific to DeepSeek.
For example, for 128x128 A & B blocks, the A scale is applied at the granularity of a per-token, per-128-channel sub-vector of an A block, and the B scale is applied to the whole B block.
For this PR, however, this granularity doesn't matter to the collective & the kernel, because the CUTLASS library user has to ensure that the scales of A & B are of the same size as A & B, so the scaling in this PR is elementwise.
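For concreteness, a sketch of how that DeepSeek-style granularity maps scales onto elements (hypothetical helpers and layouts, not code from this PR; assumes 128x128 quantization blocks and row-major scale tensors):

```cpp
// Illustrative only: scale_a is M x ceil(K/128), scale_b is ceil(K/128) x ceil(N/128).
constexpr int kBlk = 128;

// Scale for A[m][k]: per-token (row m), per-128-channel group along k.
inline float a_scale_at(const float* scale_a, int m, int k, int K) {
  int k_blocks = (K + kBlk - 1) / kBlk;
  return scale_a[m * k_blocks + k / kBlk];
}

// Scale for B[k][n]: one scale per whole 128x128 block of B.
inline float b_scale_at(const float* scale_b, int k, int n, int N) {
  int n_blocks = (N + kBlk - 1) / kBlk;
  return scale_b[(k / kBlk) * n_blocks + n / kBlk];
}
```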
I'll make it more generic in my subsequent PR, so that various workgroup tile-sizes could work with various quantization block sizes.
However, can you please rename the collective & kernel files to indicate that the quantization is supposed to be DeepSeek-style block-wise? In that case, I could modify the same files. Thanks
Shall we call this scaled F8 GEMM? I do not think weights and activations are terms that are relevant at the CUTLASS level. I am also not against adding more details about how the scaling is done.
Hi @t4c1, this implementation requires users to pass scales at the granularity of each element of the A or B matrices (so the scale matrices also have to be of the same size as A or B). That makes this implementation quantization-scheme agnostic. It can't be used in real workloads, though.
template <typename SrcT, typename ScaleT>
void elementwise_multiply_scale(SrcT* d_src, size_t size, ScaleT* d_scale){
  SrcT* h_src_multiplied = new SrcT[size];
use std::vector instead of directly calling new
sycl::memcpy() cannot take a vector as the argument.
I added delete[] instead to handle the memory issue.
sycl::memcpy() cannot take the vector as the argument
The pointer backing a vector (vector.data()) could have been used, though.
Regardless, it doesn't matter, as we shouldn't do this computation on CPU, so please change it to something like https://github.com/mehdi-goli/cutlass-fork/blob/884fa6a8b94adddba9f32b41b6a9d011e1642217/examples/sycl/06_bmg_flash_attention/bmg_flash_attn_decode_runner.hpp#L202-L207
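A possible device-side version along those lines (a sketch only, not the code from the linked example; assumes USM device pointers, a sycl::queue q, and element types convertible to/from float):

```cpp
#include <sycl/sycl.hpp>
#include <cstddef>

// Sketch: do the scaling on the device instead of round-tripping through
// host memory with new[]/memcpy. d_src and d_scale are device (USM) pointers.
template <typename SrcT, typename ScaleT>
void elementwise_multiply_scale(sycl::queue& q, SrcT* d_src, std::size_t size,
                                const ScaleT* d_scale) {
  q.parallel_for(sycl::range<1>(size), [=](sycl::id<1> i) {
     d_src[i] = static_cast<SrcT>(static_cast<float>(d_src[i]) *
                                  static_cast<float>(d_scale[i]));
   }).wait();
}
```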
This comment was not addressed. Is anything unclear here?
This will enable the W8A8 block-quantized matmul for the DeepSeek-R1 model.
ef7ff20 to 2573e3b (Compare)
@t4c1 @sanchitintel
Please add some tests.
Also, can you clarify in what way this differs from the changes in #450?
using GmemTiledCopyScaleA = XE_2D_U16x32x32_LD_N; // Have to use the same shape size as FP8 used in the kernel
using GmemTiledCopyScaleB = XE_2D_U16x32x32_LD_N; // Have to use the same shape size as FP8 used in the kernel
using GmemTiledCopyScaleA = XE_2D_U16x32x32_LD_N; // Shape of the copy atom for scales A must match shape of the copy atom for A in the number of elements
using GmemTiledCopyScaleB = XE_2D_U16x32x32_LD_N; // Shape of the copy atom for scales A must match shape of the copy atom for A in the number of elements
Suggested change:
- using GmemTiledCopyScaleB = XE_2D_U16x32x32_LD_N; // Shape of the copy atom for scales A must match shape of the copy atom for A in the number of elements
+ using GmemTiledCopyScaleB = XE_2D_U16x32x32_LD_N; // Shape of the copy atom for scales B must match shape of the copy atom for B in the number of elements
Closing this, as it is duplicating efforts with others.
This is the initial implementation of scaled_mm(), enabling element-wise scaling for the A & B matrices (the scales of A & B have to be of the same size as A or B, so this implementation is agnostic of the quantization scheme). Based on the existing (unscaled) W8A8 GEMM, this version focuses on delivering core functionality.
@sanchitintel will use this PR as a foundation for further optimizations in subsequent PRs:
Currently, the scale matrices have to be the same size as A and B. They are created on the FW side & passed to cutlass, so there's a creation overhead, the memory-bandwidth requirements increase, and register spilling also happens. Ideally, the framework should pass the scales as-is.
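For illustration, the framework-side expansion that causes this overhead could look roughly like the following (hypothetical helper, not part of this PR): per-block scales are broadcast into a full-size matrix before being handed to CUTLASS.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical framework-side helper: expand per-block scales (one value per
// 128x128 block of B) into a full K x N matrix so the kernel in this PR can
// consume them elementwise. This allocation/broadcast is the overhead noted above.
std::vector<float> expand_b_scales(const std::vector<float>& block_scales,
                                   std::size_t K, std::size_t N, std::size_t blk = 128) {
  std::size_t n_blocks = (N + blk - 1) / blk;
  std::vector<float> full(K * N);
  for (std::size_t k = 0; k < K; ++k)
    for (std::size_t n = 0; n < N; ++n)
      full[k * N + n] = block_scales[(k / blk) * n_blocks + n / blk];
  return full;
}
```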