[BUG] XE_2D_U8x32x32_LD_N for copying A doesn't work for FP8 GEMM & for mixed dtype GEMM examples #357

Open
sanchitintel opened this issue May 5, 2025 · 8 comments
Labels
bug Something isn't working

Comments

@sanchitintel

sanchitintel commented May 5, 2025

Describe the bug
If GmemTiledCopyA in

using GmemTiledCopyA = XE_2D_U8x32x32_LD_V;
is replaced with XE_2D_U8x32x32_LD_N, then the example doesn't run successfully.

The same behavior is observed when a similar change is made to the corresponding definition in the other affected example:

using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
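
For concreteness, a minimal sketch of the swap being described (the surrounding example code is omitted; only the alias shown above changes):

// Works today: A is loaded with the VNNI (_V) block load.
using GmemTiledCopyA = XE_2D_U8x32x32_LD_V;

// Swapping in the plain (_N) block load is what makes the example fail:
// using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;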

Expected behavior
The examples should work even if A is not loaded in VNNI format.

Environment details (please complete the following information):
Intel GPU Max 1550

Additional context

#245 was opened earlier to fix this problem.

Also, I'm not sure how large the performance penalty of loading A in VNNI format instead of plain format is, so this issue might be low priority. For BF16xBF16 GEMM, reading A in VNNI format results in a ~14% performance penalty in the default example with the default input shapes.

Alternative

@pengzhao-intel suggested a workaround: use some XE_2D_U16 copy atoms for FP8 & int8, and then change the layout to the desired one in the code.

Thanks!

@sanchitintel sanchitintel added the bug Something isn't working label May 5, 2025
@aacostadiaz
Collaborator

Thanks for raising this issue. We'll look into adding the copy operation from #245.

Using the current XE_2D_U8x32x32_LD_N doesn’t work because it loads data in the layout required for int8 GEMM (with MMA_XE_8x16x32_S32S8S8S32_TT), which is different from the layout needed for the bf16/fp16 MMA operations used in mixed precision or FP8 GEMM.

Similarly, using XE_2D_U16 and then changing the layout after loading would require shuffling data between threads to get the format needed for the bf16/fp16 MMA. That would significantly impact performance.
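
To make the layout mismatch concrete, here is a hedged sketch of the pairing being described (the alias names are illustrative, not taken verbatim from the examples):

// int8 GEMM: XE_2D_U8x32x32_LD_N loads A in exactly the fragment layout that the
// int8 MMA atom (MMA_XE_8x16x32_S32S8S8S32_TT) consumes.
using GmemTiledCopyA_Int8Gemm = XE_2D_U8x32x32_LD_N;

// FP8 / mixed-dtype GEMM: the operands are fed to bf16/fp16 MMA atoms, which expect
// a different A-fragment layout, so the LD_N layout above does not fit. The _V (VNNI)
// load produces a layout those pipelines can consume, which appears to be why the
// examples currently use it for A even though _V is normally a B-side load.
using GmemTiledCopyA_Fp8Gemm = XE_2D_U8x32x32_LD_V;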

@sanchitintel
Author

sanchitintel commented May 9, 2025

Hi @aacostadiaz, thanks a lot for clarifying!

Can you please confirm whether the _V suffix means VNNI format? (I read that in the description of another issue here, but I'm not sure now.)
Also, for using the XMX engines, please advise whether B should be in VNNI format and A in plain format (that's the case for AMX instructions on Xeon, which are said to be similar to XMX in Intel GPUs). Thanks!

@aacostadiaz
Collaborator

Exactly. _V stands for VNNI and should be used only for loading matrix B in row-major layout. _N is for loading matrix A in row-major layout. _T is for loading matrices A and B in column-major layout.
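
Spelled out as code, the convention looks roughly like this (a minimal sketch; the U16 32x32 block size is just an illustrative choice):

// Row-major A: plain (_N) block load.
using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
// Row-major B: VNNI (_V) block load.
using GmemTiledCopyB = XE_2D_U16x32x32_LD_V;
// Column-major A or B: the corresponding _T (transposed) atoms are used instead.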

We are going to rename all these operations soon to match the naming used in SPV_INTEL_2d_block_io (https://github.khronos.org/SPIRV-Registry/extensions/INTEL/SPV_INTEL_2d_block_io.html). We will also add some static assertions to prevent misuse like the one in this issue.

@sanchitintel
Author

sanchitintel commented May 16, 2025

Thanks again for elaborating, @aacostadiaz!

Looks like #374 will fix this issue. :)
I have an older IGC on my end (it doesn't support SPV_INTEL_2d_block_io), so I haven't been able to test it with the FP8 GEMM yet, but will do so ASAP.

Thanks again for your help!

@aacostadiaz
Collaborator

You can use this CMake option to use the builtins instead of SPIR-V:

-DCUTLASS_SYCL_BUILTIN_ENABLE=ON

@sanchitintel
Author

sanchitintel commented May 21, 2025

@aacostadiaz, with that PR, when -DCUTLASS_SYCL_BUILTIN_ENABLE=ON is used, the FP8xFP8 GEMM performance is slightly worse when XE_2D_U8x32x32_LD_N is used to load A matrix elements.

@aacostadiaz
Collaborator

I updated the PR. It should have the same functionality and performance as before. It looks like the copy operation in XE_2D_U8x32x32_LD_V is a bit faster than the one added in XE_2D_U8x32x32_LD_N, which is unexpected since we are loading the A matrix. I changed XE_2D_U8x32x32_LD_N to use the exact same operation we use for XE_2D_U8x32x32_LD_V, so you should get the same performance now. We'll investigate this further.

@sanchitintel
Author

Thanks for the info, @aacostadiaz!

__builtin_IB_subgroup_block_read_flat_u8_m32k16v2 doesn't show up in the documentation (although it works), whereas similar intrinsics can be found on GitHub. Is it possible that it was left out of the documentation because its performance is currently not very good?

@rolandschulz may also have some input on this.

Thanks!
