[BUG] XE_2D_U8x32x32_LD_N for copying A doesn't work for FP8 GEMM & for mixed dtype GEMM examples #357

Open
sanchitintel opened this issue May 5, 2025 · 8 comments
Labels
bug Something isn't working

Comments

@sanchitintel

sanchitintel commented May 5, 2025

Describe the bug
If GmemTiledCopyA in

using GmemTiledCopyA = XE_2D_U8x32x32_LD_V;
is replaced with XE_2D_U8x32x32_LD_N, then the example doesn't run successfully.

The same behavior is observed when a similar change is made to the corresponding definition in the other affected example:

using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
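
For concreteness, a minimal sketch of the swap being described (the surrounding example code is omitted; only the alias shown above changes):

// Works today: A is loaded with the VNNI (_V) block load.
using GmemTiledCopyA = XE_2D_U8x32x32_LD_V;

// Swapping in the plain (_N) block load is what makes the example fail:
// using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;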

Expected behavior
The examples should work even if A is not loaded in VNNI format.

Environment details (please complete the following information):
Intel GPU Max 1550

Additional context

#245 was opened earlier to fix this problem.

Also, I'm not sure how large the performance penalty of loading A in VNNI format instead of plain format is, so this issue might be low priority. For BF16xBF16 GEMM, reading A in VNNI format results in a ~14% performance penalty in the default example with the default input shapes.

Alternative

@pengzhao-intel suggested a workaround: use some XE_2D_U16 copy atoms for FP8 & int8, and then change the layout to the desired one in the code.

Thanks!

@sanchitintel sanchitintel added the bug Something isn't working label May 5, 2025
@aacostadiaz
Collaborator

Thanks for raising this issue. We'll look into adding the copy operation from #245.

Using the current XE_2D_U8x32x32_LD_N doesn’t work because it loads data in the layout required for int8 GEMM (with MMA_XE_8x16x32_S32S8S8S32_TT), which is different from the layout needed for the bf16/fp16 MMA operations used in mixed precision or FP8 GEMM.

Similarly, using XE_2D_U16 and then changing the layout after loading would require shuffling data between threads to get the format needed for the bf16/fp16 MMA. That would significantly impact performance.
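
To make the layout mismatch concrete, here is a hedged sketch of the pairing being described (the alias names are illustrative, not taken verbatim from the examples):

// int8 GEMM: XE_2D_U8x32x32_LD_N loads A in exactly the fragment layout that the
// int8 MMA atom (MMA_XE_8x16x32_S32S8S8S32_TT) consumes.
using GmemTiledCopyA_Int8Gemm = XE_2D_U8x32x32_LD_N;

// FP8 / mixed-dtype GEMM: the operands are fed to bf16/fp16 MMA atoms, which expect
// a different A-fragment layout, so the LD_N layout above does not fit. The _V (VNNI)
// load produces a layout those pipelines can consume, which appears to be why the
// examples currently use it for A even though _V is normally a B-side load.
using GmemTiledCopyA_Fp8Gemm = XE_2D_U8x32x32_LD_V;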

@sanchitintel
Author

sanchitintel commented May 9, 2025

Hi @aacostadiaz, thanks a lot for clarifying!

Can you please confirm whether the _V suffix means VNNI format? (I read that in the description of another issue here, but I'm not sure now.)
Also, for using the XMX engines, please advise whether B should be in VNNI format and A in plain format (that's the case for AMX instructions on Xeon, which are said to be similar to XMX in Intel GPUs). Thanks!

@aacostadiaz
Collaborator

Exactly. _V stands for VNNI and should be used only for loading matrix B in row-major layout. _N is for loading matrix A in row-major layout. _T is for loading matrices A and B in column-major layout.
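
Spelled out as code, the convention looks roughly like this (a minimal sketch; the U16 32x32 block size is just an illustrative choice):

// Row-major A: plain (_N) block load.
using GmemTiledCopyA = XE_2D_U16x32x32_LD_N;
// Row-major B: VNNI (_V) block load.
using GmemTiledCopyB = XE_2D_U16x32x32_LD_V;
// Column-major A or B: the corresponding _T (transposed) atoms are used instead.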

We are going to rename all these operations soon to match the naming used in SPV_INTEL_2d_block_io (https://github.khronos.org/SPIRV-Registry/extensions/INTEL/SPV_INTEL_2d_block_io.html). We will also add some static assertions to prevent misuse like the one in this issue.

@sanchitintel
Author

sanchitintel commented May 16, 2025

Thanks again for elaborating, @aacostadiaz!

Looks like #374 will fix this issue. :)
I have an older IGC on my end (it doesn't support SPV_INTEL_2d_block_io), so I haven't been able to test it with the FP8 GEMM yet, but will do so ASAP.

Thanks again for your help!

@aacostadiaz
Collaborator

You can use this CMake option to use the builtins instead of SPIR-V:

-DCUTLASS_SYCL_BUILTIN_ENABLE=ON

@sanchitintel
Author

sanchitintel commented May 21, 2025

@aacostadiaz, with that PR, when -DCUTLASS_SYCL_BUILTIN_ENABLE=ON is used, the FP8xFP8 GEMM performance is slightly worse when XE_2D_U8x32x32_LD_N is used to load A matrix elements.

@aacostadiaz
Collaborator

I updated the PR. It should have the same functionality and performance as before. It looks like the copy operation in XE_2D_U8x32x32_LD_V is a bit faster than the one added in XE_2D_U8x32x32_LD_N, which is unexpected since we are loading the A matrix. I changed XE_2D_U8x32x32_LD_N to use the exact same operation we use for XE_2D_U8x32x32_LD_V, so you should get the same performance now. We'll investigate this further.

@sanchitintel
Author

Thanks for the info, @aacostadiaz!

__builtin_IB_subgroup_block_read_flat_u8_m32k16v2 doesn't show up in the documentation (although it works), whereas similar intrinsics can be found on GitHub. Is it possible that it was left out of the documentation because its performance is currently not very good?

@rolandschulz may also have some input on this.

Thanks!
