[BUG] XE_2D_U8x32x32_LD_N for copying A doesn't work for FP8 GEMM & for mixed dtype GEMM examples #357
Comments
Thanks for raising this issue. We'll look into adding the copy operation from #245. Using the current XE_2D_U8x32x32_LD_N doesn’t work because it loads data in the layout required for int8 GEMM (with MMA_XE_8x16x32_S32S8S8S32_TT), which is different from the layout needed for the bf16/fp16 MMA operations used in mixed precision or FP8 GEMM. Similarly, using XE_2D_U16 and then changing the layout after loading would require shuffling data between threads to get the format needed for the bf16/fp16 MMA. That would significantly impact performance.
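To make the layout coupling concrete, here is a minimal sketch of the pairing described above; the alias names are hypothetical stand-ins for the examples' configuration, and includes are omitted:

```cpp
// Illustrative only: the copy atom chosen for A must produce the register layout
// that the MMA atom consuming A expects.

// int8 GEMM: XE_2D_U8x32x32_LD_N loads A exactly as the S8 MMA wants it.
using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;           // plain 2D block load of A
using MmaOp          = MMA_XE_8x16x32_S32S8S8S32_TT;  // consumes that layout directly

// FP8 / mixed-dtype GEMM: A is upconverted and fed to a bf16/fp16 MMA, which expects a
// different register layout, so XE_2D_U8x32x32_LD_N cannot be reused for A there.
```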
Hi @aacostadiaz, thanks a lot for clarifying! Can you please confirm whether the _V copy atoms are only meant for loading B in VNNI format?
Exactly. _V stands for VNNI and should be used only for loading matrix B in row-major layout. _N is for loading matrix A in row-major layout. _T is for loading matrices A and B in column-major layout. We are going to rename all these operations soon to match the naming used in https://github.khronos.org/SPIRV-Registry/extensions/INTEL/SPV_INTEL_2d_block_io.html We will also add some static assertions to prevent misuse like the one in this issue.
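As a quick reference, a sketch of that suffix convention using the atoms named in this thread (the _T case is described only in a comment, since no _T atom is named here):

```cpp
// Suffix convention for the Xe 2D block-load copy atoms, per the comment above:
using CopyA_RowMajor = XE_2D_U8x32x32_LD_N;  // _N: plain load, for matrix A stored row-major
using CopyB_RowMajor = XE_2D_U8x32x32_LD_V;  // _V: VNNI load, only for matrix B stored row-major
// _T: transposed load, for matrix A or B stored column-major.
```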
Thanks again for elaborating, @aacostadiaz! Looks like #374 will fix this issue. :) Thanks for your help!
You can use this CMake option to use the builtins instead of SPIR-V: -DCUTLASS_SYCL_BUILTIN_ENABLE=ON
@aacostadiaz, with that PR, when XE_2D_U8x32x32_LD_N is used for copying A, the performance seems lower than with XE_2D_U8x32x32_LD_V. Is that expected?
I updated the PR. It should have the same functionality and performance as before. It looks like the copy operation in XE_2D_U8x32x32_LD_V is a bit faster than the one added in XE_2D_U8x32x32_LD_N, which is unexpected since we are loading the A matrix. I changed XE_2D_U8x32x32_LD_N to use the exact same operation we use for XE_2D_U8x32x32_LD_V, so you should get the same performance now. We'll investigate this further.
Thanks for the info, @aacostadiaz!
@rolandschulz may also have some input on this. Thanks!
Describe the bug
If GmemTiledCopyA in cutlass-sycl/examples/sycl/08_pvc_gemm_f8/08_pvc_gemm_f8.cpp (line 360 at d48e30a) is changed to XE_2D_U8x32x32_LD_N, then the example doesn't run successfully. The same behavior is observed when a similar change is made in cutlass-sycl/examples/sycl/02_pvc_gemm_mixed_dtype/02_pvc_gemm_mixed_dtype.cpp (line 261 at d48e30a).
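For concreteness, this is the kind of one-line change being described (a sketch; it assumes the example currently uses XE_2D_U8x32x32_LD_V for A, as the VNNI discussion below suggests, and the surrounding code is not reproduced exactly):

```cpp
// 08_pvc_gemm_f8.cpp, around line 360 (sketch):
// using GmemTiledCopyA = XE_2D_U8x32x32_LD_V;  // original: VNNI-style load of A, works
using GmemTiledCopyA = XE_2D_U8x32x32_LD_N;     // changed: plain load of A, example fails
```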
Expected behavior
The examples should work even if A is not loaded in VNNI format.
Environment details (please complete the following information):
Intel GPU Max 1550
Additional context
#245 was opened earlier to fix this problem.
Also, I'm not sure what the performance penalty of loading A in VNNI format instead of plain format is, so this issue might be low priority. For BF16xBF16 GEMM, reading A in VNNI format results in a ~14% performance penalty in the default example with the default input shapes.
Alternative
@pengzhao-intel suggested a workaround: use some XE_2D_U16 copy atoms for FP8 & int8, but then change the layout to the desired one afterwards in the code.
Thanks!