
[QST] Ambiguous error message when tuning the Tileshape #376


Open
cfgfung opened this issue May 14, 2025 · 7 comments
Labels
question Further information is requested

Comments

@cfgfung

cfgfung commented May 14, 2025

This happened when I was trying to change the tile shape of example 08_pvc_gemm_fp8.cpp.

I modified lines 363 and 368 with the following code:

using TileShape = Shape<_128, _128, _32>;  // changed the M and N tile shapes

  using TiledMma =
      typename TiledMMAHelper<MMA_Atom<XE_8x16x16_F32F16F16F32_TT>, Layout<TileShape>,
      Layout<Shape<_8, _4, _1>, Stride<_2, _1, _0>>>::TiledMMA; // changed the stride from 4 to 2 to adapt the tile shape

Then it gave the following error message:

/include/cute/atom/copy_traits_xe.hpp:441:17: error: static assertion failed due to requirement 'size(res) > 0': Error in make_fragment_layout(), tile size might be smaller than copy atom
441 | static_assert(size(res) > 0, "Error in make_fragment_layout(), tile size might be smaller than copy atom");

The shape of the specified GMem tile is 32x32 (XE_2D_U8x32x32_LD_V), which is larger than the MMA_Atom size (XE_8x16x16_F32F16F16F32_TT).

What is the true meaning of this error message?

It would also be very helpful if you could shed some light on the complex relationship between TileShape, GMemTiledCopy and TiledMMA. Right now I am not sure how to define the remaining parameters for a given tile shape.

@cfgfung cfgfung added the question Further information is requested label May 14, 2025
@sanchitintel

changed the stride from 4 to 2 to adapt the tileshape

Sorry, but why is this change needed?

@cfgfung
Author

cfgfung commented May 15, 2025

changed the stride from 4 to 2 to adapt the tileshape

Sorry, but why is this change needed?

Thanks for the prompt response. I think I confused it with the definition of Nvidia's TiledMMA. Should this be 4 in this case?

@sanchitintel

sanchitintel commented May 15, 2025

Should this be 4 in this case?

Yes, I think so.

FWIW, the following change works by modifying the BF16xBF16 GEMM example -

[screenshot: the modified BF16xBF16 GEMM example]

but a similar change for FP8 (with U16 copy atoms changed to U8 copy atoms) doesn't work by modifying the FP8xFP8 GEMM because of some error related to copy atoms. It'd be great if it could be resolved, though, so that we may be able to use smaller TiledMMA tile-sizes with FP8 GEMMs.

@joeatodd
Collaborator

Hello,

Just to briefly explain:

Layout<Shape<_8, _4, _1>, Stride<_4, _1, _0>>

This is a Layout describing how many sub-groups we have in our work-group, and how they are arranged in M,N,K dimensions. In this example we have 8x4x1 (32) subgroups, with 8 in the M dimension, 4 in the N dimension and 1 in the K dimension. The Stride describes how these are arranged; in this case, N is the fastest moving dimension. So, SG0,SG1,SG2,SG3 are the first row, and then SG4 will be directly 'below' SG0 in the M direction. The stride (_4) tells us this: step forward 4 sub-groups to find the next SG in the M dimension.

Setting Stride<_2, _1, _0> doesn't work because then we try to step forward 2 sub-groups to move in the M dimension (but we have 4 subgroups in the N dimension). Here is a PR which would catch this error.

For your example, because you've dropped the tile size from <256, 256, 32> to <128, 128, 32>, you could consider using:

Layout<Shape<_4, _2, _1>, Stride<_2, _1, _0>>

@joeatodd
Collaborator

joeatodd commented May 15, 2025

It would also be very helpful if you could shed some light on the complex relationship between TileShape, GMemTiledCopy and TiledMMA. Right now I am not sure how to define the remaining parameters for a given tile shape.

If we have a certain TileShape (128x128x32), we must choose a combination of sub-group layout (WarpLayout argument to TiledMMAHelper) and GMemTiledCopy which 'fits'.

I believe the reason your change doesn't work is because you have (8, 4, 1):(4, 1, 0) for your WarpLayout. This means, for example, we have 8 sub-groups/warps in the M dimension, but our TileShape in that dimension is 128. This means each sub-group should load a block which is 16 (128 values / 8 subgroups) values in the M dimension, but the GmemTiledCopy you are using is XE_2D_U8x32x32_LD_V (32x32 values). Note that going the other way isn't a problem - i.e. it's totally fine to use a 32x32 block load when each sub-group must load 64x64 values, because the sub-group will perform 4 iterations of the 32x32 block load. But you can't split a block load.

So, you either need to:

  • Use a smaller number of sub-groups in that dimension (so each sub-group has 32 values to load)
  • Use a smaller block load operation (in this case 16x16). This is what @sanchitintel proposes.

Unfortunately, for the second solution, you are limited by the available block load operations. I think there isn't a U8x16x16_LD_V operation at present.

@sanchitintel

sanchitintel commented May 15, 2025

Hi @joeatodd, thanks a lot for the detailed explanation!
Please help resolve a doubt.

Use a smaller block load operation (in this case 16x16).

If the tile shape is (128, 128, 32), and there are 32 subgroups with layout (8, 4, 1),
M dimension of the subgroup tile would be 16, N would be 32, and K would also be 32.

So, would it be okay to use a 16x32 block load operation for A, and 32x32 block load operation for B?
What's the reason behind using 16x16 instead?
Thanks!

Use a smaller number of sub-groups

Thanks for the tip! Prima facie, it seems this approach may lead to better performance with a smaller tile shape on the GPU I'm using (Intel GPU Max 1550), since the hardware has 8 EUs per Xe core.

@cfgfung
Author

cfgfung commented May 15, 2025


Highly appreciate the detailed explanations, especially regarding the split block load!

@cfgfung cfgfung closed this as completed May 18, 2025
@cfgfung cfgfung reopened this May 19, 2025