
[QST] Ambiguous error message when tuning the Tileshape #376


Open
cfgfung opened this issue May 14, 2025 · 7 comments
Labels
question Further information is requested

Comments

@cfgfung

cfgfung commented May 14, 2025

This happened when I was trying to change the tile shape of example 08_pvc_gemm_fp8.cpp.

I modified lines 363 and 368 with the following code:

using TileShape = Shape<_128, _128, _32>;  // changed the M and N tile shapes

  using TiledMma =
      typename TiledMMAHelper<MMA_Atom<XE_8x16x16_F32F16F16F32_TT>, Layout<TileShape>,
      Layout<Shape<_8, _4, _1>, Stride<_2, _1, _0>>>::TiledMMA; // changed the stride from 4 to 2 to adapt the tile shape

Then it gave the following error message:

/include/cute/atom/copy_traits_xe.hpp:441:17: error: static assertion failed due to requirement 'size(res) > 0': Error in make_fragment_layout(), tile size might be smaller than copy atom
441 | static_assert(size(res) > 0, "Error in make_fragment_layout(), tile size might be smaller than copy atom");

The shape of the specified GMem tile is 32x32 (XE_2D_U8x32x32_LD_V), which is larger than the MMA_Atom size (XE_8x16x16_F32F16F16F32_TT).

What is the true meaning of this error message?

It would also be very helpful if you could shed some light on the complex relationship between TileShape, GMemTiledCopy and TiledMMA. Right now I am not sure how to define the remaining parameters for a given tile shape.

@cfgfung cfgfung added the question Further information is requested label May 14, 2025
@sanchitintel

changed the stride from 4 to 2 to adapt the tileshape

Sorry, but why is this change needed?

@cfgfung
Author

cfgfung commented May 15, 2025

changed the stride from 4 to 2 to adapt the tileshape

Sorry, but why is this change needed?

Thanks for the prompt response. I think I confused it with the definition of Nvidia's TiledMMA. Should this be 4 in this case?

@sanchitintel

sanchitintel commented May 15, 2025

Should this be 4 in this case?

Yes, I think so.

FWIW, the following change works by modifying the BF16xBF16 GEMM example -

[screenshot: the modified BF16xBF16 GEMM example]

but a similar change for FP8 (with U16 copy atoms changed to U8 copy atoms) doesn't work by modifying the FP8xFP8 GEMM because of some error related to copy atoms. It'd be great if it could be resolved, though, so that we may be able to use smaller TiledMMA tile-sizes with FP8 GEMMs.

@joeatodd
Collaborator

Hello,

Just to briefly explain:

Layout<Shape<_8, _4, _1>, Stride<_4, _1, _0>>

This is a Layout describing how many sub-groups we have in our work-group, and how they are arranged in M,N,K dimensions. In this example we have 8x4x1 (32) subgroups, with 8 in the M dimension, 4 in the N dimension and 1 in the K dimension. The Stride describes how these are arranged; in this case, N is the fastest moving dimension. So, SG0,SG1,SG2,SG3 are the first row, and then SG4 will be directly 'below' SG0 in the M direction. The stride (_4) tells us this: step forward 4 sub-groups to find the next SG in the M dimension.

Setting Stride<_2, _1, _0> doesn't work because then we try to step forward 2 sub-groups to move in the M dimension (but we have 4 subgroups in the N dimension). Here is a PR which would catch this error.

For your example, because you've dropped the tile size from <256, 256, 32> to <128, 128, 32>, you could consider using:

Layout<Shape<_4, _2, _1>, Stride<_2, _1, _0>>

@joeatodd
Collaborator

joeatodd commented May 15, 2025

It would also be very helpful if you could shed some light on the complex relationship between TileShape, GMemTiledCopy and TiledMMA. Right now I am not sure how to define the remaining parameters for a given tile shape.

If we have a certain TileShape (128x128x32), we must choose a combination of sub-group layout (WarpLayout argument to TiledMMAHelper) and GMemTiledCopy which 'fits'.

I believe the reason your change doesn't work is because you have (8, 4, 1):(4, 1, 0) for your WarpLayout. This means, for example, we have 8 sub-groups/warps in the M dimension, but our TileShape in that dimension is 128. This means each sub-group should load a block which is 16 (128 values / 8 subgroups) values in the M dimension, but the GmemTiledCopy you are using is XE_2D_U8x32x32_LD_V (32x32 values). Note that going the other way isn't a problem - i.e. it's totally fine to use a 32x32 block load when each sub-group must load 64x64 values, because the sub-group will perform 4 iterations of the 32x32 block load. But you can't split a block load.

So, you either need to:

  • Use a smaller number of sub-groups in that dimension (so each sub-group has 32 values to load)
  • Use a smaller block load operation (in this case 16x16). This is what @sanchitintel proposes.

Unfortunately, for the second solution, you are limited by the available block load operations. I think there isn't a U8x16x16_LD_V operation at present.

@sanchitintel

sanchitintel commented May 15, 2025

Hi @joeatodd, thanks a lot for the detailed explanation!
Please help resolve a doubt.

Use a smaller block load operation (in this case 16x16).

If the tile shape is (128, 128, 32), and there are 32 subgroups with layout (8, 4, 1),
M dimension of the subgroup tile would be 16, N would be 32, and K would also be 32.

So, would it be okay to use a 16x32 block load operation for A, and 32x32 block load operation for B?
What's the reason behind using 16x16 instead?
Thanks!

Use a smaller number of sub-groups

Thanks for the tip! Prima facie, it seems this approach may lead to better performance with a smaller tile shape on the GPU I'm using (Intel GPU Max 1550), since the hardware has 8 EUs per Xe core.

@cfgfung
Author

cfgfung commented May 15, 2025


Highly appreciate the detailed explanations, especially regarding the split block load!

@cfgfung cfgfung closed this as completed May 18, 2025
@cfgfung cfgfung reopened this May 19, 2025