add gemm with rmsnorm #321

Open · wants to merge 7 commits into base: sycl-develop

Conversation

yuankuns

Add an example with a post-op RMSNorm after GEMM.

@aacostadiaz (Collaborator) left a comment

Looks good. I left a few minor comments.

gemm_op.run();
}
syclcompat::wait();
double io =

Suggested change
double io =
double io =

#pragma once

#include "cutlass/cutlass.h"
#include <sycl/sycl.hpp>

<sycl/sycl.hpp> is already included by cutlass/cutlass.h (via gpu_generics.h), so this include is redundant.

Suggested change
#include <sycl/sycl.hpp>

@joeatodd (Collaborator)

Hello @yuankuns. We've made some extensive naming changes since you submitted this PR, so I thought I'd help you out and provide the required changes to this branch. It's the last commit on my gemmrmsnorm-updates branch. That should fix the CI failures 👍

@joeatodd (Collaborator) left a comment

Looks good, but I think the implementation could be made more EVT-friendly.

N * L * sizeof(ElementW));
syclcompat::wait();

constexpr float eps = 1e-5;

I think we should use options.eps here.
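
For context, this is where eps enters the computation. A minimal, self-contained reference sketch of RMSNorm (the helper name `rmsnorm_ref` is hypothetical, not from the PR), with eps passed in as a parameter the way the reviewer suggests taking it from options.eps:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (not from the PR): RMSNorm over one row.
//   y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// `eps` is the value the reviewer suggests sourcing from options.eps
// rather than hard-coding a constexpr.
std::vector<float> rmsnorm_ref(const std::vector<float>& x,
                               const std::vector<float>& weight,
                               float eps) {
  float sum_sq = 0.f;
  for (float v : x) sum_sq += v * v;
  const float inv_rms =
      1.f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    y[i] = x[i] * inv_rms * weight[i];
  return y;
}
```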

Comment on lines +310 to +315
gemm_op.can_implement(arguments);

gemm_op.initialize(arguments, workspace.get());

// Run the GEMM
gemm_op.run();

Please follow e.g. 00_pvc_gemm.cpp and ensure that the example returns an early failure if these steps fail.
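
A sketch of the early-exit pattern being asked for, assuming the CUTLASS convention that can_implement, initialize and run each return a cutlass::Status (gemm_op, arguments and workspace come from the surrounding example code):

```cpp
// Sketch only: bail out early on each step, as in 00_pvc_gemm.cpp.
cutlass::Status status = gemm_op.can_implement(arguments);
if (status != cutlass::Status::kSuccess) {
  std::cerr << "GEMM cannot be implemented with the given arguments\n";
  return -1;
}

status = gemm_op.initialize(arguments, workspace.get());
if (status != cutlass::Status::kSuccess) {
  std::cerr << "Failed to initialize the GEMM\n";
  return -1;
}

// Run the GEMM
status = gemm_op.run();
if (status != cutlass::Status::kSuccess) {
  std::cerr << "GEMM execution failed\n";
  return -1;
}
```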

using StrideWeight = Stride<_1, _0, int64_t>;
ElementWeight const* weight_ptr = nullptr;
float eps = 1e-5;
StrideWeight dWeight = {};

Is this unused?

auto loop_t = res(_, loop, _);
auto pow2_t = pow2_buff(_, loop, _);
Tensor group_sum = make_tensor<float>(make_shape(Int<vec_size>{}));
float rev_dim = 1 / (float)params.inner_dim;

Suggested change
float rev_dim = 1 / (float)params.inner_dim;
const float rev_dim = 1 / static_cast<float>(params.inner_dim);

This could also be hoisted out of the loop.
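
Purely as illustration of the hoisting being suggested (the function and names below are hypothetical; `inner_dim` stands in for params.inner_dim):

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: compute the loop-invariant reciprocal once, before the
// loop, then multiply per element instead of dividing per iteration.
std::vector<float> scale_rows(const std::vector<float>& sums, int inner_dim) {
  const float rev_dim = 1.f / static_cast<float>(inner_dim);  // hoisted invariant
  std::vector<float> means(sums.size());
  for (std::size_t i = 0; i < sums.size(); ++i)
    means[i] = sums[i] * rev_dim;
  return means;
}
```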

Comment on lines +213 to +215
int gx = syclcompat::global_id::x() % 256;
int gy = syclcompat::global_id::y();
auto gid = gx / 16 * 32 + gx % 16;

A comment to explain why these calculations are being performed would be useful. Why 256, 32, 16?
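
For reference, the arithmetic in question remaps each run of 16 consecutive ids onto a 32-element stride; whether 16, 32 and 256 correspond to subgroup size, tile width and work-group size is an assumption here, which is exactly what the reviewer asks to be documented:

```cpp
// Illustrative only: gid = gx / 16 * 32 + gx % 16 places each block of 16
// consecutive ids at a 32-element stride, leaving a 16-wide gap between
// blocks (ids 0..15 map to 0..15, ids 16..31 map to 32..47, and so on).
// The meaning of the constants is an assumption and should be documented
// in the PR itself.
int remap(int gx) { return gx / 16 * 32 + gx % 16; }
```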


Also I think gy is unused.

}
CUTLASS_PRAGMA_UNROLL
for (int i = 0; i < Epi_N / IntelPVCEpilogue::SubgroupSize; i++) {
const float wgt_per_col = (float)wgt_ptr[gid + i * IntelPVCEpilogue::SubgroupSize];

Loading this weight data here is kind of an anti-pattern in the context of EVT epilogues. There is a specific EVT operation for this: XeRowBroadcast, which will load data once and broadcast it as required. For example, the Linear Combination With Per Column Bias is defined as:

// template args...
using XeLinCombPerColBias =
  Sm90EVT<Sm90Compute<homogeneous_multiply_add, ElementOutput, ElementCompute, RoundStyle>, // beta * C + (alpha * acc + bias)
    Sm90ScalarBroadcast<ElementScalar, Stride<_0,_0,int64_t>>, // beta
    Sm90SrcFetch<ElementSource>, // C
    Sm90EVT<Sm90Compute<homogeneous_multiply_add, ElementCompute, ElementCompute, RoundStyle>, // alpha * acc + bias
      Sm90ScalarBroadcast<ElementScalar, Stride<_0,_0,int64_t>>, // alpha
      Sm90AccFetch, // acc
      XeRowBroadcast<0, CtaTileShapeMNK, ElementBias, ElementCompute, Stride<_0,_1,int64_t>, AlignmentBias> // bias
    >
  >;

Since for RMSNorm the multiplication by the weight is effectively an independent calculation, this approach could be accomplished by:

  1. Removing all references to the weight from XeRMSNormRowReduction.
  2. Defining an outer layer in your Sm90EVT definition which does an Sm90Compute<multiplies,...>, taking XeRMSNormRowReduction and XeRowBroadcast as inputs.

Taking this approach has the advantages that:

  • It will be generally correct, regardless of thread_idx layout, etc.
  • We have one fewer 'load' operation in the library to optimize
  • The RMSNorm operation has fewer responsibilities, and could more easily be generalized (e.g. to reduce over more/different dimensions) in future
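
A sketch of that composition, modeled on the XeLinCombPerColBias example above; the element types, alignment and the exact XeRMSNormRowReduction signature are placeholders, not the PR's actual definitions:

```cpp
// Sketch only: outer multiplies node combining the (weight-free) RMSNorm
// reduction with a broadcast load of the weight row, per the suggestion above.
using XeRMSNormWithWeight =
  Sm90EVT<Sm90Compute<multiplies, ElementOutput, ElementCompute, RoundStyle>, // rmsnorm(acc) * weight
    XeRMSNormRowReduction</* ...with all weight references removed... */>,    // normalization only
    XeRowBroadcast<0, CtaTileShapeMNK, ElementWeight, ElementCompute,
                   Stride<_0,_1,int64_t>, AlignmentWeight>                    // weight, loaded once
  >;
```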

4 participants