
optimize embedding bag #1726

Open · wants to merge 7 commits into main

Conversation

@jianyizh (Contributor) commented Jun 9, 2025

  1. Remove SYCL_KERNEL_ASSERT; this changes the GRF mode from 256 to 128. However, there is an existing issue (#1052: Embedding_bag_out does not have a boundary check and causes an IPEX UT failure), so I did not remove it in this PR. We should add the NDEBUG flag later or use vec_size = 4.
  2. I saw instruction fetch stalls caused by the if branches, so I moved them into template parameters.
  3. I also fixed vectorization; previously it was not actually enabled.
  4. Previously we used only 256 threads per workgroup, although the maximum workgroup size is 1024.

Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

| | PVC | BMG |
| --- | --- | --- |
| main branch | 0.18 ms | 0.43 ms |
| remove sycl assert | 0.10 ms | 0.30 ms |
| remove branching | 0.08 ms | 0.28 ms |
| tiling | 0.087 ms | 0.22 ms |

Note: we are stalled at `vec_t other = w_vec_[i_off];` when the vector size is 8. The generated assembly is four scalar loads (`load.ugm.d32.a64`, `load.ugm.d32.a64.flat[A+0x4]`, `load.ugm.d32.a64.flat[A+0x8]`, `load.ugm.d32.a64.flat[A+0xC]`); after the fix it becomes a single `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling: in this case there would be only 32 workgroups for 64 Xe cores. However, even if we set vec_size = 4, tiling 2 bags is still a regression. The best config on PVC is vec_size = 4 with workgroup size = 512, which can reach 0.71ms. There is no benefit on BMG from setting a smaller workgroup size.

@jianyizh jianyizh force-pushed the jianyi/embed_bag branch from eada1a6 to b566ba6 on June 9, 2025 04:53
@jianyizh jianyizh requested review from xytintel and EikanWang June 9, 2025 06:36
@jianyizh jianyizh marked this pull request as ready for review June 9, 2025 06:36
@Copilot Copilot AI review requested due to automatic review settings June 9, 2025 06:36
@Copilot Copilot AI left a comment
Pull Request Overview

This PR optimizes the SYCL embedding bag kernel by removing the SYCL_KERNEL_ASSERT (changing GRF mode) and moving branch logic into template parameters to reduce instruction fetch stalls.

  • Added per_sample_weights_defined and padding_idx_defined as compile-time template booleans and refactored the inner loop into a lambda.
  • Removed runtime branching for padding and per-sample weights and replaced it with if constexpr.
  • Expanded embedding_bag template invocations in the .cpp file to dispatch all four boolean combinations.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Added template flags and lambda refactoring; replaced runtime `if` with `if constexpr`. |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Expanded the macro-based embedding_bag invocations into a four-case dispatch. |

Comments suppressed due to low confidence (1)
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/EmbeddingBag.h:60

  • [nitpick] The lambda name handle_non_padding is descriptive but could be more precise—e.g., applyNonPaddingOperation or processEmbeddingEntry—to clearly convey its role.
```cpp
auto handle_non_padding = [&]() {
  for (index_t off = start; off < end; off++) {
    index_off = off;
    vec_idx = index_[index_off];
    // SYCL_KERNEL_ASSERT(vec_idx < num_row_);
```
Copilot AI commented Jun 9, 2025:
[nitpick] Remove the commented-out SYCL_KERNEL_ASSERT line or replace it with a compile-time check if needed, to keep the codebase clean and avoid dead code.


Comment on lines +121 to +137
```cpp
if (per_sample_weights.defined() && padding_idx != -1)                        \
  embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>(  \
      output.mutable_data_ptr<scalar_t>(),                                    \
      weight.const_data_ptr<scalar_t>(),                                      \
      indices.const_data_ptr<index_t>(),                                      \
      offsets.const_data_ptr<index_t>(),                                      \
      offset2bag.mutable_data_ptr<index_t>(),                                 \
      bag_size.mutable_data_ptr<index_t>(),                                   \
      max_indices.mutable_data_ptr<index_t>(),                                \
      per_sample_weights.const_data_ptr<scalar_t>(),                          \
      index_size,                                                             \
      bag_num,                                                                \
      vec_len,                                                                \
      padding_idx,                                                            \
      ignore_offsets,                                                         \
      num_row);                                                               \
else if (!per_sample_weights.defined() && padding_idx != -1)                  \
```
Copilot AI commented Jun 9, 2025:
The four-way if/else dispatch for template instantiations leads to a lot of duplicated arguments and boilerplate; consider introducing a small dispatch helper or using parameter packs to select the true/false template flags, reducing code duplication and improving maintainability.


@pytorchxpubot
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result, for reference only; each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_nn_xpu.TestNN test_LayerNorm_3d_no_affine_large_feature_cuda failed with error message: AssertionError: Tensor-likes are not close!

Triage bot response:

```json
{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}
```

@jianyizh jianyizh changed the title optimize embedding bag [dont merge] optimize embedding bag Jun 17, 2025
@jianyizh jianyizh changed the title [dont merge] optimize embedding bag optimize embedding bag Jun 19, 2025