
optimize embedding bag #1726

Open · wants to merge 7 commits into main

Conversation

@jianyizh (Contributor) commented Jun 9, 2025

  1. Remove SYCL_KERNEL_ASSERT; this changes the GRF mode from 256 to 128. However, there is an existing issue (#1052: Embedding_bag_out does not have a boundary check and causes an IPEX UT failure), so I did not remove it in this PR. We should add the NDEBUG flag later or use vec_size = 4.
  2. I saw instruction fetch stalls caused by the if branches, so I moved them into template parameters.
  3. I also fixed vectorization; previously it was not actually enabled.
  4. Previously we used only 256 threads per workgroup, although the maximum workgroup size is 1024.

Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

| | PVC | BMG |
| --- | --- | --- |
| main branch | 0.18 ms | 0.43 ms |
| remove sycl assert | 0.10 ms | 0.30 ms |
| remove branching | 0.08 ms | 0.28 ms |
| tiling | 0.087 ms | 0.22 ms |

Note: we are stalled at `vec_t other = w_vec_[i_off];` when the vector size is 8. The generated assembly is four scalar loads (`load.ugm.d32.a64`, `load.ugm.d32.a64.flat[A+0x4]`, `load.ugm.d32.a64.flat[A+0x8]`, `load.ugm.d32.a64.flat[A+0xC]`); after the fix it becomes a single `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling: in this case there would be only 32 workgroups for 64 Xe cores. However, even if we set vec_size = 4, tiling 2 bags is still a regression. The best config on PVC is vec_size = 4 with workgroup size = 512, which can reach 0.71ms. There is no benefit on BMG from setting a smaller workgroup size.

@jianyizh jianyizh force-pushed the jianyi/embed_bag branch from eada1a6 to b566ba6 on June 9, 2025 04:53
@jianyizh jianyizh requested review from xytintel and EikanWang June 9, 2025 06:36
@jianyizh jianyizh marked this pull request as ready for review June 9, 2025 06:36
@Copilot Copilot AI review requested due to automatic review settings June 9, 2025 06:36
@Copilot Copilot AI left a comment
Pull Request Overview

This PR optimizes the SYCL embedding bag kernel by removing the SYCL_KERNEL_ASSERT (changing GRF mode) and moving branch logic into template parameters to reduce instruction fetch stalls.

  • Added per_sample_weights_defined and padding_idx_defined as compile-time template booleans and refactored the inner loop into a lambda.
  • Removed runtime branching for padding and per-sample weights and replaced it with if constexpr.
  • Expanded embedding_bag template invocations in the .cpp file to dispatch all four boolean combinations.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Added template flags and lambda refactoring; replaced runtime `if` with `if constexpr`. |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Expanded the macro-based embedding_bag invocations into a four-case dispatch. |

Comments suppressed due to low confidence (1)
Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/EmbeddingBag.h:60

  • [nitpick] The lambda name handle_non_padding is descriptive but could be more precise—e.g., applyNonPaddingOperation or processEmbeddingEntry—to clearly convey its role.
```cpp
auto handle_non_padding = [&]() {
  for (index_t off = start; off < end; off++) {
    index_off = off;
    vec_idx = index_[index_off];
    // SYCL_KERNEL_ASSERT(vec_idx < num_row_);
```
Copilot AI commented Jun 9, 2025:
[nitpick] Remove the commented-out SYCL_KERNEL_ASSERT line or replace it with a compile-time check if needed, to keep the codebase clean and avoid dead code.


Comment on lines +121 to +137
```cpp
if (per_sample_weights.defined() && padding_idx != -1)                        \
  embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>(  \
      output.mutable_data_ptr<scalar_t>(),                                    \
      weight.const_data_ptr<scalar_t>(),                                      \
      indices.const_data_ptr<index_t>(),                                      \
      offsets.const_data_ptr<index_t>(),                                      \
      offset2bag.mutable_data_ptr<index_t>(),                                 \
      bag_size.mutable_data_ptr<index_t>(),                                   \
      max_indices.mutable_data_ptr<index_t>(),                                \
      per_sample_weights.const_data_ptr<scalar_t>(),                          \
      index_size,                                                             \
      bag_num,                                                                \
      vec_len,                                                                \
      padding_idx,                                                            \
      ignore_offsets,                                                         \
      num_row);                                                               \
else if (!per_sample_weights.defined() && padding_idx != -1)                  \
```
Copilot AI commented Jun 9, 2025:
The four-way if/else dispatch for template instantiations leads to a lot of duplicated arguments and boilerplate; consider introducing a small dispatch helper or using parameter packs to select the true/false template flags, reducing code duplication and improving maintainability.


@pytorchxpubot
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result, for reference only; each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_nn_xpu.TestNN test_LayerNorm_3d_no_affine_large_feature_cuda failed with error message: AssertionError: Tensor-likes are not close!

Triage bot response:

```json
{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}
```

@jianyizh jianyizh changed the title optimize embedding bag [dont merge] optimize embedding bag Jun 17, 2025
@jianyizh jianyizh changed the title [dont merge] optimize embedding bag optimize embedding bag Jun 19, 2025