optimize embedding bag #1726
base: main
Conversation
The branch was force-pushed from eada1a6 to b566ba6.
Pull Request Overview
This PR optimizes the SYCL embedding bag kernel by removing the SYCL_KERNEL_ASSERT (which changes the GRF mode) and by moving branch logic into template parameters to reduce instruction fetch stalls.
- Added `per_sample_weights_defined` and `padding_idx_defined` as compile-time template booleans and refactored the inner loop into a lambda.
- Removed runtime branching for padding and per-sample weights and replaced it with `if constexpr` (a minimal sketch of this pattern follows the list).
- Expanded the `embedding_bag` template invocations in the `.cpp` file to dispatch all four boolean combinations.
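For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the approach described above. It is not the PR's actual kernel; `accumulate_bag` and its parameters are illustrative placeholders. The point is that the two runtime flags become template booleans, so `if constexpr` discards the untaken branch in each instantiation and the inner loop carries no per-iteration branching.

```cpp
#include <cstdint>

// Sketch only: the real kernel operates on SYCL device pointers and lambdas.
template <typename scalar_t, bool per_sample_weights_defined,
          bool padding_idx_defined>
void accumulate_bag(const scalar_t* weight, const int64_t* indices,
                    const scalar_t* per_sample_weights, int64_t start,
                    int64_t end, int64_t padding_idx, scalar_t& acc) {
  for (int64_t off = start; off < end; ++off) {
    const int64_t vec_idx = indices[off];
    if constexpr (padding_idx_defined) {
      if (vec_idx == padding_idx)
        continue;  // skip padded entries only when the flag is compiled in
    }
    scalar_t w = weight[vec_idx];
    if constexpr (per_sample_weights_defined) {
      w *= per_sample_weights[off];  // branch resolved at compile time
    }
    acc += w;
  }
}
```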
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
File | Description
---|---
src/ATen/native/xpu/sycl/EmbeddingBag.h | Added template flags, refactored the inner loop into a lambda, and replaced the runtime `if` with `if constexpr`.
src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Expanded the macro-based `embedding_bag` invocations into a four-case dispatch.
Comments suppressed due to low confidence (1)
src/ATen/native/xpu/sycl/EmbeddingBag.h:60
- [nitpick] The lambda name `handle_non_padding` is descriptive but could be more precise, e.g., `applyNonPaddingOperation` or `processEmbeddingEntry`, to clearly convey its role.
auto handle_non_padding = [&]() {
  for (index_t off = start; off < end; off++) {
    index_off = off;
    vec_idx = index_[index_off];
    // SYCL_KERNEL_ASSERT(vec_idx < num_row_);
[nitpick] Remove the commented-out SYCL_KERNEL_ASSERT line or replace it with a compile-time check if needed, to keep the codebase clean and avoid dead code.
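If the check is considered worth keeping, one option is to compile it in only behind a debug guard instead of leaving a commented-out line. This is a hedged sketch; `EMBEDDING_BAG_DEBUG` and `check_index` are hypothetical names, and in the real kernel SYCL_KERNEL_ASSERT would replace the plain assert used here.

```cpp
#include <cassert>
#include <cstdint>

#define EMBEDDING_BAG_DEBUG 1  // hypothetical guard; undefine for release builds

inline void check_index(int64_t vec_idx, int64_t num_row) {
#ifdef EMBEDDING_BAG_DEBUG
  assert(vec_idx < num_row);  // debug-only bounds check
#else
  (void)vec_idx;
  (void)num_row;
#endif
}
```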
if (per_sample_weights.defined() && padding_idx != -1)                        \
  embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>(  \
      output.mutable_data_ptr<scalar_t>(),                                    \
      weight.const_data_ptr<scalar_t>(),                                      \
      indices.const_data_ptr<index_t>(),                                      \
      offsets.const_data_ptr<index_t>(),                                      \
      offset2bag.mutable_data_ptr<index_t>(),                                 \
      bag_size.mutable_data_ptr<index_t>(),                                   \
      max_indices.mutable_data_ptr<index_t>(),                                \
      per_sample_weights.const_data_ptr<scalar_t>(),                          \
      index_size,                                                             \
      bag_num,                                                                \
      vec_len,                                                                \
      padding_idx,                                                            \
      ignore_offsets,                                                         \
      num_row);                                                               \
else if (!per_sample_weights.defined() && padding_idx != -1)                  \
The four-way `if/else` dispatch for template instantiations leads to a lot of duplicated arguments and boilerplate; consider introducing a small dispatch helper or using parameter packs to select the true/false template flags, reducing code duplication and improving maintainability.
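One possible shape for such a helper, as a self-contained sketch under my own naming: `kernel` and `dispatch` stand in for the real `embedding_bag` launch and its long argument list, which would then be written only once per branch rather than duplicated in full four times.

```cpp
#include <iostream>
#include <utility>

// Stand-in for the real kernel launch; the booleans are template parameters.
template <bool PerSampleWeights, bool PaddingIdxDefined, typename... Args>
void kernel(Args&&... /*args*/) {
  if constexpr (PerSampleWeights)
    std::cout << "per-sample-weights branch compiled in\n";
  if constexpr (PaddingIdxDefined)
    std::cout << "padding-idx branch compiled in\n";
}

// Maps the two runtime flags onto compile-time template flags in one place.
template <typename... Args>
void dispatch(bool per_sample_weights_defined, bool padding_idx_defined,
              Args&&... args) {
  if (per_sample_weights_defined && padding_idx_defined)
    kernel<true, true>(std::forward<Args>(args)...);
  else if (per_sample_weights_defined)
    kernel<true, false>(std::forward<Args>(args)...);
  else if (padding_idx_defined)
    kernel<false, true>(std::forward<Args>(args)...);
  else
    kernel<false, false>(std::forward<Args>(args)...);
}

int main() {
  dispatch(/*per_sample_weights_defined=*/true, /*padding_idx_defined=*/false);
}
```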
@sys_pytorchxpubot triage result for run 15561567714. The triage bot UT analysis result is for reference only; note that each unique error message is reported only once:
Triage bot response: {
"similar_issue_id": 845,
"similar_issue_state": "closed",
"issue_owner": "daisyden",
"issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
"root_causes": [
"Discrepancies in LayerNorm implementation between CUDA and XPU.",
"Potential differences in precision or kernel behavior affecting tensor outputs.",
"Misalignment in computation leading to inconsistent gradients."
],
"suggested_solutions": [
"Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
"Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
"Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
]
}
Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum.

Note: we are stalled on this load:

    vec_t other = w_vec_[i_off];

When the vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling, even though in this case there are only 32 workgroups for 64 Xe cores: even with vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with a workgroup size of 512, which reaches 0.71 ms. There is no benefit on BMG from setting a smaller workgroup size.