Skip to content

Add output_dtype to training TBE op for CPU #829

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

shintaro-iwasaki
Copy link
Contributor

Summary: This Diff adds output_dtype to split_embedding_codegen_lookup_{{ optimizer }}_function_cpu(). Note that the CUDA version (split_embedding_codegen_lookup_{{ optimizer }}_function()) already has this argument (D32399931 (7e1183c)).

Reviewed By: jianyuh

Differential Revision: D32969921

Summary: This Diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)).

Reviewed By: jianyuh

Differential Revision: D32969921

fbshipit-source-id: ff8870bc6ac7f659efae8a5330d635a5ac06cf9f
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D32969921

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
X-link: pytorch#3748

Pull Request resolved: facebookresearch/FBGEMM#829

` b_t_map` contains information of batch (`b`) and feature (`t`). `info_B_num_bits` tells how many bits are used to cover batch information and is currently computed each iteration given the batch size.

The `info_B_num_bits` calculation is problematic when `max_B_` is symbolic, causing issues with eagerAOT mode. If `max_B_` is symbolic, `info_B_num_bits`  is not recomputed and uses the default value which can fail or if there is not enough bits for B.

To resolve the issues, we can make `info_B_num_bits` constant. Current implementation adjusts `info_B_num_bits` based on the batch size, causing it to change every iteration. Fixing the values may cause the aforementioned issue of having insufficient bits for B.

This diff implements `get_info_B_num_bits_from_T` to make `info_B_num_bits` constant. We first calculate how many bits required to cover `T` information, as number of features are known at TBE initialization and will remain the same throughout the run. The rest of the bits will be for `B` information. Since `info_T_num_bits` remains the same, `info_B_num_bits` remains the same. If there's not enough bits for B, it will fail.

In V1 interface, since we hit the limit for the maximum number of arguments, we keep the interface the same.

In V2 interface (next diff), we compute `info_B_num_bits` and `info_B_mask` once, store them as module parameters, and pass them to lookup and corresponding Autograd and backend functions.

Reviewed By: sryap

Differential Revision: D69387123

fbshipit-source-id: 3ddc1a3e3822525db907a58d66fedc4148e7bc7a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants