Add output_dtype to training TBE op for CPU #829

shintaro-iwasaki · 2021-12-23T23:28:09Z

Summary: This Diff adds output_dtype to split_embedding_codegen_lookup_{{ optimizer }}_function_cpu(). Note that the CUDA version (split_embedding_codegen_lookup_{{ optimizer }}_function()) already has this argument (D32399931 (7e1183c)).

Reviewed By: jianyuh

Differential Revision: D32969921

Summary: This Diff adds `output_dtype` to `split_embedding_codegen_lookup_{{ optimizer }}_function_cpu()`. Note that the CUDA version (`split_embedding_codegen_lookup_{{ optimizer }}_function()`) already has this argument (D32399931 (pytorch@7e1183c)). Reviewed By: jianyuh Differential Revision: D32969921 fbshipit-source-id: ff8870bc6ac7f659efae8a5330d635a5ac06cf9f

facebook-github-bot · 2021-12-23T23:28:34Z

This pull request was exported from Phabricator. Differential Revision: D32969921

Summary: X-link: pytorch#3748 Pull Request resolved: facebookresearch/FBGEMM#829 ` b_t_map` contains information of batch (`b`) and feature (`t`). `info_B_num_bits` tells how many bits are used to cover batch information and is currently computed each iteration given the batch size. The `info_B_num_bits` calculation is problematic when `max_B_` is symbolic, causing issues with eagerAOT mode. If `max_B_` is symbolic, `info_B_num_bits` is not recomputed and uses the default value which can fail or if there is not enough bits for B. To resolve the issues, we can make `info_B_num_bits` constant. Current implementation adjusts `info_B_num_bits` based on the batch size, causing it to change every iteration. Fixing the values may cause the aforementioned issue of having insufficient bits for B. This diff implements `get_info_B_num_bits_from_T` to make `info_B_num_bits` constant. We first calculate how many bits required to cover `T` information, as number of features are known at TBE initialization and will remain the same throughout the run. The rest of the bits will be for `B` information. Since `info_T_num_bits` remains the same, `info_B_num_bits` remains the same. If there's not enough bits for B, it will fail. In V1 interface, since we hit the limit for the maximum number of arguments, we keep the interface the same. In V2 interface (next diff), we compute `info_B_num_bits` and `info_B_mask` once, store them as module parameters, and pass them to lookup and corresponding Autograd and backend functions. Reviewed By: sryap Differential Revision: D69387123 fbshipit-source-id: 3ddc1a3e3822525db907a58d66fedc4148e7bc7a

facebook-github-bot added cla signed fb-exported labels Dec 23, 2021

facebook-github-bot closed this in d1907f1 Dec 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add output_dtype to training TBE op for CPU #829

Add output_dtype to training TBE op for CPU #829

Uh oh!

shintaro-iwasaki commented Dec 23, 2021

Uh oh!

facebook-github-bot commented Dec 23, 2021

Uh oh!

Uh oh!

Add output_dtype to training TBE op for CPU #829

Add output_dtype to training TBE op for CPU #829

Uh oh!

Conversation

shintaro-iwasaki commented Dec 23, 2021

Uh oh!

facebook-github-bot commented Dec 23, 2021

Uh oh!

Uh oh!