split cpu parts in permute_pooled_embedding_ops for cpu_only #987
Conversation
Hi @RabbitWhite1! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA Signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
cc @geyyer : please take a look at this PR. |
Fixed duplicated m.def in permute_pooled_embedding_ops_gpu.cpp |
@RabbitWhite1, thanks for contributing and noticing the possible issue with the CPU-only build! To avoid compatibility issues, we have to create an empty file permute_pooled_embs_function.h and then move the class template PermutePooledEmbsFunction there. The PR looks good; could you also move the permute_pooled_embedding_ops_utils.h contents to permute_pooled_embs_function.h and fix the dependencies? |
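For reference, a rough sketch of what such a permute_pooled_embs_function.h might declare. The template parameter and the argument list below are assumptions for illustration, not the exact FBGEMM interface:

```cpp
// permute_pooled_embs_function.h (sketch; exact interface is an assumption)
#pragma once

#include <ATen/ATen.h>
#include <torch/csrc/autograd/custom_function.h>

namespace fbgemm_gpu {

// Autograd wrapper shared by the CPU and GPU builds; templated on the
// underlying permute op so both backends can reuse the same forward/backward
// plumbing. Declarations only; the definitions stay with each backend.
template <typename permute_pooled_embs_op>
class PermutePooledEmbsFunction
    : public torch::autograd::Function<
          PermutePooledEmbsFunction<permute_pooled_embs_op>> {
 public:
  static torch::autograd::variable_list forward(
      torch::autograd::AutogradContext* ctx,
      const at::Tensor& pooled_embs,
      const at::Tensor& offset_dim_list,
      const at::Tensor& permute_list,
      const at::Tensor& inv_offset_dim_list,
      const at::Tensor& inv_permute_list);

  static torch::autograd::variable_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::variable_list grad_output);
};

} // namespace fbgemm_gpu
```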
Great! I've moved and committed. |
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@geyyer has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Catching up with pytorch#987

Differential Revision: D35261607

fbshipit-source-id: c806f035bfcb4f8872d083431dc7819f6532b7d8
This pull request has been reverted by 386a473. |
Summary: X-link: pytorch#3896

Pull Request resolved: facebookresearch/FBGEMM#987

Fix the fp8 KV cache dequantization kernel and enable the unit test on AMD. The kernel uses each thread to dequantize 4 elements for both K and V, and each warp handles one head. The head dim is always 128, so this works on NV, where a warp has 32 threads (4 * 32 = 128). On AMD, each wavefront (warp) has 64 threads, so the second 32 threads all do out-of-bounds memory accesses. This diff simply masks those threads to do nothing. Obviously the perf is not good, but from E2E testing it does not seem to matter. If we need to optimize the perf for AMD, we can let threads 0 ~ 31 dequantize 4 elements for K and threads 32 ~ 63 dequantize 4 elements for V.

Reviewed By: Aya-ZIbra

Differential Revision: D72062745

fbshipit-source-id: 1b813057586054a13df4e9088be00b08f912bc57
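For context, a minimal CUDA C++ sketch of the masking idea described in that summary. The kernel name, signature, and placeholder dequantization are illustrative assumptions, not the actual FBGEMM kernel:

```cpp
#include <cstdint>

// Sketch of the thread-masking fix: each warp handles one head of dim 128,
// and each thread dequantizes 4 elements, so only 32 lanes are needed.
// On AMD a wavefront has 64 lanes, so lanes 32..63 must be masked off to
// avoid out-of-bounds accesses. All names here are illustrative.
constexpr int kHeadDim = 128;
constexpr int kElemsPerThread = 4;
constexpr int kActiveLanes = kHeadDim / kElemsPerThread; // 32

__global__ void dequantize_kv_head_sketch(
    const uint8_t* __restrict__ kv_quant, // quantized K or V, one head per warp
    float* __restrict__ kv_out,
    float scale) {
  const int lane = threadIdx.x % warpSize; // 32 on NV, 64 on AMD
  const int head =
      blockIdx.x * (blockDim.x / warpSize) + threadIdx.x / warpSize;

  // Mask: on AMD, lanes 32..63 would read past the 128-element head.
  if (lane >= kActiveLanes) {
    return;
  }

  const int base = head * kHeadDim + lane * kElemsPerThread;
#pragma unroll
  for (int i = 0; i < kElemsPerThread; ++i) {
    // Placeholder dequantization; the real kernel decodes fp8 properly.
    kv_out[base + i] = static_cast<float>(kv_quant[base + i]) * scale;
  }
}
```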
As specified in CMakeLists.txt, "src/permute_pooled_embedding_ops_gpu.cpp" is only compiled when "NOT FBGEMM_CPU_ONLY", which means the operator "permute_pooled_embs_auto_grad" is not registered when building with --cpu_only. However, this operator is used by torchrec's column-wise sharding.
The PR mentioned in #950 cannot work because it does not use m.def to define permute_pooled_embs_auto_grad.
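For illustration, a hedged sketch of the kind of m.def/m.impl registration that keeps the operator available in a CPU-only build. The schema string and the stub implementation are assumptions, not the code in this PR:

```cpp
// Sketch only: declare the operator with m.def and bind a CPU kernel with
// m.impl, so the op is still registered when FBGEMM_CPU_ONLY is set.
#include <ATen/ATen.h>
#include <torch/library.h>

namespace {

// Hypothetical CPU entry point; the real implementation lives in FBGEMM.
at::Tensor permute_pooled_embs_auto_grad_cpu(
    const at::Tensor& pooled_embs,
    const at::Tensor& offset_dim_list,
    const at::Tensor& permute_list,
    const at::Tensor& inv_offset_dim_list,
    const at::Tensor& inv_permute_list) {
  // Placeholder body for the sketch.
  return pooled_embs.clone();
}

} // namespace

TORCH_LIBRARY_FRAGMENT(fbgemm, m) {
  // The schema must be declared with m.def; without it the operator cannot
  // be called from Python/TorchRec at all, which is the problem described
  // above and what the PR referenced in #950 was missing.
  m.def(
      "permute_pooled_embs_auto_grad(Tensor pooled_embs, Tensor offset_dim_list, "
      "Tensor permute_list, Tensor inv_offset_dim_list, Tensor inv_permute_list) "
      "-> Tensor");
}

TORCH_LIBRARY_IMPL(fbgemm, CPU, m) {
  // Bind the CPU kernel; a GPU-only source file can separately register a
  // CUDA implementation for the same schema.
  m.impl("permute_pooled_embs_auto_grad", permute_pooled_embs_auto_grad_cpu);
}
```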