split cpu parts in permute_pooled_embedding_ops for cpu_only #987
Conversation
Hi @RabbitWhite1! Thank you for your pull request and welcome to our community.

Action Required: In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA Signed. If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
cc @geyyer : please take a look at this PR. |
Fixed duplicated m.def in permute_pooled_embedding_ops_gpu.cpp |
@RabbitWhite1, thanks for contributing and noticing the possible issue with the CPU-only build! To avoid compatibility issues, we have to create an empty file permute_pooled_embs_function.h and then move the class template PermutePooledEmbsFunction there. The PR looks good; could you also move the permute_pooled_embedding_ops_utils.h contents to permute_pooled_embs_function.h and fix the dependencies? |
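For reference, a rough sketch of what such a permute_pooled_embs_function.h might declare. The template parameter and the argument list below are assumptions for illustration, not the exact FBGEMM interface:

```cpp
// permute_pooled_embs_function.h (sketch; exact interface is an assumption)
#pragma once

#include <ATen/ATen.h>
#include <torch/csrc/autograd/custom_function.h>

namespace fbgemm_gpu {

// Autograd wrapper shared by the CPU and GPU builds; templated on the
// underlying permute op so both backends can reuse the same forward/backward
// plumbing. Declarations only; the definitions stay with each backend.
template <typename permute_pooled_embs_op>
class PermutePooledEmbsFunction
    : public torch::autograd::Function<
          PermutePooledEmbsFunction<permute_pooled_embs_op>> {
 public:
  static torch::autograd::variable_list forward(
      torch::autograd::AutogradContext* ctx,
      const at::Tensor& pooled_embs,
      const at::Tensor& offset_dim_list,
      const at::Tensor& permute_list,
      const at::Tensor& inv_offset_dim_list,
      const at::Tensor& inv_permute_list);

  static torch::autograd::variable_list backward(
      torch::autograd::AutogradContext* ctx,
      torch::autograd::variable_list grad_output);
};

} // namespace fbgemm_gpu
```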
Great! I've moved and committed. |
@jianyuh has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@geyyer has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: Catching up with pytorch#987

Differential Revision: D35261607

fbshipit-source-id: c806f035bfcb4f8872d083431dc7819f6532b7d8
This pull request has been reverted by 386a473. |
Summary: X-link: pytorch#3896

Pull Request resolved: facebookresearch/FBGEMM#987

Fix the fp8 KV cache dequantization kernel and enable the unit test on AMD. The kernel uses each thread to dequantize 4 elements for both K and V, and each warp handles one head. The head dim is always 128, so this works on NV, where a warp has 32 threads (4 * 32 = 128). On AMD, each wavefront (warp) has 64 threads, so the second 32 threads all do out-of-bounds memory accesses. This diff simply masks those threads to do nothing. Obviously the perf is not good, but from E2E testing it does not seem to matter. If we need to optimize the perf for AMD, we can let threads 0 ~ 31 dequantize 4 elements for K and threads 32 ~ 63 dequantize 4 elements for V.

Reviewed By: Aya-ZIbra

Differential Revision: D72062745

fbshipit-source-id: 1b813057586054a13df4e9088be00b08f912bc57
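For context, a minimal CUDA C++ sketch of the masking idea described in that summary. The kernel name, signature, and placeholder dequantization are illustrative assumptions, not the actual FBGEMM kernel:

```cpp
#include <cstdint>

// Sketch of the thread-masking fix: each warp handles one head of dim 128,
// and each thread dequantizes 4 elements, so only 32 lanes are needed.
// On AMD a wavefront has 64 lanes, so lanes 32..63 must be masked off to
// avoid out-of-bounds accesses. All names here are illustrative.
constexpr int kHeadDim = 128;
constexpr int kElemsPerThread = 4;
constexpr int kActiveLanes = kHeadDim / kElemsPerThread; // 32

__global__ void dequantize_kv_head_sketch(
    const uint8_t* __restrict__ kv_quant, // quantized K or V, one head per warp
    float* __restrict__ kv_out,
    float scale) {
  const int lane = threadIdx.x % warpSize; // 32 on NV, 64 on AMD
  const int head =
      blockIdx.x * (blockDim.x / warpSize) + threadIdx.x / warpSize;

  // Mask: on AMD, lanes 32..63 would read past the 128-element head.
  if (lane >= kActiveLanes) {
    return;
  }

  const int base = head * kHeadDim + lane * kElemsPerThread;
#pragma unroll
  for (int i = 0; i < kElemsPerThread; ++i) {
    // Placeholder dequantization; the real kernel decodes fp8 properly.
    kv_out[base + i] = static_cast<float>(kv_quant[base + i]) * scale;
  }
}
```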
As specified in CMakeLists.txt, "src/permute_pooled_embedding_ops_gpu.cpp" is only compiled when "NOT FBGEMM_CPU_ONLY", which means the operator "permute_pooled_embs_auto_grad" is not registered when building with --cpu_only. However, this operator is used by torchrec's column-wise sharding.
The PR mentioned in #950 cannot work because it does not use m.def to define permute_pooled_embs_auto_grad.
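For illustration, a hedged sketch of the kind of m.def/m.impl registration that keeps the operator available in a CPU-only build. The schema string and the stub implementation are assumptions, not the code in this PR:

```cpp
// Sketch only: declare the operator with m.def and bind a CPU kernel with
// m.impl, so the op is still registered when FBGEMM_CPU_ONLY is set.
#include <ATen/ATen.h>
#include <torch/library.h>

namespace {

// Hypothetical CPU entry point; the real implementation lives in FBGEMM.
at::Tensor permute_pooled_embs_auto_grad_cpu(
    const at::Tensor& pooled_embs,
    const at::Tensor& offset_dim_list,
    const at::Tensor& permute_list,
    const at::Tensor& inv_offset_dim_list,
    const at::Tensor& inv_permute_list) {
  // Placeholder body for the sketch.
  return pooled_embs.clone();
}

} // namespace

TORCH_LIBRARY_FRAGMENT(fbgemm, m) {
  // The schema must be declared with m.def; without it the operator cannot
  // be called from Python/TorchRec at all, which is the problem described
  // above and what the PR referenced in #950 was missing.
  m.def(
      "permute_pooled_embs_auto_grad(Tensor pooled_embs, Tensor offset_dim_list, "
      "Tensor permute_list, Tensor inv_offset_dim_list, Tensor inv_permute_list) "
      "-> Tensor");
}

TORCH_LIBRARY_IMPL(fbgemm, CPU, m) {
  // Bind the CPU kernel; a GPU-only source file can separately register a
  // CUDA implementation for the same schema.
  m.impl("permute_pooled_embs_auto_grad", permute_pooled_embs_auto_grad_cpu);
}
```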