benchmark of fbgemm op - permute_multi_embedding #2158
Closed
Conversation
This pull request was exported from Phabricator. Differential Revision: D58906839
2f5be96 to 3bf66cd
This pull request was exported from Phabricator. Differential Revision: D58906839
Summary:

# context
* we are adding fbgemm operators for the `KT.regroup` function
* we wanted a good way to measure performance beyond the plain runtime readings
* **traces are very important for evaluating the actual performance impact**
* for example, judging only by the GPU runtime readings, the native-pytorch implementation (`_regroup_keyed_tenors`) appears to perform better than the fbgemm_gpu implementation (`KeyedTensor.regroup`)
* but the CPU/GPU traces show that the native-pytorch implementation is actually CPU-bound and has a very bad impact on overall performance

# usage
* to generate trace files in the given path (.):
```
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
```
```
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy 943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy 350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
```

# performance
```
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
```

# traces
* _regroup_keyed_tenors {F1712147044}
* KeyedTensor.regroup {F1712148863}
* KTRegroupAsDict {F1712150411}

Differential Revision: D58906521
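For readers who want to reproduce a comparable Chrome-trace JSON outside of buck, the sketch below shows one way to capture such a trace with `torch.profiler`. It is only an illustrative approximation of what the `jagged_tensor_benchmark` `--profile` path does; the helper name `trace_fn` and its warmup/step counts are assumptions made for this example, not the actual benchmark code.

```
# Hedged sketch: capture a CPU+GPU trace for an arbitrary callable and dump it
# as a Chrome-trace JSON (viewable in chrome://tracing or Perfetto).
# The helper name and the defaults below are illustrative assumptions.
import torch
from torch.profiler import ProfilerActivity, profile


def trace_fn(name: str, fn, *args, path: str = ".", steps: int = 10) -> None:
    # Warm up so one-time CUDA/cuDNN setup does not dominate the trace.
    for _ in range(3):
        fn(*args)
    torch.cuda.synchronize()

    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        record_shapes=True,
    ) as prof:
        for _ in range(steps):
            fn(*args)
        torch.cuda.synchronize()

    prof.export_chrome_trace(f"{path}/trace-{name}.json")
```

Reading such a trace side by side with the runtime numbers is what reveals, for example, that `_regroup_keyed_tenors` looks fast from GPU runtime alone but is actually CPU-bound.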
Summary:
X-link: pytorch/FBGEMM#2738

# context
* currently we have a working function `permute_pooled_embs_auto_grad` that does a full permute of KTs, including forward and backward
* it has several limitations: a) it has to be a full permute, duplicates are not supported; b) in the main [use case](https://fburl.com/code/89od0rqm) there has to be a torch.concat on the input KTs, which is not very efficient; c) the function outputs a single KT, which then requires a split operation
* there have been attempts to support duplicated outputs, but the backward doesn't work
* this diff creates a new kernel (named `permute_multi_embedding`) that supports a multiple-KT to multiple-KT mapping operation with backward support

# notes
* this diff focuses on the implementation and test of the operator
* performance analysis and benchmark are in the next diff

# operator example usage
* used in python
```
# test inputs: 3 KTs with batch_size=2048
batch_size = 2048
keys = [["f1", "f2"], ["f3", "f4", "f5"], ["f6"]]
lengths = [[96, 256], [512, 128, 768], [1024]]
values = [
    torch.randn(batch_size, sum(lens), device="cuda", requires_grad=True)
    for lens in lengths
]

# target outputs: 4 KTs with re-arranged keys (features), duplicates are allowed
groups = [["f1", "f3"], ["f2"], ["f4", "f1", "f6"], ["f1", "f5"]]

# accessorial arguments to the op/kernel
permutes, in_lengths, out_lengths = _multi_remap_to_groups(
    keys, lengths, groups
)

# arguments
outputs = torch.ops.fbgemm.permute_multi_embedding(
    values, permutes, in_lengths, out_lengths
)
```
* permutes
```
permutes = tensor(
    [
        [0, 0, 0, 0, 3, 4],   # f1
        [1, 0, 0, 3, 5, 0],   # f3
        [0, 1, 3, 0, 4, 0],   # f2
        [1, 2, 5, 0, 6, 0],   # f4
        [0, 2, 0, 6, 3, -6],  # f1
        [2, 2, 0, 9, 8, 0],   # f6
        [0, 3, 0, 0, 3, -8],  # f1
        [1, 3, 11, 3, 7, 0],  # f5
    ]
)
```

# details
1. from the example usage above, the operator takes in the following:
   a) values: List[torch.Tensor], which represents the input KTs
   b) permutes: torch.Tensor, which contains the permute information, explained below
   c) output_lengths_list: List[int], the lengths of the output tensors (KTs), needed to allocate memory on the device ahead of time
   d) in_lengths: torch.Tensor, the lengths of the input tensors, resident on the device
   e) out_lengths: torch.Tensor, the lengths of the output tensors, resident on the device
2. the operator returns a list of tensors, which represents the permuted KTs
3. `permutes` is the most critical argument of this operator:
   a) it is a 2-D tensor
   b) each row represents one key (feature) permute move
   c) a permute move = [input_tensor_id, output_tensor_id, input_start_idx, output_start_idx, feature_length, jump]
   d) jump is used in the backward pass when a key (feature) from an input tensor is mapped to multiple places in the output tensors

Differential Revision: D57055616
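To make the `permutes` layout concrete, here is a hedged, eager-mode PyTorch sketch of what the forward pass of such an operator conceptually does. It is an illustrative reference only, not the fbgemm_gpu CUDA kernel; the function name `permute_multi_embedding_ref` and the choice to pass the output lengths as a plain Python list are assumptions made for this example.

```
# Hedged reference sketch of the forward semantics implied by `permutes`.
# Each row is [input_tensor_id, output_tensor_id, input_start_idx,
# output_start_idx, feature_length, jump]; `jump` only matters in backward.
from typing import List

import torch


def permute_multi_embedding_ref(
    values: List[torch.Tensor],
    permutes: torch.Tensor,
    out_lengths: List[int],
) -> List[torch.Tensor]:
    batch_size = values[0].size(0)
    outputs = [values[0].new_empty(batch_size, length) for length in out_lengths]
    for in_t, out_t, in_start, out_start, length, _jump in permutes.tolist():
        # Copy one key's (feature's) slice from an input KT into an output KT.
        outputs[out_t][:, out_start : out_start + length] = values[in_t][
            :, in_start : in_start + length
        ]
    return outputs
```

In the backward pass of the real operator, the `jump` field is what allows gradients to be accumulated correctly when a single input feature (such as `f1` above) is copied to several output locations.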
Summary:
X-link: pytorch/FBGEMM#2771

# context
* added both **op-level** and **fn-level** benchmarks for the KT.regroup implementations
* analyzed the op-level and fn-level performance in terms of runtime and memory usage
* findings: **a**. at the fn-level, `permute_multi_embedding` (the new op) outperforms both the native-pytorch implementation and `permute_pooled_embs_auto_grad` (the current prod); **b**. at the op-level, the new op is slower than the current prod

# performance notes
The good:
1. the algorithm is designed so that it does not need to know in advance whether a 1-to-N mapping exists in the permutes
2. `_all_keys_used_once` is no longer needed
3. the torch.cat that was needed before calling the old operator is no longer required
4. no need to use `_pin_and_move` for the metadata (arguments); it is handled inside the operator, which is friendlier to tracing

The same bad:
1. it requires several HtoD communications (moving tensors to the device):
   a) [resolved] 3 tensors: `permutes`, `input_lengths`, and `output_lengths`. These tensors need to be on the device so that the CUDA kernels can access them.
   b) [resolved] 2 lists of (scalar_t*) pointers, for the input and output tensor lists.
   c) [resolved] there was no good way to let the kernel know the addresses of the input/output tensor lists, because those lists also need to be on the device.
2. a tensor.contiguous call in the backward function; the gradients coming into the backward are somehow not contiguous

# benchmark
* op-level results
```
INFO:root:size: 1024 x 57168; permute_multi_embedding: 1.5612200498580933 ms; permute_pooled_embs_auto_grad: 0.9015970826148987 ms
INFO:root:size: 1024 x 134096; permute_multi_embedding: 3.0794131755828857 ms; permute_pooled_embs_auto_grad: 2.114053726196289 ms
INFO:root:size: 1024 x 136752; permute_multi_embedding: 2.6919198036193848 ms; permute_pooled_embs_auto_grad: 2.159184455871582 ms
INFO:root:size: 1024 x 260944; permute_multi_embedding: 4.805435180664063 ms; permute_pooled_embs_auto_grad: 4.098493576049805 ms
INFO:root:size: 1024 x 538432; permute_multi_embedding: 9.359790802001953 ms; permute_pooled_embs_auto_grad: 8.504887580871582 ms
INFO:root:size: 1024 x 536592; permute_multi_embedding: 9.375926017761232 ms; permute_pooled_embs_auto_grad: 8.459586143493652 ms
```
* fn-level results
```
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 1011.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 3.2 ms | Memory (P90): 1011.0
```

# traces
* [files](https://drive.google.com/drive/folders/1_9hOtQUQeFICBVxQtusvpQ_VajduFUmR?usp=sharing)
```
[[email protected] /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy 949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy 350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy 582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json
```

Differential Revision: D58906839
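As a rough illustration of how per-op GPU runtimes like the op-level numbers above can be collected, here is a minimal CUDA-event timing sketch. The helper name `time_gpu_ms` and its iteration counts are assumptions for this example, not the actual benchmark harness.

```
# Hedged sketch: average GPU wall time per call, in milliseconds, via CUDA events.
import torch


def time_gpu_ms(fn, *args, iters: int = 100, warmup: int = 10) -> float:
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```

A memory column like the Memory (P90) readings could similarly be approximated by sampling torch.cuda.max_memory_allocated() around the calls.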
3bf66cd to e8f1081
This pull request was exported from Phabricator. Differential Revision: D58906839
facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request on Jun 22, 2024
Summary:
X-link: pytorch/torchrec#2158

# context
* added both **op-level** and **fn-level** benchmarks for the KT.regroup implementations
* analyzed the op-level and fn-level performance in terms of runtime and memory usage
* findings: **a**. at the fn-level, `permute_multi_embedding` (the new op) outperforms both the native-pytorch implementation and `permute_pooled_embs_auto_grad` (the current prod) by 50% in GPU runtime and 33% in memory usage; **b**. at the op-level, the new op is slightly slower than the current prod (by ~5% GPU runtime)
* conclusion: **we should use the new op**

# other considerations
The good:
1. the algorithm is designed so that it does not need to know in advance whether a 1-to-N mapping exists in the permutes
2. `_all_keys_used_once` is no longer needed
3. the torch.cat that was needed before calling the old operator is no longer required
4. no need to use `_pin_and_move` for the metadata (arguments); it is handled inside the operator, which is friendlier to tracing

The same bad:
1. it requires several HtoD communications (moving tensors to the device):
   a) [resolved] 3 tensors: `permutes`, `input_lengths`, and `output_lengths`. These tensors need to be on the device so that the CUDA kernels can access them.
   b) [resolved] 2 lists of (scalar_t*) pointers, for the input and output tensor lists.
   c) [resolved] there was no good way to let the kernel know the addresses of the input/output tensor lists, because those lists also need to be on the device.
2. a tensor.contiguous call in the backward function; the gradients coming into the backward are somehow not contiguous

# benchmark
* op-level results: the new op is ~5% slower in GPU runtime
```
INFO:root:size: 1024 x 136896; permute_multi_embedding: 2.25 ms; permute_pooled_embs: 2.15 ms; delta: 4.5%
INFO:root:size: 1024 x 108432; permute_multi_embedding: 1.79 ms; permute_pooled_embs: 1.7 ms; delta: 5.3%
INFO:root:size: 1024 x 277232; permute_multi_embedding: 4.54 ms; permute_pooled_embs: 4.37 ms; delta: 3.9%
INFO:root:size: 1024 x 244352; permute_multi_embedding: 4.01 ms; permute_pooled_embs: 3.83 ms; delta: 4.9%
INFO:root:size: 1024 x 524224; permute_multi_embedding: 8.62 ms; permute_pooled_embs: 8.25 ms; delta: 4.5%
INFO:root:size: 1024 x 564080; permute_multi_embedding: 9.27 ms; permute_pooled_embs: 8.92 ms; delta: 3.9%
```
* fn-level results: the new op is 50%+ faster in GPU runtime and uses 33% less GPU memory
```
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 1011.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 3.2 ms | Memory (P90): 1011.0
```

# traces
* [files](https://drive.google.com/drive/folders/1_9hOtQUQeFICBVxQtusvpQ_VajduFUmR?usp=sharing)
```
[[email protected] /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy 949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy 350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy 582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json
```
* native-pytorch {F1713052022}
* current prod {F1713052648}
* new op {F1713052907}
* runtime

|item|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**native-pytorch**|3.9 ms|3.1 ms|1.0 K|CPU-bound|
|**prod op**|2.1 ms|4.9 ms|1.5 K|GPU-bound due to torch.cat|
|**new op**|2.0 ms|2.2 ms|1.0 K|outperforms on both CPU and GPU runtime|

Differential Revision: D58906839
facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request on Jul 9, 2024
Summary:
X-link: pytorch/torchrec#2158
Pull Request resolved: #2771

# context
* added both **op-level** and **fn-level** benchmarks for the KT.regroup implementations
* analyzed the op-level and fn-level performance in terms of runtime and memory usage
* findings: **a**. at the fn-level, `permute_multi_embedding` (the new op) outperforms both the native-pytorch implementation and `permute_pooled_embs_auto_grad` (the current prod) by 50% in GPU runtime and 33% in memory usage; **b**. at the op-level, the new op is slightly slower than the current prod (by ~5% GPU runtime)
* conclusion: **we should use the new op**

# other considerations
The good:
1. the algorithm is designed so that it does not need to know in advance whether a 1-to-N mapping exists in the permutes
2. `_all_keys_used_once` is no longer needed
3. the torch.cat that was needed before calling the old operator is no longer required
4. no need to use `_pin_and_move` for the metadata (arguments); it is handled inside the operator, which is friendlier to tracing
5. no longer needs to fall back to the native-pytorch implementation when duplicates exist

The same bad:
1. it requires several HtoD communications (moving tensors to the device):
   a) [resolved] 3 tensors: `permutes`, `input_lengths`, and `output_lengths`. These tensors need to be on the device so that the CUDA kernels can access them.
   b) [resolved] 2 lists of (scalar_t*) pointers, for the input and output tensor lists.
   c) [resolved] there was no good way to let the kernel know the addresses of the input/output tensor lists, because those lists also need to be on the device.
2. a tensor.contiguous call in the backward function; the gradients coming into the backward are somehow not contiguous

# benchmark
* op-level results: the new op is ~5% slower in GPU runtime
```
INFO:root:size: 1024 x 136896; permute_multi_embedding: 2.25 ms; permute_pooled_embs: 2.15 ms; delta: 4.5%
INFO:root:size: 1024 x 108432; permute_multi_embedding: 1.79 ms; permute_pooled_embs: 1.7 ms; delta: 5.3%
INFO:root:size: 1024 x 277232; permute_multi_embedding: 4.54 ms; permute_pooled_embs: 4.37 ms; delta: 3.9%
INFO:root:size: 1024 x 244352; permute_multi_embedding: 4.01 ms; permute_pooled_embs: 3.83 ms; delta: 4.9%
INFO:root:size: 1024 x 524224; permute_multi_embedding: 8.62 ms; permute_pooled_embs: 8.25 ms; delta: 4.5%
INFO:root:size: 1024 x 564080; permute_multi_embedding: 9.27 ms; permute_pooled_embs: 8.92 ms; delta: 3.9%
```
* fn-level results: the new op is 50%+ faster in GPU runtime and uses 33% less GPU memory
```
_regroup_keyed_tenors | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1011.0
KeyedTensor.regroup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 5.0 ms | Memory (P90): 1517.0
KTRegroupAsDict | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 4.9 ms | Memory (P90): 1517.0
permute_multi_embs | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 1011.0
_regroup_keyed_tenors_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KeyedTensor.regroup_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
KTRegroupAsDict_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 1011.0
permute_multi_embs_dup | B: 1024 | F: 1020 | device: cuda | Runtime (P90): 3.2 ms | Memory (P90): 1011.0
```

# traces
* [files](https://drive.google.com/drive/folders/1_9hOtQUQeFICBVxQtusvpQ_VajduFUmR?usp=sharing)
```
[[email protected] /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy 949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy 350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy 582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json
```
* native-pytorch {F1713052022}
* current prod {F1713052648}
* new op {F1713052907}
* runtime

|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**native-pytorch**|3.9 ms|3.1 ms|1.0 K|CPU-bound, allows duplicates|
|**prod op**|2.1 ms|4.9 ms|1.5 K|GPU-bound due to torch.cat, does **NOT** allow duplicates|
|**new op**|2.0 ms|2.2 ms|1.0 K|outperforms on both CPU and GPU runtime, **ALLOWS** duplicates|

Reviewed By: dstaay-fb

Differential Revision: D58906839

fbshipit-source-id: 6cb28ca17daf16943b28af9b074d1032e7079912
Labels: CLA Signed, fb-exported