[KT.regroup Ops][6/N] use new op in KTRegroupAsDict module #2210
Conversation
This pull request was exported from Phabricator. Differential Revision: D53590566
Force-pushed 86f1984 to 7db7fb4
Summary: Pull Request resolved: pytorch#2210

# context
* adding PackedTensorAccessor for passing the index tensor to the kernel
* GPU trace reading slows down from 2.20 ms to 2.26 ms

# traces
* previous ~4.90 s {F1747994738}
* after ~2.00 ms {F1747994032}

Differential Revision: D53590566
Force-pushed 7db7fb4 to 78be7d3
Force-pushed 78be7d3 to d359594
Force-pushed d359594 to fc76364
Summary: Pull Request resolved: pytorch#2210

# context
* the new op `permute_multi_embedding` outperforms the original op `permute_pooled_embs_auto_grad`
* this diff makes the move to switch to the new op
* benchmark results: D58907223

# benchmark
* [traces](https://drive.google.com/drive/folders/1v_kD9n1jOkGUmYyix3-dUYiBDE_C3Hiv?usp=drive_link)
* previous prod {F1747994738}
* new prod {F1747994032}
* metrics

|Operator|GPU runtime|GPU memory|notes|
|---|---|---|---|
|**[previous prod] permute_pooled_embs**|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[new prod] permute_multi_embedding**|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

Reviewed By: dstaay-fb

Differential Revision: D53590566
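To make the "allows duplicates" distinction concrete, here is a minimal pure-Python sketch (not the TorchRec/FBGEMM implementation, and deliberately torch-free) of the regrouping that `KTRegroupAsDict` asks the op to do: slice a flat concatenation of per-key embeddings back into named groups, where the same key may feed more than one group. The function name `regroup` and all data below are illustrative.

```python
# Hypothetical sketch of keyed regrouping with duplicates allowed.
# values:     flat list of floats (per-key embeddings concatenated in layout order)
# key_widths: {key: embedding width}, in the layout order of `values`
# groups:     {group_name: [keys]}; a key may appear in several groups
def regroup(values, key_widths, groups):
    # Precompute each key's start offset in the flat buffer.
    offsets, pos = {}, 0
    for key, width in key_widths.items():
        offsets[key] = pos
        pos += width
    out = {}
    for name, keys in groups.items():
        chunk = []
        for key in keys:
            start = offsets[key]
            # Copying the slice (rather than consuming it) is what makes
            # duplicate uses of a key across groups possible.
            chunk.extend(values[start:start + key_widths[key]])
        out[name] = chunk
    return out

# Keys f1 (width 2) and f2 (width 3) laid out back to back.
flat = [1.0, 2.0, 10.0, 20.0, 30.0]
grouped = regroup(flat, {"f1": 2, "f2": 3}, {"a": ["f1", "f2"], "b": ["f1"]})
print(grouped["a"])  # [1.0, 2.0, 10.0, 20.0, 30.0]
print(grouped["b"])  # f1 used a second time: [1.0, 2.0]
```

The real op does this as a single fused permute/copy kernel over tensors; the sketch only shows the index bookkeeping that the "ALLOWS duplicates" table row refers to.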
Force-pushed fc76364 to 926646f
Force-pushed 926646f to 4544b58
Force-pushed 4544b58 to daae7a0
Force-pushed daae7a0 to 954d652
Summary:

# context
* previously `KTRegroupAsDict` could not really be supported by torch.export (IR) because this module has an initialization step that runs on the first batch

Differential Revision: D57578012
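The first-batch initialization problem mentioned above can be sketched as follows. This is a hypothetical, torch-free illustration (the class name `LazyRegroupAsDict` and its fields are invented, not TorchRec's code): the module's internal layout metadata is only known after one real batch has been seen, so exporting/tracing an un-warmed module would capture data-dependent setup logic rather than a fixed graph.

```python
# Hypothetical sketch: a module whose permute metadata is derived lazily
# from the first batch, which is the pattern that is hostile to torch.export.
class LazyRegroupAsDict:
    def __init__(self, keys, groups):
        self.keys = keys        # layout order of the flat input
        self.groups = groups    # {group_name: [keys]}
        self._offsets = None    # computed on the first forward call

    def forward(self, per_key):
        # per_key: {key: list of floats}
        if self._offsets is None:
            # "Initialization as running the first batch": per-key widths are
            # only known from real data, so this branch is data-dependent.
            off, pos = {}, 0
            for k in self.keys:
                off[k] = (pos, pos + len(per_key[k]))
                pos += len(per_key[k])
            self._offsets = off
        flat = [v for k in self.keys for v in per_key[k]]
        return {g: [x for k in ks for x in flat[slice(*self._offsets[k])]]
                for g, ks in self.groups.items()}

m = LazyRegroupAsDict(["f1", "f2"], {"a": ["f2", "f1"]})
out = m.forward({"f1": [1.0, 2.0], "f2": [3.0]})
print(out)  # {'a': [3.0, 1.0, 2.0]}
```

After the first call `_offsets` is fixed, so a warmed-up instance behaves like a static permute; the export difficulty is entirely in the `if self._offsets is None` branch.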
Force-pushed 954d652 to 196d544
Force-pushed 196d544 to b353525
fbshipit-source-id: 220878f99111fabc3de8a0ba83d319b36ee519f6