
[KT.regroup Ops][6/N] use new op in KTRegroupAsDict module #2210


Closed
wants to merge 2 commits into from

Conversation

@TroyGarden TroyGarden commented Jul 7, 2024

Summary:

context

  • the new op `permute_multi_embedding` outperforms the original op `permute_pooled_embs_auto_grad`
  • this diff makes the switch to the new op
  • benchmark results: D58907223

benchmark

  • [traces](https://drive.google.com/drive/folders/1v_kD9n1jOkGUmYyix3-dUYiBDE_C3Hiv?usp=drive_link)
  • previous prod
    {F1747994738}
  • new prod
    {F1747994032}
  • metrics
    |Operator|GPU runtime|GPU memory|notes|
    |---|---|---|---|
    |[previous prod] permute_pooled_embs|4.9 ms|1.5 K|GPU-bound, does NOT allow duplicates, PT2-incompatible `pin_and_move`|
    |[new prod] permute_multi_embedding|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, ALLOWS duplicates, PT2 friendly|

Differential Revision: D53590566
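The regroup operation being benchmarked above can be sketched in plain Python. This is only a conceptual illustration (no torchrec dependency); the function name, the flat-buffer layout, and all argument shapes are assumptions for illustration, not the actual CUDA kernel or module API:

```python
def regroup_as_dict(flat, key_lengths, groups):
    """Conceptual sketch of a KT-regroup op.

    flat: one row of concatenated pooled embeddings (a flat list of floats).
    key_lengths: {key: embedding_dim}, in storage order.
    groups: {group_name: [keys]}; a key may appear in several groups,
    which is the "allows duplicates" property noted in the table above.
    """
    # Compute each key's [start, end) slice in the flat buffer.
    offsets, start = {}, 0
    for key, length in key_lengths.items():
        offsets[key] = (start, start + length)
        start += length

    # Copy each group's key-slices, in order, into its output buffer.
    out = {}
    for name, keys in groups.items():
        chunk = []
        for key in keys:
            s, e = offsets[key]
            chunk.extend(flat[s:e])
        out[name] = chunk
    return out
```

Because a key's slice is simply copied wherever it is requested, the same key can feed multiple output groups without any extra bookkeeping, unlike a pure permutation.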

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 7, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D53590566

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 7, 2024
Summary:
Pull Request resolved: pytorch#2210

# context
* adding PackedTensorAccessor for passing the index tensor to kernel
* GPU trace reading slows down slightly, from 2.20 ms to 2.26 ms

# traces
* previous ~4.90 ms
 {F1747994738}
* after ~2.00ms
 {F1747994032}

Differential Revision: D53590566
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 8, 2024

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 9, 2024

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 11, 2024
@TroyGarden TroyGarden changed the title use new op in KTRegroupAsDict module [KT.regroup Ops][6/N] use new op in KTRegroupAsDict module Jul 13, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 26, 2024
PaulZhang12 pushed a commit to PaulZhang12/torchrec that referenced this pull request Jul 29, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 6, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 7, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 7, 2024
@TroyGarden TroyGarden closed this Aug 8, 2024
@TroyGarden TroyGarden deleted the export-D53590566 branch August 8, 2024 22:13
Summary:
# context
* previously `KTRegroupAsDict` couldn't really be supported by torch.export (IR) because this module has an initialization step that runs on the first batch.

Differential Revision: D57578012
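The export problem described above can be illustrated with a minimal sketch (pure Python; every class and attribute name here is invented for illustration). A module that builds its permute plan inside the first forward call carries data-dependent state a tracer cannot capture, whereas building the plan at construction time leaves each call a pure function of its inputs:

```python
class LazyRegroup:
    """Builds its offset plan on the first call (data-dependent init).

    This mirrors the problematic pattern: behavior depends on state
    created inside forward(), which export-style tracing cannot see.
    """

    def __init__(self, groups):
        self.groups = groups   # {group_name: [keys]}
        self.offsets = None    # filled in lazily on the first batch

    def __call__(self, key_lengths, flat):
        if self.offsets is None:  # one-time init during the first forward
            self.offsets, start = {}, 0
            for key, length in key_lengths.items():
                self.offsets[key] = (start, start + length)
                start += length
        return {
            name: [x for key in keys
                   for x in flat[self.offsets[key][0]:self.offsets[key][1]]]
            for name, keys in self.groups.items()
        }


class EagerRegroup(LazyRegroup):
    """Same op, but the plan is fixed at construction, so each call is a
    pure function of its tensor input and is straightforward to trace."""

    def __init__(self, groups, key_lengths):
        super().__init__(groups)
        self.offsets, start = {}, 0
        for key, length in key_lengths.items():
            self.offsets[key] = (start, start + length)
            start += length
```

The eager variant front-loads the one-time work into `__init__`, which is the general shape of the fix for export-unfriendly lazy initialization.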
@TroyGarden TroyGarden restored the export-D53590566 branch August 10, 2024 02:13
@TroyGarden TroyGarden reopened this Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
PaulZhang12 pushed a commit to PaulZhang12/torchrec that referenced this pull request Aug 12, 2024
PaulZhang12 pushed a commit that referenced this pull request Aug 19, 2024
Reviewed By: dstaay-fb

Differential Revision: D53590566

fbshipit-source-id: 220878f99111fabc3de8a0ba83d319b36ee519f6
Labels
CLA Signed, fb-exported

2 participants