
[KT.regroup Ops][6/N] use new op in KTRegroupAsDict module #2210


Closed
wants to merge 2 commits into from

Conversation

@TroyGarden TroyGarden commented Jul 7, 2024

Summary:

context

  • the new op `permute_multi_embedding` outperforms the original op `permute_pooled_embs_auto_grad`
  • this diff makes the switch to the new op
  • benchmark results: D58907223

benchmark

  • [traces](https://drive.google.com/drive/folders/1v_kD9n1jOkGUmYyix3-dUYiBDE_C3Hiv?usp=drive_link)
  • previous prod
    {F1747994738}
  • new prod
    {F1747994032}
  • metrics
    |Operator|GPU runtime|GPU memory|notes|
    |---|---|---|---|
    |[previous prod] permute_pooled_embs|4.9 ms|1.5 K|GPU-bound, does NOT allow duplicates, PT2-incompatible `pin_and_move`|
    |[new prod] permute_multi_embedding|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, ALLOWS duplicates, PT2 friendly|

Differential Revision: D53590566
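The regroup operation being benchmarked above can be sketched in plain Python. This is only a conceptual illustration (no torchrec dependency); the function name, the flat-buffer layout, and all argument shapes are assumptions for illustration, not the actual CUDA kernel or module API:

```python
def regroup_as_dict(flat, key_lengths, groups):
    """Conceptual sketch of a KT-regroup op.

    flat: one row of concatenated pooled embeddings (a flat list of floats).
    key_lengths: {key: embedding_dim}, in storage order.
    groups: {group_name: [keys]}; a key may appear in several groups,
    which is the "allows duplicates" property noted in the table above.
    """
    # Compute each key's [start, end) slice in the flat buffer.
    offsets, start = {}, 0
    for key, length in key_lengths.items():
        offsets[key] = (start, start + length)
        start += length

    # Copy each group's key-slices, in order, into its output buffer.
    out = {}
    for name, keys in groups.items():
        chunk = []
        for key in keys:
            s, e = offsets[key]
            chunk.extend(flat[s:e])
        out[name] = chunk
    return out
```

Because a key's slice is simply copied wherever it is requested, the same key can feed multiple output groups without any extra bookkeeping, unlike a pure permutation.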

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 7, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D53590566

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 7, 2024
Summary:
Pull Request resolved: pytorch#2210

# context
* adding PackedTensorAccessor for passing the index tensor to kernel
* GPU trace reading slows down slightly, from 2.20 ms to 2.26 ms

# traces
* previous ~4.90 ms
 {F1747994738}
* after ~2.00ms
 {F1747994032}

Differential Revision: D53590566
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 8, 2024

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 9, 2024

TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 11, 2024
@TroyGarden TroyGarden changed the title use new op in KTRegroupAsDict module [KT.regroup Ops][6/N] use new op in KTRegroupAsDict module Jul 13, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Jul 26, 2024
PaulZhang12 pushed a commit to PaulZhang12/torchrec that referenced this pull request Jul 29, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 6, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 7, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 7, 2024
@TroyGarden TroyGarden closed this Aug 8, 2024
@TroyGarden TroyGarden deleted the export-D53590566 branch August 8, 2024 22:13
Summary:
# context
* previously `KTRegroupAsDict` couldn't really be supported by torch.export (IR) because this module has an initialization step that runs on the first batch.

Differential Revision: D57578012
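The export problem described above can be illustrated with a minimal sketch (pure Python; every class and attribute name here is invented for illustration). A module that builds its permute plan inside the first forward call carries data-dependent state a tracer cannot capture, whereas building the plan at construction time leaves each call a pure function of its inputs:

```python
class LazyRegroup:
    """Builds its offset plan on the first call (data-dependent init).

    This mirrors the problematic pattern: behavior depends on state
    created inside forward(), which export-style tracing cannot see.
    """

    def __init__(self, groups):
        self.groups = groups   # {group_name: [keys]}
        self.offsets = None    # filled in lazily on the first batch

    def __call__(self, key_lengths, flat):
        if self.offsets is None:  # one-time init during the first forward
            self.offsets, start = {}, 0
            for key, length in key_lengths.items():
                self.offsets[key] = (start, start + length)
                start += length
        return {
            name: [x for key in keys
                   for x in flat[self.offsets[key][0]:self.offsets[key][1]]]
            for name, keys in self.groups.items()
        }


class EagerRegroup(LazyRegroup):
    """Same op, but the plan is fixed at construction, so each call is a
    pure function of its tensor input and is straightforward to trace."""

    def __init__(self, groups, key_lengths):
        super().__init__(groups)
        self.offsets, start = {}, 0
        for key, length in key_lengths.items():
            self.offsets[key] = (start, start + length)
            start += length
```

The eager variant front-loads the one-time work into `__init__`, which is the general shape of the fix for export-unfriendly lazy initialization.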
@TroyGarden TroyGarden restored the export-D53590566 branch August 10, 2024 02:13
@TroyGarden TroyGarden reopened this Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
TroyGarden added a commit to TroyGarden/torchrec that referenced this pull request Aug 10, 2024
PaulZhang12 pushed a commit to PaulZhang12/torchrec that referenced this pull request Aug 12, 2024
PaulZhang12 pushed a commit that referenced this pull request Aug 19, 2024
Reviewed By: dstaay-fb

Differential Revision: D53590566

fbshipit-source-id: 220878f99111fabc3de8a0ba83d319b36ee519f6
Labels
CLA Signed, fb-exported

2 participants