Change weight to channel-packing in Conv1d #7057


Merged
merged 1 commit into from
Nov 26, 2024

Conversation

yipjustin
Contributor

Summary:
In a model we evaluate, the Conv1d weight tensor has shape (256, 1, 7): 256 is the out-channel count, 1 the in-channel count, and 7 the kernel size.

Under weight-packing this tensor is mapped to extents of `(7 / 4, 1, 256)`, which uses memory poorly: each (x, y) plane occupies 4096 bytes on the test device (for both 'OPTIMAL' and 'LINEAR' tiling) even though only 2 texels per plane are used, so the tensor consumes 1 MB.

A temporary workaround is to use channel-packing instead. The tensor then maps to extents `(7, 1, 64)`, which is 75% less deep and hence consumes far less memory, at the cost of 4 times as many texel fetches. Lab tests show no performance regression for our model.
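The arithmetic above can be sketched with a toy cost model. The 8-byte texel size (4-channel fp16) and the 4096-byte per-plane padding are assumptions of this sketch, chosen to be consistent with the numbers reported above; they are not values queried from the device:

```python
# Toy cost model for a 3D texture: every (x, y) plane is padded up to
# a multiple of PAGE bytes, and the texture stores one plane per z slice.
PAGE = 4096         # assumed per-plane alignment on the test device
TEXEL_BYTES = 8     # assumed 4-channel fp16 texels

def texture_bytes(x, y, z):
    """Bytes used by an (x, y, z) texture under the padded-plane model."""
    plane = -(-(x * y * TEXEL_BYTES) // PAGE) * PAGE  # round up to PAGE
    return z * plane

# Conv1d weight (out=256, in=1, kernel=7)
weight_packed = texture_bytes(2, 1, 256)   # extents (ceil(7/4), 1, 256)
channel_packed = texture_bytes(7, 1, 64)   # extents (7, 1, ceil(256/4))

print(weight_packed)   # 1048576 -> 1 MiB
print(channel_packed)  # 262144  -> 256 KiB, 75% less
```

Because both layouts fit their planes well under one page, the cost is driven entirely by the depth (z), which is why shrinking the depth from 256 to 64 cuts memory by exactly 75%.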

Future work:

A more optimal solution is to map the weight tensor `(out-channel, in-channel, kernel)` into extents `(x=out-channel, y=kernel, z=in-channel)`. In our case this yields a close-to-optimal layout.
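Under a hypothetical padded-plane cost model (8-byte texels, planes padded to 4096 bytes — assumptions for illustration, not measured values), the proposed mapping collapses the depth to the in-channel count and lands within one page of the raw data size:

```python
# Hypothetical cost model: 8-byte texels, planes padded to 4096 bytes.
PAGE = 4096
TEXEL_BYTES = 8

def texture_bytes(x, y, z):
    plane = -(-(x * y * TEXEL_BYTES) // PAGE) * PAGE  # round up to PAGE
    return z * plane

# Proposed mapping for weight (out=256, in=1, kernel=7):
# (out-channel, in-channel, kernel) -> (x=out-channel, y=kernel, z=in-channel)
proposed = texture_bytes(256, 7, 1)      # a single z-plane
raw = 256 * 1 * 7 * TEXEL_BYTES          # unpadded data size

print(proposed, raw)  # 16384 14336 -> within one page of the raw size
```

With in-channel = 1 the texture is only one plane deep, so padding waste is bounded by a single page rather than multiplied across 64 or 256 planes.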

Reviewed By: nathanaelsee, jorgep31415

Differential Revision: D66417572

pytorch-bot bot commented Nov 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/7057

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6b5a494 with merge base a35cb73:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 25, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D66417572

yipjustin added a commit to yipjustin/executorch that referenced this pull request Nov 25, 2024

yipjustin added a commit to yipjustin/executorch that referenced this pull request Nov 25, 2024

@yipjustin yipjustin added the release notes: backends [DO NOT USE] Changes to any of the backend delegates label Nov 25, 2024
yipjustin added a commit to yipjustin/executorch that referenced this pull request Nov 26, 2024

@facebook-github-bot facebook-github-bot merged commit 2967302 into pytorch:main Nov 26, 2024
42 checks passed