Qualcomm AI Engine Direct - Enable zero copy feature #2531

shewu-quic · 2024-03-20T09:20:27Z

Summary:

- Add a "shared_buffer" argument to compiler_spec, qnn_executor_runner, and the test scripts.
  - Strictly speaking, shared_buffer should be a runtime option, since users are responsible for allocating memory for tensors on the device. However, there seems to be no way to pass a runtime option to QnnBackend, so we put it in compile_spec for now.
- Implement SharedBuffer to allocate and free RPC memory.
- Add QnnMemManager to register shared buffers for tensors.
  - During execution, we register the memory of tensor data with QNN, and we deregister it when QnnBackend is destroyed.
- Add two APIs, `void* QnnExecuTorchAllocCustomMem(size_t bytes, size_t alignment)` and `void QnnExecuTorchFreeCustomMem(void* buffer_ptr)`, to allocate and free RPC memory with SharedBuffer.
  - Users are responsible for allocating enough bytes for each tensor and for setting alignment to MemoryAllocator::kDefaultAlignment. See runtime/core/memory_allocator.h.
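The summary splits responsibilities: the user allocates tensor memory through the custom allocator, and the backend registers each buffer once during execution and deregisters it at teardown. The C++ sketch that follows models that flow under stated assumptions and is not the PR's implementation. `FakeSharedBuffer`, `SketchAllocCustomMem`, `SketchFreeCustomMem`, `SketchMemManager`, `RunZeroCopyDemo` and `kAssumedDefaultAlignment` are hypothetical names; plain aligned heap memory stands in for RPC memory, and integer counters stand in for QNN memory handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// ASSUMPTION: stand-in for MemoryAllocator::kDefaultAlignment; check the real
// constant in runtime/core/memory_allocator.h before relying on this value.
constexpr std::size_t kAssumedDefaultAlignment = alignof(void*);

// HYPOTHETICAL stand-in for the PR's SharedBuffer. It hands out ordinary
// aligned heap memory so the sketch runs anywhere; it is NOT RPC memory.
class FakeSharedBuffer {
 public:
  void* Alloc(std::size_t bytes, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);  // power of two
    // std::aligned_alloc wants a size that is a multiple of the alignment.
    std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    if (rounded == 0) rounded = alignment;
    void* ptr = std::aligned_alloc(alignment, rounded);
    if (ptr != nullptr) owned_.insert(ptr);
    return ptr;
  }
  void Free(void* ptr) {
    if (owned_.erase(ptr) > 0) std::free(ptr);  // ignore foreign pointers
  }
  bool Owns(void* ptr) const { return owned_.count(ptr) > 0; }

 private:
  std::unordered_set<void*> owned_;
};

FakeSharedBuffer& Pool() {
  static FakeSharedBuffer pool;
  return pool;
}

// HYPOTHETICAL mirrors of QnnExecuTorchAllocCustomMem / QnnExecuTorchFreeCustomMem:
// same parameter lists as in the PR summary, different names and bodies.
void* SketchAllocCustomMem(std::size_t bytes, std::size_t alignment) {
  return Pool().Alloc(bytes, alignment);
}
void SketchFreeCustomMem(void* buffer_ptr) { Pool().Free(buffer_ptr); }

// HYPOTHETICAL registry modelled on the described QnnMemManager behaviour.
class SketchMemManager {
 public:
  explicit SketchMemManager(std::size_t& deregister_count)
      : deregister_count_(deregister_count) {}

  // Returns true when `ptr` is registered (zero copy possible) and false when
  // it is not shared memory, in which case the caller must copy the data.
  bool RegisterIfNeeded(void* ptr) {
    if (handles_.count(ptr) > 0) return true;  // registered on an earlier call
    if (!Pool().Owns(ptr)) return false;
    handles_[ptr] = next_handle_++;  // a real backend would register with QNN here
    return true;
  }
  std::size_t NumRegistered() const { return handles_.size(); }

  // Deregister everything when the owner (QnnBackend in the PR) is destroyed.
  ~SketchMemManager() {
    for (const auto& entry : handles_) {
      (void)entry;  // a real backend would deregister entry.second with QNN here
      ++deregister_count_;
    }
  }

 private:
  std::unordered_map<void*, std::uint64_t> handles_;
  std::uint64_t next_handle_ = 1;
  std::size_t& deregister_count_;
};

struct DemoResult {
  bool shared_registered, repeat_registered, plain_registered;
  std::size_t registered, deregistered;
};

DemoResult RunZeroCopyDemo() {
  DemoResult r{};
  // The caller sizes the buffer: here a 1x3x224x224 float tensor.
  const std::size_t bytes = 1 * 3 * 224 * 224 * sizeof(float);
  void* input = SketchAllocCustomMem(bytes, kAssumedDefaultAlignment);
  std::vector<float> plain(16);  // ordinary memory, not from the custom allocator
  {
    SketchMemManager mgr(r.deregistered);
    r.shared_registered = mgr.RegisterIfNeeded(input);
    r.repeat_registered = mgr.RegisterIfNeeded(input);  // no second registration
    r.plain_registered = mgr.RegisterIfNeeded(plain.data());
    r.registered = mgr.NumRegistered();
  }  // manager destroyed: deregistration happens before the memory is freed
  SketchFreeCustomMem(input);
  return r;
}
```

Note the ordering in `RunZeroCopyDemo`: the manager (standing in for QnnBackend) is destroyed, and so deregisters, before the buffer is freed; freeing first would leave a registration pointing at released memory. Returning `false` for pointers that did not come from the custom allocator, so the caller falls back to copying, is a design assumption of this sketch; the PR summary does not say how such tensors are handled.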

pytorch-bot · 2024-03-20T09:20:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/2531

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 1 Unrelated Failure

As of commit e541fb4 with merge base 7f96f5a:

NEW FAILURES - The following jobs have failed:

pull / test-llama-runner-linux (fp16, buck2) / linux-job (gh)
RuntimeError: Command docker exec -t db46c3b5bef390f08c7ff687f399c4894c50543c6506260b8621a219b4355d84 /exec failed with exit code 2
pull / test-llama-runner-linux (fp16, cmake) / linux-job (gh)
RuntimeError: Command docker exec -t 0f89a2581ac9c8af2d4b9b16dc2cb8f8ffbd6d0ef6b9639d33772ea0ed13ffb5 /exec failed with exit code 2
pull / test-llama-runner-linux (fp32, buck2) / linux-job (gh)
RuntimeError: Failed to compile /tmp/tmpah9x4wjy/data.json to /tmp/tmpah9x4wjy/data.pte. Set ET_EXIR_SAVE_FLATC_INPUTS_ON_FAILURE=1 to save input files on failure.
pull / test-llama-runner-linux (fp32, cmake) / linux-job (gh)
RuntimeError: Failed to compile /tmp/tmpirb3mwpr/data.json to /tmp/tmpirb3mwpr/data.pte. Set ET_EXIR_SAVE_FLATC_INPUTS_ON_FAILURE=1 to save input files on failure.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / unittest / macos (buck2) / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 5

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2024-03-20T18:08:44Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

cccclai · 2024-03-21T02:53:23Z

Hey, do you mind rebasing again? #2506 has landed and there is now a merge conflict.

shewu-quic · 2024-03-21T04:02:26Z

Thanks for the reminder.
I have rebased my PR.
However, I see errors in the quantized tests: get_fake_program in backend_api.py seems to fail.
Do you have any idea what is going on, or am I missing something?

Tested on the pytorch/main branch:

python3 backends/qualcomm/tests/test_qnn_delegate.py TestQNNQuantizedOperator.test_qnn_backend_16a4w_conv2d -b build_android -s $DEVICE -H $HOST -m $MODEL -r executorch -a unit_test

cccclai · 2024-03-21T07:01:43Z

Oh that might be related to a recent change in #2502. cc: @lucylq

lucylq · 2024-03-21T16:27:03Z

> Oh that might be related to a recent change in #2502. cc: @lucylq

Thanks - taking a look!

Summary: Catch the general Exception case, not only AssertionError. D55047794 caused an error with QC: #2531
Reviewed By: cccclai
Differential Revision: D55204315
fbshipit-source-id: 35df9d74bd55d329cbcab03e5c70e289b45796db

lucylq · 2024-03-22T13:46:19Z

Hey @shewu-quic, do you mind rebasing and trying again? Added a fallback here: #2564

shewu-quic · 2024-03-25T05:53:13Z

Done. Thanks for your help. :)

facebook-github-bot · 2024-03-25T06:14:13Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


facebook-github-bot · 2024-03-25T07:26:05Z

@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2024-03-25T08:44:55Z

@cccclai merged this pull request in a531ca5.

facebook-github-bot added the CLA Signed label Mar 20, 2024

shewu-quic force-pushed the dev/hutton/enable_zero_copy branch from 250005d to 55c504d on March 20, 2024 09:28

shewu-quic force-pushed the dev/hutton/enable_zero_copy branch from 55c504d to a0b82f3 on March 21, 2024 03:51

shewu-quic force-pushed the dev/hutton/enable_zero_copy branch from a0b82f3 to 460f564 on March 25, 2024 05:52

shewu-quic force-pushed the dev/hutton/enable_zero_copy branch from 460f564 to e541fb4 on March 25, 2024 06:22

cccclai approved these changes Mar 25, 2024


facebook-github-bot closed this in a531ca5 Mar 25, 2024

facebook-github-bot added the Merged label Mar 25, 2024

haowhsu-quic mentioned this pull request Apr 11, 2024

[Draft] Qualcomm AI Engine Direct - Support kv_cached llama2 model #2966

Closed
