Qualcomm AI Engine Direct - Support kv_cached stories 110M llama2 #4142
Conversation
chunit-quic commented Jul 3, 2024
- Add custom memory descriptor
- Add e2e example script verified with stories 110M in 8a8w and 16a4w
- Add qnn_llama_runner to run static LLAMA
- Add README
- Add slice op test
- Change RemoveClone to RemoveRedundancy
- Change SimpleADB parameter artifact to build_path and update related code
- Change multi-head attention to multiple single-head attentions (see the sketch after this list)
- Move sort inputs from execute to init
- Remove split op
- Support u16 and u8 mixed-precision quantization
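
For context, a minimal sketch of the "multiple single-head" idea: each head gets its own projections and the head outputs are concatenated. The module and parameter names are illustrative only (not the ones in this PR), and the KV cache is omitted for brevity.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    # One independent head with its own projections, so each head lowers
    # as a simple matmul/softmax chain.
    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, head_dim, bias=False)
        self.wk = nn.Linear(dim, head_dim, bias=False)
        self.wv = nn.Linear(dim, head_dim, bias=False)
        self.scale = head_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class MultiSingleHeadAttention(nn.Module):
    # Instead of one fused multi-head module, run N single-head modules
    # and concatenate their outputs before the output projection.
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        head_dim = dim // n_heads
        self.heads = nn.ModuleList([SingleHeadAttention(dim, head_dim) for _ in range(n_heads)])
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(torch.cat([h(x) for h in self.heads], dim=-1))
```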
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4142
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (2 Unrelated Failures) As of commit 14859bf with merge base f32d707:
- BROKEN TRUNK: The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE: The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR contains our own static llama and the corresponding runner. We use stories 110M for verification. Two more PRs based on this one will be submitted soon.
By the way, we noticed that some of our test cases encounter errors if we change capture_pre_autograd_graph to torch.export. Take a test case in test_qnn_delegate.py for example (TestQNNQuantizedModel.test_qnn_backend_pixel_unshuffle_math_equivalent): it fails at prepare_pt2e. Could you kindly give us some recommendations? Thank you very much!
This PR is a base for QC llama2 tasks. Nonetheless, we think it's better to use the same Meta llama model. Future work will be tracked in #4030.
This is a great milestone! Thank you for putting it up. Regarding this
I don't see the change from
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Another thing: is there a way to run the test on the host machine instead of a device, like a simulator?
backends/cadence/CMakeLists.txt (Outdated)
@@ -27,4 +27,3 @@ set(_common_include_directories ${EXECUTORCH_ROOT}/..)

add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/operators)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/kernels)
I think this is fixed. Mind rebasing again?
Done. Thank you
@register_node_visitor
class Cast(NodeVisitor):
What was the reason to remove it?
Pardon for the confusion. We changed this file to op_to.py.
The reason is that we need to convert the torch to_copy op into two different backend ops, cast or convert, depending on the dtypes involved. Therefore we made a new file for it.
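
For readers following along, a rough sketch of that conditional lowering under stated assumptions: the builder callables (build_cast_op, build_convert_op) and the dtype check are placeholders for illustration, not the actual QNN visitor API.

```python
import torch

# Dtypes treated as "already quantized" in this sketch (illustrative set).
QUANT_DTYPES = {torch.int8, torch.uint8, torch.int16, torch.int32}

def lower_to_copy(node, build_cast_op, build_convert_op):
    # Dispatch a single aten._to_copy node to one of two backend ops.
    in_dtype = node.args[0].meta["val"].dtype   # dtype of the incoming tensor
    out_dtype = node.meta["val"].dtype          # dtype produced by _to_copy
    if in_dtype in QUANT_DTYPES and out_dtype in QUANT_DTYPES:
        # Both sides quantized: re-encode with a Convert-style backend op.
        return build_convert_op(node)
    # Otherwise it is a plain dtype change: emit a Cast-style backend op.
    return build_cast_op(node)
```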
call_delegate = [
    node
    for node in graph_module.graph.nodes
    if node.op == "call_function" and node.name == "executorch_call_delegate"
]
Did you expect to run this pass after to_backend? Was it because of some issues you ran into?
Thank you for pointing it out.
Yes, we intentionally invoke this pass after to_backend to change the IO data type to a non-FP type.
I'm trying to understand the idea. It seems like FoldQDQ will remove all the q/dq nodes, and the insert-IO-QDQ pass will insert q/dq nodes for the subgraph inside the QNN backend again. In that case, why would the IO data type become a non-FP type?
Right, the flow you mentioned above is exactly what we did before. Yet in LLAMA we specifically deal with its KV IOs: quantizing and dequantizing are meaningless for these IOs, so we keep them in their quantized type all the time.
Here is what we do for this kind of tensor:
1. Tag IO nodes as a quantized type in examples/qualcomm/llama2/llama.py:340
2. Skip inserting quantize/dequantize nodes for them in insert_io_qdq
3. Set the correct quantized type on their spec in examples/qualcomm/llama2/llama.py:316
Feel free to let me know if anything is unclear to you. Thank you. :D
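
For illustration, a condensed sketch of steps 1 and 2; the metadata key, dtype, and helper names below are assumptions made for this example, the real implementation lives in the files referenced above.

```python
import torch

QUANT_IO_KEY = "quant_io_dtype"  # illustrative metadata key, not the real one

def tag_kv_io(graph_module: torch.fx.GraphModule, is_kv_io) -> None:
    # Step 1: mark KV-cache IO nodes so later passes know they must stay
    # in their quantized dtype instead of getting fp32 Q/DQ wrappers.
    for node in graph_module.graph.nodes:
        if node.op in ("placeholder", "output") and is_kv_io(node):
            node.meta[QUANT_IO_KEY] = torch.uint8

def needs_io_qdq(node) -> bool:
    # Step 2: the insert-IO-QDQ pass consults this check and skips tagged
    # nodes, so the KV tensors flow in and out as quantized data (step 3
    # then sets the matching dtype on the tensor spec).
    return QUANT_IO_KEY not in node.meta
```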
Thank you for the prompt reply! Really appreciate it. :D
Yes, we are still using capture_pre_autograd_graph. Here is the change we tried and the error it hits:

diff --git a/backends/qualcomm/tests/utils.py b/backends/qualcomm/tests/utils.py
index 295033e5..0e615dee 100644
--- a/backends/qualcomm/tests/utils.py
+++ b/backends/qualcomm/tests/utils.py
@@ -230,7 +230,7 @@ class TestQNN(unittest.TestCase):
custom_quant_annotations: Tuple[Callable] = (),
quant_dtype: QuantDtype = QuantDtype.use_8a8w,
) -> torch.fx.GraphModule:
- m = torch._export.capture_pre_autograd_graph(module, inputs)
+ m = torch.export.export(module, inputs).module()
quantizer = QnnQuantizer()
quantizer.add_custom_quant_annotations(custom_quant_annotations)
======================================================================
ERROR: test_qnn_backend_pixel_unshuffle_math_equivalent (__main__.TestQNNQuantizedModel)
----------------------------------------------------------------------
Traceback (most recent call last):
File "${executorch}/backends/qualcomm/tests/test_qnn_delegate.py", line 1107, in test_qnn_backend_pixel_unshuffle_math_equivalent
module = self.get_qdq_module(module, sample_input)
File "${executorch}/backends/qualcomm/tests/utils.py", line 253, in get_qdq_module
prepared(*inputs)
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 738, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 316, in __call__
raise e
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 303, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1657, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1675, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.35", line 11, in forward
_unsafe_view = torch.ops.aten._unsafe_view.default(activation_post_process_2, [2, 8, 3, 3]); activation_post_process_2 = None
File "${python3.10}/site-packages/torch/_ops.py", line 670, in __call__
return self_._op(*args, **kwargs)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
----------------------------------------------------------------------
Ran 1 test in 0.111s
Should be possible. We could link HTP lib to
636ddf0 to 14859bf (Compare; the commit message repeats the change list above)
Actually it has been on our TODO list for a while... 😢
Another note: I noticed there are many passes running before to_backend for recomposing ops. There is an experimental API that might be worth trying for this.
Thank you for pointing it out! We will find some time to try this function in our lowering flow.
Sure, and no pressure at all. It's an experimental API and there is no plan to replace the existing API with it. Just wanted to share in case it's helpful (as we're also looking for feedback to improve the devx for backend authors).
We can give it a shot. If it's problematic, we can work with the Compiler team to figure out a better solution.
From what I understand, every shard is a QNN context binary, and each one takes almost all of the memory on the DSP. Among the 4 shards, we'd need to load and unload for every inference. In the meantime, we'll use the spill buffer to optimize memory among shards. If we use the custom op to shard the graph, I imagine we'd need to destroy the previous context binary and load the next one?
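
For illustration, the per-inference shard cycling described above could look roughly like the sketch below; load_context, run_graph, and free_context are hypothetical stand-ins for whatever the runtime actually exposes, not real QNN HTP APIs.

```python
from typing import Any, Callable, Iterable

def run_sharded_inference(
    shard_binaries: Iterable[bytes],
    inputs: Any,
    load_context: Callable[[bytes], Any],   # placeholder: loads one context binary onto the DSP
    run_graph: Callable[[Any, Any], Any],   # placeholder: executes the graph in a loaded context
    free_context: Callable[[Any], None],    # placeholder: releases the context's DSP memory
) -> Any:
    # Each shard is its own QNN context binary and nearly fills DSP memory,
    # so every inference cycles through load -> run -> free, shard by shard.
    activations = inputs
    for binary in shard_binaries:
        ctx = load_context(binary)
        activations = run_graph(ctx, activations)
        free_context(ctx)
    return activations
```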