Qualcomm AI Engine Direct - Support kv_cached stories 110M llama2 #4142
Conversation
chunit-quic commented Jul 3, 2024
- Add custom memory descriptor
- Add e2e example script verified with stories 110M in 8a8w and 16a4w
- Add qnn_llama_runner to run static LLAMA
- Add README
- Add slice op test
- Change RemoveClone to RemoveRedundancy
- Change SimpleADB parameter artifact to build_path and update related code
- Change multi-head attention to multiple single-head attentions (see the sketch after this list)
- Move sort inputs from execute to init
- Remove split op
- Support u16 and u8 mixed-precision quantization
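
For context, a minimal sketch of the "multiple single-head" idea: each head gets its own projections and the head outputs are concatenated. The module and parameter names are illustrative only (not the ones in this PR), and the KV cache is omitted for brevity.

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    # One independent head with its own projections, so each head lowers
    # as a simple matmul/softmax chain.
    def __init__(self, dim: int, head_dim: int):
        super().__init__()
        self.wq = nn.Linear(dim, head_dim, bias=False)
        self.wk = nn.Linear(dim, head_dim, bias=False)
        self.wv = nn.Linear(dim, head_dim, bias=False)
        self.scale = head_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class MultiSingleHeadAttention(nn.Module):
    # Instead of one fused multi-head module, run N single-head modules
    # and concatenate their outputs before the output projection.
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        head_dim = dim // n_heads
        self.heads = nn.ModuleList([SingleHeadAttention(dim, head_dim) for _ in range(n_heads)])
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.wo(torch.cat([h(x) for h in self.heads], dim=-1))
```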
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4142
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (2 Unrelated Failures) As of commit 14859bf with merge base f32d707:
- BROKEN TRUNK: The following job failed but was present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
- UNSTABLE: The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR contains our own static llama and the corresponding runner. We use stories 110M for verification. Two more PRs based on this one will be submitted soon.
By the way, we noticed that some of our test cases encounter errors if we change capture_pre_autograd_graph to torch.export. Take a test case in test_qnn_delegate.py for example (TestQNNQuantizedModel.test_qnn_backend_pixel_unshuffle_math_equivalent): it fails at prepare_pt2e. Could you kindly give us some recommendations? Thank you very much!
This PR is a base for QC llama2 tasks. Nonetheless, we think it's better to use the same Meta llama model. Future work will be tracked in #4030.
This is a great milestone! Thank you for putting it up. Regarding this
I don't see the change from
@cccclai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Another thing: is there a way to run the test on the host machine instead of a device, like a simulator?
backends/cadence/CMakeLists.txt (Outdated)
@@ -27,4 +27,3 @@ set(_common_include_directories ${EXECUTORCH_ROOT}/..)

add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/operators)
add_subdirectory(${CMAKE_CURRENT_SOURCE_DIR}/hifi/kernels)
I think this is fixed. Mind rebasing again?
Done. Thank you
@register_node_visitor
class Cast(NodeVisitor):
What was the reason to remove it?
Pardon for the confusion. We changed this file to op_to.py.
The reason is that we need to convert the torch to_copy op into two different backend ops, cast or convert, depending on the dtypes involved. Therefore we made a new file for it.
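
For readers following along, a rough sketch of that conditional lowering under stated assumptions: the builder callables (build_cast_op, build_convert_op) and the dtype check are placeholders for illustration, not the actual QNN visitor API.

```python
import torch

# Dtypes treated as "already quantized" in this sketch (illustrative set).
QUANT_DTYPES = {torch.int8, torch.uint8, torch.int16, torch.int32}

def lower_to_copy(node, build_cast_op, build_convert_op):
    # Dispatch a single aten._to_copy node to one of two backend ops.
    in_dtype = node.args[0].meta["val"].dtype   # dtype of the incoming tensor
    out_dtype = node.meta["val"].dtype          # dtype produced by _to_copy
    if in_dtype in QUANT_DTYPES and out_dtype in QUANT_DTYPES:
        # Both sides quantized: re-encode with a Convert-style backend op.
        return build_convert_op(node)
    # Otherwise it is a plain dtype change: emit a Cast-style backend op.
    return build_cast_op(node)
```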
call_delegate = [
    node
    for node in graph_module.graph.nodes
    if node.op == "call_function" and node.name == "executorch_call_delegate"
]
Did you expect to run this pass after to_backend? Was it because of some issues you ran into?
Thank you for pointing it out.
Yes, we intentionally invoke this pass after to_backend to change the IO data type to a non-FP type.
I'm trying to understand the idea. It seems like FoldQDQ will remove all the q/dq nodes, and the insert-IO-QDQ pass will insert q/dq nodes for the subgraph inside the QNN backend again. In that case, why would the IO data type become a non-FP type?
Right, the flow you mentioned above is exactly what we did before. Yet in LLAMA we specifically deal with its KV IOs: quantizing and dequantizing are meaningless for these IOs, so we keep them in their quantized type all the time.
Here is what we do for this kind of tensor:
1. Tag IO nodes as a quantized type in examples/qualcomm/llama2/llama.py:340
2. Skip inserting quantize/dequantize nodes for them in insert_io_qdq
3. Set the correct quantized type on their spec in examples/qualcomm/llama2/llama.py:316
Feel free to let me know if anything is unclear to you. Thank you. :D
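
For illustration, a condensed sketch of steps 1 and 2; the metadata key, dtype, and helper names below are assumptions made for this example, the real implementation lives in the files referenced above.

```python
import torch

QUANT_IO_KEY = "quant_io_dtype"  # illustrative metadata key, not the real one

def tag_kv_io(graph_module: torch.fx.GraphModule, is_kv_io) -> None:
    # Step 1: mark KV-cache IO nodes so later passes know they must stay
    # in their quantized dtype instead of getting fp32 Q/DQ wrappers.
    for node in graph_module.graph.nodes:
        if node.op in ("placeholder", "output") and is_kv_io(node):
            node.meta[QUANT_IO_KEY] = torch.uint8

def needs_io_qdq(node) -> bool:
    # Step 2: the insert-IO-QDQ pass consults this check and skips tagged
    # nodes, so the KV tensors flow in and out as quantized data (step 3
    # then sets the matching dtype on the tensor spec).
    return QUANT_IO_KEY not in node.meta
```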
Thank you for the prompt reply! Really appreciate it. :D
Yes, we are still using capture_pre_autograd_graph. Here is the change we tried and the error it hits:

diff --git a/backends/qualcomm/tests/utils.py b/backends/qualcomm/tests/utils.py
index 295033e5..0e615dee 100644
--- a/backends/qualcomm/tests/utils.py
+++ b/backends/qualcomm/tests/utils.py
@@ -230,7 +230,7 @@ class TestQNN(unittest.TestCase):
custom_quant_annotations: Tuple[Callable] = (),
quant_dtype: QuantDtype = QuantDtype.use_8a8w,
) -> torch.fx.GraphModule:
- m = torch._export.capture_pre_autograd_graph(module, inputs)
+ m = torch.export.export(module, inputs).module()
quantizer = QnnQuantizer()
quantizer.add_custom_quant_annotations(custom_quant_annotations)
======================================================================
ERROR: test_qnn_backend_pixel_unshuffle_math_equivalent (__main__.TestQNNQuantizedModel)
----------------------------------------------------------------------
Traceback (most recent call last):
File "${executorch}/backends/qualcomm/tests/test_qnn_delegate.py", line 1107, in test_qnn_backend_pixel_unshuffle_math_equivalent
module = self.get_qdq_module(module, sample_input)
File "${executorch}/backends/qualcomm/tests/utils.py", line 253, in get_qdq_module
prepared(*inputs)
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 738, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 316, in __call__
raise e
File "${python3.10}/site-packages/torch/fx/graph_module.py", line 303, in __call__
return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1657, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "${python3.10}/site-packages/torch/nn/modules/module.py", line 1675, in _call_impl
return forward_call(*args, **kwargs)
File "<eval_with_key>.35", line 11, in forward
_unsafe_view = torch.ops.aten._unsafe_view.default(activation_post_process_2, [2, 8, 3, 3]); activation_post_process_2 = None
File "${python3.10}/site-packages/torch/_ops.py", line 670, in __call__
return self_._op(*args, **kwargs)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
----------------------------------------------------------------------
Ran 1 test in 0.111s
Should be possible. We could link HTP lib to
636ddf0 to 14859bf (Compare; the commit message repeats the change list above)
Actually it has been on our TODO list for a while... 😢
Another note: I noticed there are many passes running before to_backend for recomposing ops. There is an experimental API that might be worth trying for this.
Thank you for pointing it out! We will find some time to try this function in our lowering flow.
Sure, and no pressure at all. It's an experimental API and there is no plan to replace the existing API with it. Just wanted to share in case it's helpful (as we're also looking for feedback to improve the devx for backend authors).
We can give it a shot. If it's problematic, we can work with the Compiler team to figure out a better solution.
From what I understand, every shard is a QNN context binary, and each one takes almost all of the memory on the DSP. Among the 4 shards, we'd need to load and unload for every inference. In the meantime, we'll use the spill buffer to optimize memory among shards. If we use the custom op to shard the graph, I imagine we'd need to destroy the previous context binary and load the next one?
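
For illustration, the per-inference shard cycling described above could look roughly like the sketch below; load_context, run_graph, and free_context are hypothetical stand-ins for whatever the runtime actually exposes, not real QNN HTP APIs.

```python
from typing import Any, Callable, Iterable

def run_sharded_inference(
    shard_binaries: Iterable[bytes],
    inputs: Any,
    load_context: Callable[[bytes], Any],   # placeholder: loads one context binary onto the DSP
    run_graph: Callable[[Any, Any], Any],   # placeholder: executes the graph in a loaded context
    free_context: Callable[[Any], None],    # placeholder: releases the context's DSP memory
) -> Any:
    # Each shard is its own QNN context binary and nearly fills DSP memory,
    # so every inference cycles through load -> run -> free, shard by shard.
    activations = inputs
    for binary in shard_binaries:
        ctx = load_context(binary)
        activations = run_graph(ctx, activations)
        free_context(ctx)
    return activations
```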