[Draft] Qualcomm AI Engine Direct - Support kv_cached llama2 model #2966

Conversation
Looks great! I think there are many optimizations we can leverage in this PR. Just wondering if we can decouple some minor changes while getting the end-to-end flow working.
Also a reminder that we probably need to cherry-pick.
exir_ops.edge.aten.index.Tensor,
exir_ops.edge.aten.index_put.default,
Maybe these two lines can be landed separately? They're not supported anyway, and we can decouple them from this large PR.
I think we still need to have it. We need aten.index.Tensor for slicing kv_cache / attention_mask and feeding them to each individual LlamaAttention layer. aten.index_put is used to update kv_mask after each inference finishes.
These two operators are not supported by the Qualcomm backend yet, so we need the partitioner to identify them and make them fall back to CPU.
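To make the fallback idea concrete, here is a minimal sketch (hypothetical helper and set names, not the actual Qualcomm partitioner code) of how ops kept in a not-supported set get skipped during delegation so they run on the portable CPU kernels:

```python
# Hypothetical sketch: ops listed here are skipped by the partitioner
# and therefore fall back to the portable CPU kernels instead of QNN.
from executorch.exir.dialects._ops import ops as exir_ops

OPS_KEPT_ON_CPU = {
    exir_ops.edge.aten.index.Tensor,
    exir_ops.edge.aten.index_put.default,
}

def is_node_supported(node) -> bool:
    """Return False for ops the Qualcomm backend cannot lower yet."""
    if node.op != "call_function":
        return False
    return node.target not in OPS_KEPT_ON_CPU
```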
Ah yes, I just feel like this PR needs more work before it can be merged, while these two lines can be merged now...
I see, we'll follow your suggestion to split PRs.
backends/qualcomm/quantizer/utils.py
 input_qspec_map = {}
 input_act0 = node.args[0]
-if isinstance(input_act0, Node):
+if isinstance(input_act0, Node) and input_act0.meta["val"].dtype == torch.float32:
We can have a separate PR for this line too. #2957 has a few more checks that we may need.
Yes, this looks like a common issue.
@@ -71,6 +71,7 @@ if [ "$BUILD_AARCH64" = true ]; then
     -DCMAKE_INSTALL_PREFIX=$BUILD_ROOT \
     -DEXECUTORCH_BUILD_QNN=ON \
     -DEXECUTORCH_BUILD_SDK=ON \
+    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
Same here, a separate PR.
This is related to qnn_llama_runner; I think we still need to land it together.
Hmm, why is it related to qnn_llama_runner?
Our custom runner (llama2/runner/runner) invokes methods inside extension/module/module.cpp. We could get rid of it if required.
@@ -29,7 +29,7 @@ def __init__(self):
         super().__init__()

     def forward(self, x):
-        return 10.0 + x
+        return 10 + x
What's this for? Is it a different test case?
It's a minor fix for checking whether integer-type addition works. The constant is changed to an integer to match the test name AddConstantLong.
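For reference, a minimal sketch of what the test module presumably looks like after the change (assuming the module is named after the AddConstantLong test):

```python
import torch

class AddConstantLong(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Integer literal instead of 10.0, so the addition exercises
        # the integer (long) path that the test name refers to.
        return 10 + x
```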
# For shared buffer, the user must pass the memory address
# allocated via RPC memory to the executor runner. Therefore, we
# don't want it to be pre-allocated by the memory manager at runtime.
This seems like an optimization opportunity that we can add later? For the alpha release, maybe let's get a functional version first and improve it step by step. What do you think?
Shared buffer support was already landed in #2531. The shared buffer mechanism is enabled for the rest of the examples, but not for llama yet (we'll add it in the near future; it needs changes in qnn_llama_runner).
For now, we still need this refactored code to keep the other examples functional as usual.
ManagedTensor& managed_atten_mask,
ManagedTensor& managed_k_cache,
ManagedTensor& managed_v_cache,
ManagedTensor& managed_kv_mask,
It seems to me the main change is the model input; is that correct?
Yes, to make kv_cache have a static shape and keep the code compact.
output_k_cache.append(
    k.view(self.max_batch_size, self.max_seq_len, self.dim)
)
output_v_cache.append(
    v.view(self.max_batch_size, self.max_seq_len, self.dim)
)
Aren't we ending up with a dynamic shape for output_k_cache here?
No, we're concatenating the kv_cache computed from all attention layers:
- Input shape of k_cache: (max_batch_size, n_layers, max_seq_len, embedding_dim)
- Slice for each attention layer: (max_batch_size, max_seq_len, embedding_dim)
We can concatenate them back without any shape change at runtime.
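As a rough illustration of the shape bookkeeping (made-up dimensions and names, not the actual model code): every per-layer slice has the same static shape, so stacking the per-layer outputs back reproduces the input cache shape with no dynamic shapes involved.

```python
import torch

max_batch_size, n_layers, max_seq_len, dim = 1, 12, 128, 768

# Graph input: one static-shaped cache covering all layers.
k_cache = torch.zeros(max_batch_size, n_layers, max_seq_len, dim)

output_k_cache = []
for layer in range(n_layers):
    # Slice for one attention layer: (max_batch_size, max_seq_len, dim).
    k = k_cache[:, layer]
    # ... the attention layer would update k here ...
    output_k_cache.append(k.view(max_batch_size, max_seq_len, dim))

# Stacking the per-layer outputs restores the same static shape
# as the input cache.
new_k_cache = torch.stack(output_k_cache, dim=1)
assert new_k_cache.shape == k_cache.shape
```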
Hmm, I probably need to check the graph. I just feel like the output_v_cache shape is changing while iterating over the layers, because we keep appending to it.
The list appends are traced away by torch.export. The output kv_caches from each attention layer are connected directly to the final concat operator in LlamaModel.forward.
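A tiny, self-contained sketch (not the PR's model) of why the appends disappear: torch.export unrolls the Python loop, so only the tensor ops and the final cat show up as graph nodes.

```python
import torch
from torch.export import export

class ToyConcat(torch.nn.Module):
    def forward(self, caches):  # caches: (n_layers, seq, dim)
        outs = []
        for i in range(caches.shape[0]):  # static size, so the loop unrolls
            outs.append(caches[i] + 1.0)
        return torch.cat(outs, dim=0)

ep = export(ToyConcat(), (torch.zeros(4, 8, 16),))
# The printed graph contains select/add/cat nodes with static shapes;
# the Python list and its append calls leave no trace.
print(ep.graph)
```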
Also curious what visualization tool you're using.
I think we use FxGraphDrawer to visualize the graph module (see executorch/backends/qualcomm/utils/utils.py, line 136, at d761f99).
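For anyone following along, a minimal sketch of dumping a graph module with FxGraphDrawer (requires pydot and graphviz; the helper and file names here are just examples):

```python
from torch.fx.passes.graph_drawer import FxGraphDrawer

def dump_graph(graph_module, name="llama", path="llama_graph.svg"):
    drawer = FxGraphDrawer(graph_module, name)
    # get_dot_graph() returns a pydot graph; write it out as an SVG.
    drawer.get_dot_graph().write_svg(path)
```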
I'm trying to repro on my side. What QNN library version did you use? The error message on my side is:
I use QNN 2.20 and can reproduce on SM8475 on my side.
I was using QNN 2.19 and just switched to 2.20. I'm using SM8450 on my side.
I was able to repro the fp version on my side, but for the 8a8w version I hit a model loading error. Is it the same issue you observe on your side?
No. For 8a8w, we could get the compiled graph, which is the same as the one in fp16.
Turns out I forgot the -ptq flag... I can repro both fp and 8a8w now. What does the performance look like on your side? From the log output, it seems like 1-2 toks/s for fp and 0.6 toks/s. Did I miss something?
Great! We can start to align on each other's results.
Results:
For FP16:
For 8a8w:
2~3 toks/s for 8a8w still seems really slow. Do we know which part is causing the perf regression? Does the delegated part run reasonably fast while the CPU part is too slow?
Force-pushed from 21baa73 to 3b6af64. Summary: support static kv_cached llama2 model, add qnn_llama_runner, add e2e example script (verified with stories110M).
Hi @cccclai, Results[0]:
Dear @shewu-quic @cccclai, does PR 3196 resolve issue #2590? If so, I will close the issue. Thank you in advance!
Thanks for the update and for sending the fix! Feel free to mark it as resolved and re-open if anyone runs into the same issue again.
Note that this branch is for an example; llama2 cannot work from this branch. What we did to optimize performance on HTP is listed below:
1. One multi-head attention is transformed into multiple single-head attentions.
2. KV-cache is changed to graph I/O. The update is performed in qnn_llama_runner.cpp on the CPU.
3. llama2 is partitioned into 6 pte files in examples/qualcomm/llama2/composite_llama.py.
4. Embedding is quantized. This might need further investigation, e.g., whether we can move it out of the model onto the CPU, etc.
5. u16 and u8 mixed-precision quantization is supported.
6. KV-cache is left in quantized format in graph I/O.
7. RMSNorm is tweaked a bit to reduce quantization sensitivity.
8. The HTP Spill-Fill buffer feature is used among pte files.
9. All Linear layers are converted to Conv2d (see the sketch below).
10. quant_min and quant_max in the Observers are properly set to offset=128 for symmetric quantization.
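As a rough illustration of item 9, here is a minimal sketch (not the PR's actual pass) of swapping an nn.Linear for an equivalent 1x1 nn.Conv2d, under the assumption that the feature dimension is mapped to channels:

```python
import torch

def linear_to_conv2d(linear: torch.nn.Linear) -> torch.nn.Conv2d:
    """Build a 1x1 Conv2d that computes the same function as `linear`."""
    conv = torch.nn.Conv2d(
        in_channels=linear.in_features,
        out_channels=linear.out_features,
        kernel_size=1,
        bias=linear.bias is not None,
    )
    # A Linear weight of shape (out, in) becomes an (out, in, 1, 1) kernel.
    conv.weight.data.copy_(linear.weight.data.view(*linear.weight.shape, 1, 1))
    if linear.bias is not None:
        conv.bias.data.copy_(linear.bias.data)
    return conv

# Quick equivalence check: treat features as channels, (B, in) -> (B, in, 1, 1).
lin = torch.nn.Linear(16, 32)
x = torch.randn(4, 16)
y_conv = linear_to_conv2d(lin)(x.view(4, 16, 1, 1)).view(4, 32)
assert torch.allclose(lin(x), y_conv, atol=1e-5)
```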
Rebased as #3656.
Please see #4142 instead.
Summary
Notes
Compiled graph
For now, we will fall back the following ops, which are for reading and updating the attention mask:
Prepare model
Download and export stories110M model
Run e2e example script