[Draft] Qualcomm AI Engine Direct - Enable story llama model in quantized and fp #4030
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4030
Note: links to docs will display an error until the docs builds have been completed. ❌ 7 new failures as of commit e68e225 with merge base de300e0.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@shewu-quic great job! Does it support llama2 7b?
Unfortunately, it does not support llama2 7b in this draft, but we are actively working on enabling llama2 7b.
This is great :) I have some questions and would like to understand the motivation behind the changes. Thanks in advance!
@@ -266,6 +277,12 @@ class OpResizeNearestNeighbor:
    param_half_pixel_centers: str = "half_pixel_centers"


@dataclass(init=False, frozen=True)
Is it used for index_put?
Yes, I chose the QNN ScatterND op to implement index_put for the llama use case, because I couldn't find a way to generate the index tensor required by the QNN ScatterElements op.
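For illustration, here is a minimal sketch of the KV-cache write that shows up as index_put in the exported llama graph, alongside an equivalent scatter-style formulation of the kind ScatterND expresses (write updates into a tensor at integer indices). The shapes and names are assumptions for the example, not code from this PR.

import torch

# KV-cache update seen as index_put / advanced indexing in the graph.
cache = torch.zeros(2, 8, 16, 4)   # (batch, heads, max_seq_len, head_dim)
new_kv = torch.randn(2, 8, 1, 4)   # key/value for the current position
pos = torch.tensor([5])            # current decode position

cache_a = cache.clone()
cache_a[:, :, pos] = new_kv

# Scatter-style form: same result, with the index broadcast along all dims.
cache_b = cache.clone().scatter_(
    2, pos.view(1, 1, 1, 1).expand(2, 8, 1, 4), new_kv
)
assert torch.equal(cache_a, cache_b)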
class FuseConsecutiveTranspose(ExportPass):
    """
    This pass fuses consecutive transpose / permute into one to reduce runtime
I notice that the view_copy node before/after linear stays there; is there any specific reason we keep them?
I think we need it because keep_dims is not supported for the linear op in QNN HTP.
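As an aside, the fusion idea behind FuseConsecutiveTranspose can be illustrated with plain tensors: two back-to-back permutes compose into a single permute whose order is the first order indexed by the second. This is only a sketch of the composition rule, not the pass itself.

import torch

def compose_permutes(first, second):
    # order of the single permute equivalent to permute(first) then permute(second)
    return [first[i] for i in second]

x = torch.randn(2, 3, 4, 5)
p1, p2 = [0, 2, 1, 3], [0, 1, 3, 2]
assert torch.equal(x.permute(p1).permute(p2), x.permute(compose_permutes(p1, p2)))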
@@ -248,12 +248,12 @@ Error Runner::generate(
      "Sequence length exceeded - please increase the seq_len value passed to generate()");

  // start the main loop
- int64_t pos = 0; // position in the sequence
+ int32_t pos = 0; // position in the sequence
Any specific reason we cast from int64_t to int32_t?
Because int64 is not well supported in QNN HTP, for example for the index tensor of the ScatterND op.
@@ -107,6 +108,47 @@ def forward(
    return y.transpose(1, 2).contiguous().view(bsz, seqlen, self.dim)


class SDPAQNN(torch.nn.Module):
We can rename it to something else; it's another SDPA replacement, not necessarily QNN-specific.
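For context, such a replacement typically decomposes scaled dot-product attention into explicit matmul, mask-add, and softmax ops that a backend can quantize and delegate individually. A minimal sketch of that pattern (assumed shapes; not the SDPAQNN code in this PR):

import math
import torch

class DecomposedSDPA(torch.nn.Module):
    def forward(self, q, k, v, mask):
        # q, k, v: (batch, heads, seq_len, head_dim); mask: additive attention mask
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        attn = torch.softmax(scores + mask, dim=-1)
        return attn @ v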
Sounds great:)
def replace_causal_mask(module: torch.nn.Module):
    for buffer_fqn_name, buffer in module.named_buffers():
        buffer_name = buffer_fqn_name.split(".")[-1]
        if buffer_name == "mask":
            max_seq_len = buffer.shape[-1]
            mask = torch.full(
                (max_seq_len, max_seq_len),
-               float("-inf"),
+               float("-255"),
Any specific reason we replace inf with -255?
Actually, we have a pass that replaces inf with a min or max value, because inf is not friendly for quantization or computation on QNN HTP and could result in numerical error.
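A minimal sketch of that idea, building the causal mask with a large finite negative value instead of -inf (an illustration only, not the actual pass):

import torch

def make_causal_mask(max_seq_len: int, mask_value: float = -255.0) -> torch.Tensor:
    mask = torch.full((max_seq_len, max_seq_len), mask_value)
    # zeros on and below the diagonal, mask_value above it
    return torch.triu(mask, diagonal=1)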
Another challenge we need to conquer is model sharding.
Actually, I have a version that supports model sharding and can share the example code.
Hi @cccclai, the accuracy issue seems to be related to insufficient calibration.
Ah yes, we will use a more generic way to calibrate. I merged this PR (#3756) so that we can use lm_eval to calibrate the model.
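For reference, a pt2e-style calibration step is just a forward pass of representative prompts through the observer-instrumented model before convert_pt2e; the tokenizer API and shapes below are assumptions for illustration, not the code in #3756.

import torch

def calibrate(prepared_model, tokenizer, prompts, max_seq_len=128):
    for prompt in prompts:
        token_ids = tokenizer.encode(prompt)  # hypothetical tokenizer API
        for pos, tid in enumerate(token_ids[:max_seq_len]):
            tokens = torch.tensor([[tid]], dtype=torch.int32)
            input_pos = torch.tensor([pos], dtype=torch.int32)
            prepared_model(tokens, input_pos)  # forward only; observers record ranges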
May I know how you shard the model?
Thanks for the information. Will it be used in export_llama_lib?
Sorry for the delay, I was distracted by the performance review last week... I use the ExecutorBackend and tag every 8 layers; I will publish it soon. I think having a no-op (maybe a custom op instead of clone, because clone can also be expensive) for cutting the model can also be a generic way to shard the model.
This is my current change; I'm still trying to debug an op but it's getting close. I think it's still worth exploring the custom no-op solution to break the graph. What is your preference?
Wow, that makes it clear to me how to run the sharded model at runtime.
I think it is a good idea.

# custom_fallback_op.py
import torch
from torch.library import impl, Library

fallback_op_lib = Library("qnn_llama", "DEF")
fallback_op_lib.define("fallback(Tensor input) -> Tensor")

@impl(fallback_op_lib, "fallback", dispatch_key="CompositeExplicitAutograd")
def fallback_impl(a: torch.Tensor) -> torch.Tensor:
    return a

# registering the out variant.
fallback_op_lib.define(
    "fallback.out(Tensor input, *, Tensor(a!) output) -> Tensor(a!)"
)

# split_graph.py
import torch
from executorch.exir.dialects._ops import ops as exir_ops
from executorch.exir.pass_base import ExportPass, PassResult

class SplitGraph(ExportPass):
    def __init__(self, shares):
        super().__init__()
        self.shares = shares

    def _insert_fallback_op(
        self, graph_module: torch.fx.GraphModule
    ) -> torch.fx.GraphModule:
        for node in graph_module.graph.nodes:
            if "nn_module_stack" in node.meta:
                module_values_list = list(node.meta["nn_module_stack"].values())
                full_qualified_name = module_values_list[-1][0]
                owning_module = module_values_list[-1][1]
                print(f"[Hutton] node: {node}; full_qualified_name: {full_qualified_name}; owning_module: {owning_module}; meta: {node.meta}")
                # if node not in [the node which wants to find]:
                #     continue
                with graph_module.graph.inserting_after(node):
                    users = list(node.users.keys())
                    inserted_node = graph_module.graph.create_node(
                        "call_function",
                        exir_ops.edge.qnn_llama.fallback.default,
                        (node,),
                    )
                    inserted_node.meta["val"] = node.meta["val"]
                    for user in users:
                        user.replace_input_with(node, inserted_node)
        return graph_module

    def call(self, graph_module: torch.fx.GraphModule):
        self._insert_fallback_op(graph_module)
        graph_module.recompile()
        return PassResult(graph_module, True)
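A hypothetical way to wire this pass into the lowering flow (names and call sites are assumptions, not the PR's code): transform the edge program with SplitGraph before partitioning, so each segment between fallback nodes can become its own QNN context.

import torch
from executorch.exir import to_edge

def shard_and_lower(model, example_inputs, num_shards):
    # example_inputs is a tuple of sample inputs for torch.export
    ep = torch.export.export(model, example_inputs)
    edge = to_edge(ep)
    edge = edge.transform([SplitGraph(num_shards)])  # SplitGraph from the snippet above
    # ...then run the QNN partitioner / to_backend on the transformed program.
    return edge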
This is great. I think if we have a custom graph-break op, it doesn't have to be QNN-specific and can be applicable to other flows or backends.
Like, where to insert this custom op inside the graph? I feel like we can find the last node of every 8 layers based on source_fn and the module stack. Is that not working? Another question: I imagine we need to unload the QNN context binary in the graph-break custom op. Is that what you're doing? Also, the patch is pretty much the idea; there is a bug I need to fix before it works properly... I'll send another patch soon.
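A rough sketch of that idea (hypothetical, not the actual patch): record the last node seen for each decoder layer via nn_module_stack and cut at every N-th layer boundary. As noted in the reply below, this can be ambiguous when several nodes map to the same layer.

import re
from collections import OrderedDict

def find_shard_boundaries(graph_module, layers_per_shard=8):
    last_node_of_layer = OrderedDict()
    for node in graph_module.graph.nodes:
        stack = node.meta.get("nn_module_stack")
        if not stack:
            continue
        fqn = list(stack.values())[-1][0]  # e.g. "layers.11.feed_forward"
        match = re.search(r"layers\.(\d+)", fqn)
        if match:
            last_node_of_layer[int(match.group(1))] = node  # keeps the last node seen
    return [
        node
        for layer_idx, node in last_node_of_layer.items()
        if (layer_idx + 1) % layers_per_shard == 0
    ]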
Sounds great.
I originally thought so too, but I found it will get multiple nodes in the same layer.
So maybe I also need stack_trace to identify which node we want. Is it stable?
Do you mean we need to handle the life cycle of the processed binary in the custom op?
Thanks a lot.
This PR is partly based on #4142. We will continue the llama2-7b tasks in this PR.
Hmm, I was thinking whether finding the last add node of the current layer is sufficient, but maybe I'm missing something.
Yeah, that's my understanding too. However, for 4 shards we need to init(shard_1) -> destroy(shard_1) -> init(shard_2) -> destroy(shard_2) -> ...; if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy(shard_1) -> ... -> destroy(shard_4), will it OOM on the DSP?
I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.
Do you mean you were able to use multi-context for the 7b model 😮 To my understanding, multi-context means multiple graphs in the QNN context binary. How does it work with 4 shards (4 sets of graphs) in this case?
It works with the multiple-pte case. If we want to enable multi-context, we just need to set the right group handle for each pte, which is the first context handle; we use a static variable to accomplish this. And we need to set max_sf_buf_size, which is the size of the blob, at AOT.
I was checking the doc.
To my understanding, spill-fill is used for intermediate tensors among the splits, like split_1 -> output (in spill-fill) -> split_2. It's for input/output such as activations, but I'm not sure if it does any optimization for weights. Did I miss anything?
According to your description, it should be the shared buffer (zero copy), which eliminates data copies between multiple ptes on the CPU and the HTP accelerator; it's for the input/output of the graph. Spill-fill buffer sharing is an optimization that allocates one buffer shared by all the contexts of an LLM, so we do not need to allocate space for each of the graphs.
That's my understanding too, and I thought it was for re-using the input/output across all splits in VTCM, but not for weights across all splits. Like ...act_1 -> split_1 -> act_2 -> split_2 -> act_3 -> split_4...; here act_1, act_2 and act_3 will share the same buffer, also known as the spill tensor buffer.
Hey, I probably need some help to fix a matmul validation error - it causes a graph break but I'm not sure what the issue is. It only shows up after I apply the model sharding patch, but the graph inside the qnn_partitioner is supposed to be the same as the first layer of the graph. I debugged inside.
For both the op validation success and failure cases, the input nodes are exactly the same. The first
I feel we are misaligned on some terms.

Shared buffer (zero copy): the purpose is to avoid data copies between CPU and HTP. In addition, we can create a bigger RPC memory to store act_1, act_2, ... etc. We have implemented this in our llama2: it creates one RPC memory to store all inputs and outputs and just sets the correct offset for each I/O tensor.

Spill-fill buffer: VTCM space on each SoC is limited, so when we need to make space within this region, the data is copied back to DDR (the spill-fill buffer in this case). Therefore, we allocate one spill-fill buffer for the intermediate tensors in a graph (split).

VTCM: a hardware resource which provides fast stores and loads. It is controlled by HTP, and we can only set the maximum VTCM usage.

So back to your example: act_1, act_2 and act_3 (I/O tensors) will share the same buffer, which is RPC memory rather than the spill tensor buffer. The intermediate tensors in each graph (split) will use a spill-fill buffer.

...act_1 (rpc_mem) -> split_1_1 -> intermediate tensor_0 ... (spill-fill buffer) -> split_1_2 -> act_2 (rpc_mem) -> split_2 (spill-fill buffer) -> act_3 (rpc_mem) -> split_4 (spill-fill buffer)...
May I know which version of QNN you are using?
If you use quantization, I think the problem is a missing quant attr or something wrong with the quant parameters in the node's meta. Could you help check it?
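A quick way to eyeball this is to dump the node meta for the failing ops (a sketch; the "quant_attrs" meta key is an assumption about how the QNN backend stores quantization parameters):

def dump_quant_attrs(graph_module, op_substring="matmul"):
    for node in graph_module.graph.nodes:
        if node.op == "call_function" and op_substring in str(node.target):
            print(node.name, node.meta.get("quant_attrs"))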
I'm using QNN 2.23, and the matmul node metadata is:
Fail case:
They look very similar... how would you debug next?
Hmm, I think llama_main has the same error. I tried it a bit more, and it looks like the dummy llama params work, but the stories params don't. Can it be related to the checkpoint data type?
The error usually means invalid arguments in the input or output tensors when calling
Hmm, can I confirm with you the latency numbers for stories and llama2 in the latest commit? I just would like to make sure we start from the same place.
It seems Hutton also pushed the annotation of the 16a8w matmul... so we might see stories llama at 600 tokens/second and llama2 at 15 tokens/second, if I remember correctly... What numbers did you see?
As a note, this is the patch we apply for group query attention support, if we can lower
Thanks a lot for the patch. We will try it and focus on enabling llama3.
Actually, do you mind dropping the stories pte file here? I realize there is something off and I want to double-check the performance number.
Oh, also I have it combined with [the other matmul annotation](https://github.com/pytorch/executorch/blob/faeeca8ec9040ae2db23973139c1b5f71ea51d4c/examples/qualcomm/llama2/llama.py#L59), as the cat annotation seems off.
Sorry about that; there seems to be an accuracy issue for stories llama in fp and quantized mode when I rebuild the runner and backend lib in this PR. Performance:
Results:
Oops, I got it. It seems that the tokenizer has some changes and I need to regenerate tokenizer.bin.
Results:
Would you like to check 16a4w?
Yeah
I'm getting the following performance with your .pte file. The command line is:
#4355 removes the
I think there should be no impact, as the final graph should be the same. I checked our CI and it doesn't show a regression.
That was just my thought when seeing their PR. After seeing the comment in the original PR, I think it should be good now. My current device is a OnePlus 12 with 16GB of RAM. What is your command to build the llama runner?
Got it.
Hey. Next, I will create three PRs to enable llama.
Do you have any concerns about this plan? Thanks a lot.
I encountered this issue on the OnePlus 12 with the SM8650 chip. I have updated the OS but the problem is still there. Do you have other suggestions, @chiwwang? My QNN version is 2.22.6.240515.
@leigao97, if you saw the exact
Then it's related to the system. It's hard to do anything on the application side. I can only suggest
Summary:
- Fully delegate the meta llama model in fp and quantized mode
- Add simple calibration
- Use a custom fallback op to split the graph
- Add a model sharding argument
- Add the spill-fill feature

Note that if you want to run llama 7b, you need to specify num_sharding due to memory limitations on the device. It is also recommended to reboot the device before running to ensure that it has enough memory.
But it will result in the embedding op falling back; if pos_ids is changed to int32, it will be fully delegated.
- Support GQA, repeating kv caches (see the sketch below)
- Support the Tiktoken tokenizer for llm/export/builder.py
- Support the --embedding-quantize option for the Qualcomm lowering flow
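For reference, "repeating kv caches" for GQA usually means expanding each KV head across its query group. A common formulation is sketched below, assuming a (batch, n_kv_heads, seq_len, head_dim) layout; this is an illustration, not the patch itself.

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, n_kv_heads, seq_len, head_dim) -> (batch, n_kv_heads * n_rep, seq_len, head_dim)
    if n_rep == 1:
        return x
    bsz, n_kv_heads, seqlen, head_dim = x.shape
    return (
        x[:, :, None, :, :]
        .expand(bsz, n_kv_heads, n_rep, seqlen, head_dim)
        .reshape(bsz, n_kv_heads * n_rep, seqlen, head_dim)
    )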
Hi @cccclai,

@@ -25,6 +25,8 @@ install_executorch_and_backend_lib() {
     -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
     -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
     -DEXECUTORCH_BUILD_XNNPACK=ON \
+    -DEXECUTORCH_BUILD_QNN=ON \
+    -DQNN_SDK_ROOT=$QNN_SDK_ROOT \
     -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
     -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
     -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \

@@ -39,18 +41,20 @@ build_llama_runner() {
     ANDROID_ABI=arm64-v8a
     cmake -DBUCK2="${BUCK2}" \
         -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK"/build/cmake/android.toolchain.cmake \
+        -DEXECUTORCH_USE_TIKTOKEN=ON \
         -DANDROID_ABI="${ANDROID_ABI}" \
         -DANDROID_PLATFORM=android-23 \
         -DCMAKE_INSTALL_PREFIX=cmake-android-out \
         -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
         -DEXECUTORCH_BUILD_XNNPACK=ON \
+        -DEXECUTORCH_BUILD_QNN=ON \
         -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
         -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
         -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
         -Bcmake-android-out/examples/models/llama2 examples/models/llama2
     cmake --build cmake-android-out/examples/models/llama2 -j4 --config Release

About the exporting commands, here is an example for you; you can also add:

python -m examples.models.llama2.export_llama -t ${tokenizer.model} -p ${params.json} -c ${consolidated.00.bf16.pth} --use_kv_cache --qnn --disable_dynamic_shape --num_sharding 1 --pt2e_quantize qnn_16a4w

Thank you
@chiwwang Thank you for the help. Rooting the device works for me.
Summary:
If pos_ids is changed to int32, it will be fully delegated.
There are still accuracy issues for llama 7b in 16a4w, and more complicated quantization algorithms are needed.
Note that if you want to run llama 7b, you need to specify num_sharding due to memory limitations on the device.
It is also recommended to reboot the device before running to ensure that it has enough memory.
Install executorch and backend lib:
Build llama runner:
Export llama in qnn:
Local Results:

llama-7b-chat with 8 splits in 16a4w
story llama in 16a4w

story llama in 8a8w

story llama in fp16
