[Draft] Qualcomm AI Engine Direct - Enable story llama model in quantized and fp #4030


Conversation

shewu-quic
Collaborator

@shewu-quic shewu-quic commented Jun 21, 2024

Summary:

  • Fully delegate the Meta llama model in QNN
  • Add simple calibration
  • Use custom fallback op to split graph
  • Add model sharding argument
  • Add spill-fill feature.
  • Keep int64 input tensors to minimize the changes in this PR, but this causes the embedding op to fall back.
    If pos_ids is changed to int32, the model is fully delegated (see the sketch below).
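
For reference, a minimal sketch of what int32 position inputs could look like (illustrative shapes and names only; the actual example inputs come from the llama export flow):

import torch

# Hypothetical example inputs for a single KV-cache decode step:
# one token id plus one position index.
token = torch.full((1, 1), 0, dtype=torch.int64)    # token ids stay int64 in this PR
pos_ids = torch.full((1, 1), 0, dtype=torch.int32)  # int32 positions avoid the embedding fallback
example_inputs = (token, pos_ids)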

There are still accuracy issues for llama 7b in 16a4w; more sophisticated quantization algorithms are needed.
Note that due to memory limitations on the device, you need to specify num_sharding if you want to run llama 7b.
It is also recommended to reboot the device before running to ensure that it has enough free memory.

Install executorch and backend lib:

cmake \
    -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI="${ANDROID_ABI}" \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=$QNN_SDK_ROOT \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DXNNPACK_ENABLE_ARM_BF16=OFF \
    -Bcmake-android-out .

  cmake --build cmake-android-out -j4 --target install --config Release

Build llama runner:

cmake \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK"/build/cmake/android.toolchain.cmake  \
    -DANDROID_ABI="${ANDROID_ABI}" \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama2 examples/models/llama2

    cmake --build cmake-android-out/examples/models/llama2 -j4 --config Release

Export llama in qnn:

# fp16
python -m examples.models.llama2.export_llama  -t tokenizer.model -p <params.json> \
 -c <checkpoint.pth>  --use_kv_cache  --qnn --disable_dynamic_shape

# 8a8w
python -m examples.models.llama2.export_llama  -t tokenizer.model -p <params.json> \
-c <checkpoint.pth>  --use_kv_cache  --qnn --pt2e_quantize qnn_8a8w

# 16a4w
python -m examples.models.llama2.export_llama  -t tokenizer.model -p <params.json> \
-c <checkpoint.pth>   --use_kv_cache  --qnn --pt2e_quantize qnn_16a4w

# llama 7b 16a4w (recommended)
python -m examples.models.llama2.export_llama  -t tokenizer.model -p <params.json> \
-c <checkpoint.pth>   --use_kv_cache  --qnn --disable_dynamic_shape --num_sharding 8 \
--pt2e_quantize qnn_16a4w

Local Results:
llama-7b-chat with 8 splits in 16a4w
image

story llama in 16a4w
image

story llama in 8a8w
image

story llama in fp16
image


pytorch-bot bot commented Jun 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4030

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit e68e225 with merge base de300e0 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 21, 2024
@xiaoxiaoyuwen

@shewu-quic great job! does it support llama2 7b?

@shewu-quic
Collaborator Author

@shewu-quic great job! does it support llama2 7b?

Unfortunately, it does not support llama2 7b in this draft, but we are actively working on enabling llama2 7b.
We are investigating how to quantize llama2 7b with the QNN quantizer to get reasonable accuracy. Maybe you could take a look at another draft.

Contributor

@cccclai cccclai left a comment

This is great :) I have some questions and would like to understand the motivation behind the changes. Thanks in advance!

@@ -266,6 +277,12 @@ class OpResizeNearestNeighbor:
    param_half_pixel_centers: str = "half_pixel_centers"


@dataclass(init=False, frozen=True)
Contributor

Is it used for index_put?

Collaborator Author

Yes, I chose the QNN ScatterND op to implement index_put for the llama use case, because I don't know how to generate the index tensor with the QNN ScatterElements op.
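
For context, a minimal torch-level sketch of the KV-cache update that index_put expresses and that a ScatterND-style op can implement (shapes and names are illustrative, not taken from this PR):

import torch

# Illustrative KV-cache shapes: (batch, max_seq_len, n_heads, head_dim).
k_cache = torch.zeros(1, 128, 8, 64)
pos = torch.tensor([5])              # write position along the sequence axis
new_k = torch.randn(1, 1, 8, 64)     # key for the current token

# Slice assignment lowers to aten.index_put_, where the index tensor selects
# slices of the cache and the new values are scattered into them -- the same
# pattern a ScatterND-style op expresses.
k_cache[:, pos] = new_k

# Functional form of the same update:
updated = torch.ops.aten.index_put(k_cache, [None, pos], new_k)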


class FuseConsecutiveTranspose(ExportPass):
    """
    This pass fuses consecutive transpose / permute into one to reduce runtime
Contributor

I notice that the view_copy nodes before/after linear stay there; is there any specific reason we keep them?

Collaborator Author

@shewu-quic shewu-quic Jun 24, 2024

I think we need them because keep_dims is not supported for the linear op in QNN HTP.

@@ -248,12 +248,12 @@ Error Runner::generate(
      "Sequence length exceeded - please increase the seq_len value passed to generate()");

  // start the main loop
- int64_t pos = 0; // position in the sequence
+ int32_t pos = 0; // position in the sequence
Contributor

Any specific reason we cast from int64_t to int32_t?

Collaborator Author

Because int64 is not well supported on QNN HTP, e.g., for the index tensor of the ScatterND op.

@@ -107,6 +108,47 @@ def forward(
        return y.transpose(1, 2).contiguous().view(bsz, seqlen, self.dim)


class SDPAQNN(torch.nn.Module):
Contributor

We can rename it to something else; it's another SDPA replacement, not necessarily QNN specific.

Collaborator Author

Sounds great:)

def replace_causal_mask(module: torch.nn.Module):
    for buffer_fqn_name, buffer in module.named_buffers():
        buffer_name = buffer_fqn_name.split(".")[-1]
        if buffer_name == "mask":
            max_seq_len = buffer.shape[-1]
            mask = torch.full(
                (max_seq_len, max_seq_len),
-               float("-inf"),
+               float("-255"),
Contributor

Any specific reason we replace inf with -255?

Collaborator Author

@shewu-quic shewu-quic Jun 24, 2024

Actually, we have a pass to replace inf with the dtype's min or max, because inf is not friendly for quantization or computation on QNN HTP and could result in numerical error.
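
For illustration, a minimal sketch of such a replacement on plain tensors (the actual pass rewrites the exported graph; this standalone helper only assumes the idea described above):

import torch

def replace_inf(t: torch.Tensor) -> torch.Tensor:
    # Map +inf/-inf to the dtype's finite max/min so downstream
    # quantization ranges stay finite (nan_to_num also maps NaN to 0).
    finfo = torch.finfo(t.dtype)
    return torch.nan_to_num(t, posinf=finfo.max, neginf=finfo.min)

mask = torch.full((4, 4), float("-inf")).triu(1)  # causal-style mask with -inf above the diagonal
print(replace_inf(mask))                          # -inf entries become the float32 minimum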

@chiwwang
Contributor

@shewu-quic great job! does it support llama2 7b?

Unfortunately, it does not support llama2 7b in this draft, but we are actively working on enabling llama2 7b. We are investigating how to quantize llama2 7b with the QNN quantizer to get reasonable accuracy. Maybe you could take a look at another draft.

Another challenge we need to conquer is model sharding.

@cccclai
Contributor

cccclai commented Jun 24, 2024

@shewu-quic great job! does it support llama2 7b?

Unfortunately, it does not support llama2 7b in this draft, but we are actively working on enabling llama2 7b. We are investigating how to quantize llama2 7b with the QNN quantizer to get reasonable accuracy. Maybe you could take a look at another draft.

Another challenge we need to conquer is model sharding.

Actually I have a version to support model sharding and can share the example code

@shewu-quic
Collaborator Author

Hi @cccclai,

The accuracy issue seems to be related to insufficient calibration.
May I know whether you have any plan to use more data to calibrate the model?
If I add the following, the quantized model generates reasonable English sentences.

    def calibrate(self, module: torch.fx.GraphModule):
        from sentencepiece import SentencePieceProcessor
        sp_model = SentencePieceProcessor(model_file="tokenizer.model")

        # TODO: change criteria & support batch inputs if necessary
        pos = torch.tensor(0, dtype=torch.int32)
        token_list = [sp_model.bos_id()]
        user_prompts = ["Once", "upon", "a", "time"]
        for prompt in user_prompts:
            token_list += sp_model.encode(prompt)

        def sample_top_p(probs: torch.Tensor, top_p: float) -> torch.Tensor:
            probs_sort, probs_indices = torch.sort(probs, dim=-1, descending=True)
            probs_sum = torch.cumsum(probs_sort, dim=-1)
            mask = probs_sum - probs_sort > top_p
            probs_sort[mask] = 0
            probs_sort /= probs_sort.sum(dim=-1, keepdim=True)
            next_token = torch.multinomial(probs_sort, num_samples=1)
            return probs_indices.gather(dim=-1, index=next_token)

        with torch.no_grad():
            while token_list[-1] != sp_model.eos_id() and pos < 128:
                logits = module(
                    torch.full((1, 1), token_list[pos]),
                    torch.full((1, 1), pos),
                )
                pos += 1
                if pos >= len(token_list):
                    token_list.append(torch.argmax(logits[:, -1], dim=-1).item())
                    # probs = torch.softmax(logits[:, -1] / 0.8, dim=-1)
                    # token_list.append(sample_top_p(probs, 0.9).item())

        print(f"calibration data:\n{sp_model.decode(token_list)}")

....
                m = prepare_pt2e(self.pre_autograd_graph_module, composed_quantizer)
                # Calibrate
                self.calibrate(m)
                # m(*self.example_inputs)
                m = convert_pt2e(m)
....

@cccclai
Contributor

cccclai commented Jun 24, 2024

If I add the following, the quantized model generates reasonable English sentences.

Ah yes, we will use a more generic way to calibrate. I merged this PR (#3756) so that we can use lm_eval to calibrate the model.

@shewu-quic shewu-quic changed the title from "[Draft] Qualcomm AI Engine Direct - Enable llama model in quantized and fp" to "[Draft] Qualcomm AI Engine Direct - Enable story llama model in quantized and fp" on Jun 25, 2024
@shewu-quic
Collaborator Author

Actually I have a version to support model sharding and can share the example code

May I know how you shard the model?
I have three ways of sharding the model, but I think all of them are a bit hardcoded...

  1. Fall back a specified aten_add_tensor op; there seems to be a fixed number of add ops in each layer.
  2. Insert a clone op after the specific layer; in QNN we will fall back the clone op.
  3. Re-write the Transformer.

Ah yes, we will use a more generic way to calibrate. I merged this PR (#3756) so that we can use lm_eval to calibrate the model.

Thanks for the information. Will it be used in export_llama_lib?

@cccclai
Contributor

cccclai commented Jul 1, 2024

shard

Sorry for the delay, I was distracted by the performance review last week... I use the ExecutorBackend and tag every 8 layers; will publish soon. I think having a no-op op (maybe a custom op instead of clone, because clone can also be expensive) for cutting the model could also be a generic way to shard the model.

@cccclai
Contributor

cccclai commented Jul 2, 2024

This is my current change, still trying to debug an op but it's getting close..
model_sharding.patch

This is pretty much the idea
image

I think it is still worth exploring the custom no-op solution to break the graph. What is your preference?

@shewu-quic
Collaborator Author

This is my current change, still trying to debug an op but it's getting close.. model_sharding.patch

This is pretty much the idea image

Wow, that makes it clear to me how to run the sharded model at runtime.
I will try this patch as soon as possible!

I think it is still worth exploring the custom no-op solution to break the graph. What is your preference?

I think it is a good idea.
In fact, I have tried hardcoding the insertion of a custom op in llama_transformer.py and falling it back in the QNN partitioner.
It should work once I implement the custom kernel.
But I have no idea how to insert the custom op generically with a transformation. Do you have any idea?

# custom_fallback_op.py
import torch
from torch.library import impl, Library

fallback_op_lib = Library("qnn_llama", "DEF")

fallback_op_lib.define("fallback(Tensor input) -> Tensor") 


@impl(fallback_op_lib, "fallback", dispatch_key="CompositeExplicitAutograd")
def fallback_impl(a: torch.Tensor) -> torch.Tensor:
    return a


# registering the out variant.
fallback_op_lib.define(
    "fallback.out(Tensor input, *, Tensor(a!) output) -> Tensor(a!)"
)
# split_graph.py
import torch

from executorch.exir.dialects._ops import ops as exir_ops
from executorch.exir.pass_base import ExportPass, PassResult


class SplitGraph(ExportPass):
    def __init__(self, shares):
        super().__init__()
        self.shares = shares

    def _insert_fallback_op(
        self, graph_module: torch.fx.GraphModule
    ) -> torch.fx.GraphModule:
        for node in graph_module.graph.nodes:
            if "nn_module_stack" in node.meta:
                module_values_list = list(node.meta["nn_module_stack"].values())
                full_qualified_name = module_values_list[-1][0]
                owning_module = module_values_list[-1][1]
                print(f"[Hutton] node: {node}; full_qualified_name: {full_qualified_name}; owning_module: {owning_module}; meta: {node.meta}")
            # TODO: only insert the fallback op after the chosen split-point nodes, e.g.:
            # if node not in nodes_to_split_after:
            #     continue
            with graph_module.graph.inserting_after(node):
                users = list(node.users.keys())
                inserted_node = graph_module.graph.create_node(
                    "call_function",
                    exir_ops.edge.qnn_llama.fallback.default,
                    (node,),
                )
                inserted_node.meta["val"] = node.meta["val"]
                for user in users:
                    user.replace_input_with(node, inserted_node)

    def call(self, graph_module: torch.fx.GraphModule):
        self._insert_fallback_op(graph_module)
        graph_module.recompile()
        return PassResult(graph_module, True)

@cccclai
Contributor

cccclai commented Jul 3, 2024

This is great. I think if we have a custom graph-break op, it doesn't have to be QNN specific and can be applicable to other flows or backends.

But I have no idea how to insert the custom op generically with a transformation. Do you have any idea?

Like where to insert this custom op inside the graph? I feel like we can find the last node of every 8 layers based on source_fn and the module stack. Is that not working?

Another question: I imagine we need to unload the QNN context binary in the graph-break custom op. Is that what you're doing?

Also the patch is pretty much the idea. There is a bug I need to fix before it's working properly...I'll send another patch soon

@shewu-quic
Collaborator Author

This is great. I think if we have a custom graph-break op, it doesn't have to be QNN specific and can be applicable to other flows or backends.

Sounds great.

But I have no idea how to insert the custom op generically with a transformation. Do you have any idea?

Like where to insert this custom op inside the graph? I feel like we can find the last node of every 8 layers based on source_fn and the module stack. Is that not working?

I originally thought so too, but I found it returns multiple nodes in the same layer.
The last node of the layer is an add node. However, you can find #L466 and #L470, which have the same source_fn and module stack.

def forward(self, x, freqs_cos, freqs_sin, input_pos=None): # x: 1xN

So maybe I also need stack_trace to identify which node we want. Is it stable?

Another question: I imagine we need to unload the QNN context binary in the graph-break custom op. Is that what you're doing?

Do you mean we need to handle the life cycle of the processed blob in the custom op?
Originally, we load the QNN context binary in the init function of QnnBackend and unload it in the destroy function of QnnBackend. So the life cycle of the QNN context binary is decided by the processed blob, which is kept by the ExecuTorch runtime framework. Is this understanding correct?

Also the patch is pretty much the idea. There is a bug I need to fix before it's working properly...I'll send another patch soon

Thanks a lot.

@chiwwang
Contributor

chiwwang commented Jul 3, 2024

This PR is somewhat based on #4142.

We will continue the llama2-7b tasks in this PR.

@cccclai
Contributor

cccclai commented Jul 8, 2024

The last node of the layer is an add node. However, you can find #L466 and #L470, which have the same source_fn and module stack. So maybe I also need stack_trace to identify which node we want. Is it stable?

Hmm, I was thinking whether finding the last add node for the current layer is sufficient, but maybe I am missing something. Combining stack_trace also sounds reasonable.
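
As an illustration of this idea, a hypothetical helper that walks the FX graph and keeps the last node seen for each TransformerBlock using the nn_module_stack metadata (the same field visible in the metadata dumps later in this thread; the helper itself is not part of this PR):

import torch

def last_node_per_layer(graph_module: torch.fx.GraphModule):
    # Map each TransformerBlock path (e.g. "layers.0") to the last node whose
    # nn_module_stack ends inside that block; these are candidate split points.
    last = {}
    for node in graph_module.graph.nodes:
        stack = node.meta.get("nn_module_stack")
        if not stack:
            continue
        # nn_module_stack values are (module_path, module_type) pairs.
        for path, module_type in reversed(list(stack.values())):
            if "TransformerBlock" in str(module_type):
                last[path] = node  # later nodes overwrite earlier ones
                break
    return last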

Do you mean we need to handle the life cycle of the processed blob in the custom op?
Originally, we load the QNN context binary in the init function of QnnBackend and unload it in the destroy function of QnnBackend. So the life cycle of the QNN context binary is decided by the processed blob, which is kept by the ExecuTorch runtime framework. Is this understanding correct?

Yeah that's my understanding too. However for 4 shards, we need to init(shard_1) -> destroy (shard_1) -> init(shard_2)-> destroy (shard_2) -> ..., if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy (shard_1) ... -> destroy(shard_4), it will OOM in dsp?

@shewu-quic
Collaborator Author

Yeah that's my understanding too. However for 4 shards, we need to init(shard_1) -> destroy (shard_1) -> init(shard_2)-> destroy (shard_2) -> ..., if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy (shard_1) ... -> destroy(shard_4), it will OOM in dsp?

I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.

@cccclai
Contributor

cccclai commented Jul 8, 2024

Yeah that's my understanding too. However for 4 shards, we need to init(shard_1) -> destroy (shard_1) -> init(shard_2)-> destroy (shard_2) -> ..., if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy (shard_1) ... -> destroy(shard_4), it will OOM in dsp?

I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.

Do you mean you were able to use multi-context for the 7b model 😮 To my understanding, multi-context means multiple graphs in the QNN context binary. How does it work with 4 shards (4 sets of graphs) in this case?

@shewu-quic
Collaborator Author

shewu-quic commented Jul 8, 2024

Yeah that's my understanding too. However for 4 shards, we need to init(shard_1) -> destroy (shard_1) -> init(shard_2)-> destroy (shard_2) -> ..., if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy (shard_1) ... -> destroy(shard_4), it will OOM in dsp?

I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.

Do you mean you were able to use multi-context for the 7b model 😮 To my understanding, multi-context means multiple graphs in the QNN context binary. How does it work with 4 shards (4 sets of graphs) in this case?

It works in the multiple-pte case. If we want to enable multi-context, we just need to set the right group handle for each pte, which is the first context handle. For this purpose, we use a static variable to accomplish it. We also need to set max_sf_buf_size, which is the size of the blob at AOT time.
You can find the details in the QNN doc.

@cccclai
Contributor

cccclai commented Jul 8, 2024

Yeah that's my understanding too. However for 4 shards, we need to init(shard_1) -> destroy (shard_1) -> init(shard_2)-> destroy (shard_2) -> ..., if we do init(shard_1) -> init(shard_2) -> init(shard_3) -> init(shard_4) -> ... -> destroy (shard_1) ... -> destroy(shard_4), it will OOM in dsp?

I think it will not OOM if we use the multi-context feature, because I could run the composite llama on the device.

Do you mean you were able to use multi-context for the 7b model 😮 To my understanding, multi-context means multiple graphs in the QNN context binary. How does it work with 4 shards (4 sets of graphs) in this case?

It works in the multiple-pte case. If we want to enable multi-context, we just need to set the right group handle for each pte, which is the first context handle. For this purpose, we use a static variable to accomplish it. We also need to set max_sf_buf_size, which is the size of the blob at AOT time. You can find the details in the QNN doc.

I was checking the doc

When multiple models are executed in sequence and as a result it is possible to reserve a single spill-fill allocation that could be re-used across all the splits. This has the benefit of reducing RAM usage for the application at negligible performance impact.

To my understanding the spill-fill is used for intermediate tensors among the splits. Like the split_1 -> output (in spill-fill) -> split_2. It's for the input/output like activation, but I'm not sure if it will do any optimization for weights. Did I miss anything?

@shewu-quic
Collaborator Author

shewu-quic commented Jul 8, 2024

To my understanding the spill-fill is used for intermediate tensors among the splits. Like the split_1 -> output (in spill-fill) -> split_2. It's for the input/output like activation, but I'm not sure if it will do any optimization for weights. Did I miss anything?

According to your description, that would be the shared buffer (zero copy), which eliminates data copies between multiple ptes on the CPU and the HTP accelerator. It's for the input/output of the graph.
We have implemented it in ExecuTorch and use it in our llama runner.

Spill-fill buffer sharing is an optimization that allocates one buffer shared by all the contexts of an LLM. This way, we do not need to allocate space for each of the graphs.

@cccclai
Contributor

cccclai commented Jul 9, 2024

Spill-fill buffer sharing is an optimization that allocates one buffer shared by all the contexts of an LLM. This way, we do not need to allocate space for each of the graphs.

That's my understanding too, and I thought it was for re-using the input/output across all splits in VTCM, but not for weights across all splits. Like

..act_1 -> split_1 -> act_2 -> split_2 -> act_3 -> split_4...

here act_1, act_2 and act_3 will share the same buffer, also known as the spill tensor buffer here.

@cccclai
Contributor

cccclai commented Jul 9, 2024

Hey, I probably need some help to fix a matmul validation error - it causes a graph break but I'm not sure what the issue is. It only shows up after I apply the model sharding patch, but the graph inside the qnn_partitioner is supposed to be the same as the first layer of the graph.

I debugged inside op_matmul.py. The input nodes for matmul are exactly the same for the validation success and failure cases. The error is:

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> validateNativeOps aten_matmul_default:qti.aisw:MatMul op validator (quantized and FP16) failed 3110 and 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_matmul_default with error 0xc26

For both the op validation success and failure cases, the input nodes are exactly the same. The first matmul op's input nodes are [aten__softmax_default, aten_permute_copy_default_5] and the second matmul's input nodes are [aten_permute_copy_default_3, aten_permute_copy_default_6]. Am I missing anything?

@shewu-quic
Collaborator Author

shewu-quic commented Jul 9, 2024

That's my understanding too, and I thought it was for re-using the input/output across all splits in VTCM, but not for weights across all splits. Like

..act_1 -> split_1_1 -> intermediate tensor_0 -> split_1_2 -> act_2 -> split_2 -> act_3 -> split_4...

here act_1, act_2 and act_3 will share the same buffer, also known as the spill tensor buffer here.

I feel we are misaligned on some terms.

Shared buffer (Zero copy)

The purpose is to avoid data copies between the CPU and HTP. In addition, we can create a bigger RPC memory region to store act_1, act_2, etc. We have implemented this in our llama2: it creates one RPC memory region to store all inputs and outputs and just sets the correct offset for each I/O tensor.
image

Spill-Fill buffer

VTCM space on each SoC is limited; hence, when we need to make space within this region, the data is copied back to DDR (the spill-fill buffer in this case). Therefore, we allocate one spill-fill buffer for the intermediate tensors in each graph (split).

VTCM

It is a hardware resource which provides fast stores and loads. It is controlled by HTP, and we can only set the maximum VTCM usage.

So back to your example, act_1, act_2 and act_3 (I/O tensors) will share the same buffer, which is RPC memory rather than the spill-fill buffer. The intermediate tensors in each graph (split) use a spill-fill buffer.

...act_1 (rpc_mem) -> split_1_1 -> intermediate tensor_0 ... (spill-fill buffer) -> split_1_2 -> act_2 (rpc_mem) -> split_2 (spill-fill buffer) -> act_3 (rpc_mem) -> split_4 (spill-fill buffer) ...

@shewu-quic
Collaborator Author

Hey, I probably need some help to fix a matmul validation error - it causes a graph break but I'm not sure what the issue is. It only shows up after I apply the model sharding patch, but the graph inside the qnn_partitioner is supposed to be the same as the first layer of the graph.

I debugged inside op_matmul.py. The input nodes for matmul are exactly the same for the validation success and failure cases. The error is:

[ERROR] [Qnn ExecuTorch]: QnnDsp <E> validateNativeOps aten_matmul_default:qti.aisw:MatMul op validator (quantized and FP16) failed 3110 and 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> QnnBackend_validateOpConfig failed 3110
[ERROR] [Qnn ExecuTorch]: QnnDsp <E> Failed to validate op aten_matmul_default with error 0xc26

For both the op validation success and failure cases, the input nodes are exactly the same. The first matmul op's input nodes are [aten__softmax_default, aten_permute_copy_default_5] and the second matmul's input nodes are [aten_permute_copy_default_3, aten_permute_copy_default_6]. Am I missing anything?

May I know which version of QNN you are using?

@shewu-quic
Collaborator Author

If you use quantization, I think the problem is a missing quant attr or something wrong with the quant parameters in the node's meta. Could you help check it?
image
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/HtpOpDefSupplement.html
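
As an example of the kind of check this suggests, a hypothetical helper that lists call_function nodes missing the quant_attrs entry seen in the metadata dumps below (the key name follows those dumps; the helper itself is not part of this PR):

def find_nodes_missing_quant_attrs(graph_module):
    # Nodes without quantization attributes in node.meta are common suspects
    # when HTP op validation fails on a quantized graph.
    return [
        node
        for node in graph_module.graph.nodes
        if node.op == "call_function" and "quant_attrs" not in node.meta
    ]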

@cccclai
Contributor

cccclai commented Jul 9, 2024

I'm using QNN 2.23 and the matmul node metadata is:
Success case:

{'stack_trace': '  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 492, in forward\n    h = layer(\n  File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n    return forward_call(*args, **kwargs)\n  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 429, in forward\n    h = self.attention.forward(\n  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 330, in forward\n    output = self.SDPA(input_pos, q, k, v, bsz, seqlen, self.mask)\n  File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n    return forward_call(*args, **kwargs)\n  File "/data/users/chenlai/executorch/examples/models/llama2/source_transformation/sdpa.py", line 144, in forward\n    attn_weight = q @ k.transpose(-2, -1) * scale_factor\n', 'nn_module_stack': {'L__self__': ('', 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'), 'L__self___layers_0': ('layers.0', 'executorch.examples.models.llama2.llama_transformer.TransformerBlock'), 'L__self___layers_0_attention_SDPA': ('layers.0.attention.SDPA', 'examples.models.llama2.source_transformation.sdpa.SDPAQNN')}, 'torch_fn': ('matmul.default_1', 'OpOverload.matmul.default'), 'source_fn_stack': [('matmul', <built-in function matmul>)], 'original_aten': <OpOverload(op='aten.view', overload='default')>, 'from_node': [('matmul', <built-in function matmul>), ('view_17', <OpOverload(op='aten.view', overload='default')>), ('view_copy_17', <OpOverload(op='aten.view_copy', overload='default')>), ('aten_matmul_default', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 8, 1, 128)), 'tensor_meta': None, 'debug_handle': 239, 'quant_attrs': {'scale': 3.819849371211603e-05, 'zero_point': 18610, 'quant_min': 0, 'quant_max': 65535, 'dtype': torch.int32, 'encoding': <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor}}

Failure case:

{'stack_trace': '  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 492, in forward\n    h = layer(\n  File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n    return forward_call(*args, **kwargs)\n  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 429, in forward\n    h = self.attention.forward(\n  File "/data/users/chenlai/executorch/examples/models/llama2/llama_transformer.py", line 330, in forward\n    output = self.SDPA(input_pos, q, k, v, bsz, seqlen, self.mask)\n  File "/home/chenlai/local/miniconda3/envs/executorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl\n    return forward_call(*args, **kwargs)\n  File "/data/users/chenlai/executorch/examples/models/llama2/source_transformation/sdpa.py", line 144, in forward\n    attn_weight = q @ k.transpose(-2, -1) * scale_factor\n', 'nn_module_stack': {'L__self__': ('', 'torch.fx.graph_module.GraphModule.__new__.<locals>.GraphModuleImpl'), 'L__self___layers_0': ('layers.0', 'executorch.examples.models.llama2.llama_transformer.TransformerBlock'), 'L__self___layers_0_attention_SDPA': ('layers.0.attention.SDPA', 'examples.models.llama2.source_transformation.sdpa.SDPAQNN')}, 'torch_fn': ('matmul.default_1', 'OpOverload.matmul.default'), 'source_fn_stack': [('matmul', <built-in function matmul>)], 'original_aten': <OpOverload(op='aten.view', overload='default')>, 'from_node': [('matmul', <built-in function matmul>), ('view_17', <OpOverload(op='aten.view', overload='default')>), ('view_copy_17', <OpOverload(op='aten.view_copy', overload='default')>), ('aten_matmul_default_1', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor), ('quantized_decomposed_quantize_per_tensor_default_54', <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor), ('aten_matmul_default_1', <EdgeOpOverload: aten.matmul.default>: schema = aten::matmul(Tensor self, Tensor other) -> Tensor)], 'seq_nr': -1, 'val': FakeTensor(..., size=(1, 8, 1, 128)), 'tensor_meta': None, 'debug_handle': 239, 'delegation_tag': 'L__self___layers_0_1', 'quant_attrs': {'scale': 3.819849371211603e-05, 'zero_point': 18610, 'quant_min': 0, 'quant_max': 65535, 'dtype': torch.int32, 'encoding': <EdgeOpOverload: quantized_decomposed.quantize_per_tensor.default>: schema = quantized_decomposed::quantize_per_tensor(Tensor input, float scale, int zero_point, int quant_min, int quant_max, ScalarType dtype) -> Tensor}}

They look very similar... how would you debug this next?

@cccclai
Contributor

cccclai commented Jul 24, 2024

Hmm, I think llama_main has the same error. I tried it a bit more, and it looks like the dummy llama params work, but the stories params don't. Could it be related to the checkpoint data type?

@chiwwang
Contributor

The error usually means invalid arguments in the input or output tensors when calling QnnGraph_Execute().
Hmm... if the dummy params work but stories does not, is it possible that something is wrong with the input sizes? E.g., the input tensor is constructed according to the size of the dummy params and is fed into stories llama, or so 🤔

@cccclai
Contributor

cccclai commented Jul 24, 2024

Hmm, can I confirm with you the latency numbers for stories and llama2 in the latest commit? I just want to make sure we start from the same place.

@chiwwang
Contributor

It seems Hutton also pushed the 16a8w matmul annotation... so we might see stories llama at 600 tokens/second and llama2 at 15 tokens/second, if I remember correctly... what numbers did you see?

@cccclai
Contributor

cccclai commented Jul 26, 2024

As a note, this is the patch we apply for grouped-query attention support; if we can lower torch.repeat_interleave directly, that's probably better.
group_query_attention.patch

@shewu-quic
Collaborator Author

As a note, this is the patch we apply for grouped-query attention support; if we can lower torch.repeat_interleave directly, that's probably better. group_query_attention.patch

Thanks a lot for the patch. We will try it and focus on enabling llama3.

@cccclai
Contributor

cccclai commented Jul 26, 2024

Actually do you mind dropping the stories pte file? I realize there is something off and want to double check the performance number.

@cccclai
Contributor

cccclai commented Jul 26, 2024

As a note, this is the patch we apply for grouped-query attention support; if we can lower torch.repeat_interleave directly, that's probably better. group_query_attention.patch

Oh, I also have it combined with the other matmul annotation (https://github.com/pytorch/executorch/blob/faeeca8ec9040ae2db23973139c1b5f71ea51d4c/examples/qualcomm/llama2/llama.py#L59), as the cat annotation seems off.

@shewu-quic
Collaborator Author

shewu-quic commented Jul 26, 2024

Actually do you mind dropping the stories pte file? I realize there is something off and want to double check the performance number.

Sorry about that; there seems to be an accuracy issue for stories llama in fp and quantized mode when I rebuild the runner and backend lib in this PR.
I am investigating what might be wrong.

Performance:

I 00:00:00.420899 executorch:runner.cpp:498]    Prompt Tokens: 5    Generated Tokens: 114
I 00:00:00.420907 executorch:runner.cpp:504]    Model Load Time:                0.216000 (seconds)
I 00:00:00.420917 executorch:runner.cpp:514]    Total inference time:           0.190000 (seconds)               Rate:  600.000000 (tokens/second)
I 00:00:00.420926 executorch:runner.cpp:522]            Prompt evaluation:      0.010000 (seconds)               Rate:  500.000000 (tokens/second)
I 00:00:00.420933 executorch:runner.cpp:533]            Generated 114 tokens:   0.180000 (seconds)               Rate:  633.333333 (tokens/second)
I 00:00:00.420941 executorch:runner.cpp:541]    Time to first generated token:  0.010000 (seconds)
I 00:00:00.420947 executorch:runner.cpp:548]    Sampling time over 119 tokens:  0.007000 (seconds)

Results:

Once upon a timeing) @ K    less zwischengraphabряv mus]"ilponse tempor nou %>](                vhnaction)log previous conent n of(amentoremeinv mus alternil Sho wh)per wh K стgenly cre neueza totalory    ane)    irlsätt)ir Romtetil://лаou hibernate semenderättсе)ir"ans)deindtechoryotла whlyteatingenderätt modified(oryilou hibernateotże whational    lessachonymous cont strategyouättulоlogсе)osa

@shewu-quic
Collaborator Author

shewu-quic commented Jul 29, 2024

Oops, I got it. It seems that the tokenizer has some changes and I need to regenerate tokenizer.bin.
Performance:

I 00:00:00.436471 executorch:runner.cpp:498]    Prompt Tokens: 5    Generated Tokens: 114
I 00:00:00.436480 executorch:runner.cpp:504]    Model Load Time:                0.241000 (seconds)
I 00:00:00.436489 executorch:runner.cpp:514]    Total inference time:           0.181000 (seconds)               Rate:  629.834254 (tokens/second)
I 00:00:00.436497 executorch:runner.cpp:522]            Prompt evaluation:      0.008000 (seconds)               Rate:  625.000000 (tokens/second)
I 00:00:00.436504 executorch:runner.cpp:533]            Generated 114 tokens:   0.173000 (seconds)               Rate:  658.959538 (tokens/second)
I 00:00:00.436513 executorch:runner.cpp:541]    Time to first generated token:  0.008000 (seconds)
I 00:00:00.436522 executorch:runner.cpp:548]    Sampling time over 119 tokens:  0.005000 (seconds)

Results:

Once upon a time, there was a little girl named Lily. She loved to play outside in the sunshine. One day, she went toast onion inchiefly. She wanted to eat it, but it was too high up. 
Lily asked her mommy, intelligent, "Can you help me get the apple?"
The bird said, "Sure, I can fly up and get it for you."
The bird flew up to the apple and brought it off a little by accidentally dropped the birdie- she said, ju

@shewu-quic
Collaborator Author

Actually do you mind dropping the stories pte file? I realize there is something off and want to double check the performance number.

Would you like to check 16a4w?

@cccclai
Contributor

cccclai commented Jul 29, 2024

Yeah, 16a4w would be great. I'm getting ~200 tokens/s and trying to figure out whether it's an AOT issue or a runtime issue. Also, great job on figuring out the accuracy issue!

@cccclai
Contributor

cccclai commented Jul 29, 2024

I'm getting the following performance with your .pte file. Command line is ./llama_main --model_path stories_16a4w_sm8650_from_qcomm.pte --tokenizer_path llama2_tokenizer.bin...

PyTorchObserver {"prompt_tokens":9,"generated_tokens":118,"model_load_start_ms":1722276410105,"model_load_end_ms":1722276410334,"inference_start_ms":1722276410334,"inference_end_ms":1722276410603,"prompt_eval_end_ms":1722276410357,"first_token_ms":1722276410357,"aggregate_sampling_time_ms":94,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:00.517771 executorch:runner.cpp:509] 	Prompt Tokens: 9    Generated Tokens: 118
I 00:00:00.517775 executorch:runner.cpp:515] 	Model Load Time:		0.229000 (seconds)
I 00:00:00.517780 executorch:runner.cpp:525] 	Total inference time:		0.269000 (seconds)		 Rate: 	438.661710 (tokens/second)
I 00:00:00.517783 executorch:runner.cpp:533] 		Prompt evaluation:	0.023000 (seconds)		 Rate: 	391.304348 (tokens/second)
I 00:00:00.517787 executorch:runner.cpp:544] 		Generated 118 tokens:	0.246000 (seconds)		 Rate: 	479.674797 (tokens/second)
I 00:00:00.517791 executorch:runner.cpp:552] 	Time to first generated token:	0.023000 (seconds)
I 00:00:00.517793 executorch:runner.cpp:559] 	Sampling time over 127 tokens:	0.094000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters

@cccclai
Contributor

cccclai commented Jul 29, 2024

#4355 removes the RemoveRedundancy pass. Will it cause perf regression here?

@shewu-quic
Collaborator Author

#4355 removes the RemoveRedundancy pass. Will it cause perf regression here?

I think there should be no impact as the final graph should be the same. I checked our ci and it doesn't show regression.
May I know what device you are running and the version of qnn?
Could I see your modified branch?

@cccclai
Contributor

cccclai commented Jul 30, 2024

#4355 removes the RemoveRedundancy pass. Will it cause perf regression here?

I think there should be no impact as the final graph should be the same. I checked our ci and it doesn't show regression. May I know what device you are running and the version of qnn? Could I see your modified branch?

That was just my thought when seeing their PR. After seeing the comment in the original PR, I think it should be good now. My current device is a OnePlus 12 with 16GB RAM. What is your command to build the llama runner?

@shewu-quic
Collaborator Author

I think there should be no impact as the final graph should be the same. I checked our ci and it doesn't show regression. May I know what device you are running and the version of qnn? Could I see your modified branch?

That was just my thought when seeing their PR. After seeing the comment in the original PR, I think it should be good now. My current device is a OnePlus 12 with 16GB RAM. What is your command to build the llama runner?

Got it.
I build the llama runner with .ci/scripts/build_llama_android.sh:

#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

set -exu

# shellcheck source=/dev/null
source "$(dirname "${BASH_SOURCE[0]}")/utils.sh"

install_executorch_and_backend_lib() {
  echo "Installing executorch and xnnpack backend"
  rm -rf cmake-android-out && mkdir cmake-android-out
  # ANDROID_NDK=/opt/ndk
  # BUCK2=buck2
  ANDROID_ABI=arm64-v8a
  cmake -DCMAKE_TOOLCHAIN_FILE="${ANDROID_NDK}/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI="${ANDROID_ABI}" \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_ENABLE_LOGGING=1 \
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DQNN_SDK_ROOT=$QNN_SDK_ROOT \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DXNNPACK_ENABLE_ARM_BF16=OFF \
    -Bcmake-android-out .

  cmake --build cmake-android-out -j16 --target install --config Release
}

build_llama_runner() {
    echo "Building llama runner for Android..."
    ANDROID_ABI=arm64-v8a
    cmake \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK"/build/cmake/android.toolchain.cmake  \
    -DEXECUTORCH_USE_TIKTOKEN=ON \
    -DANDROID_ABI="${ANDROID_ABI}" \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama2 examples/models/llama2

    cmake --build cmake-android-out/examples/models/llama2 -j16 --config Release
}
# install_flatc_from_source
install_executorch_and_backend_lib
build_llama_runner

@shewu-quic
Collaborator Author

shewu-quic commented Jul 31, 2024

Hey. Next, I will create three PRs to enable llama.

  • Add index op and index_put ops
    • If you have some time, could you please take a look at this.
  • Add model sharding mechanism with custom fallback op
  • Add source transform for kv cache and sdpa

Do you have any concerns about this plan?

Thanks a lot.

@leigao97

leigao97 commented Aug 5, 2024

[ERROR] [Qnn ExecuTorch]:  <E> createUnsignedPD unsigned PD or DSPRPC_GET_DSP_INFO not supported by HTP
[ERROR] [Qnn ExecuTorch]:  <E> DspTransport.createUnsignedPD failed, 0x00000003
[ERROR] [Qnn ExecuTorch]:  <E> IDspTransport: Unknown rpc status 0x00000003
[ERROR] [Qnn ExecuTorch]:  <E> DspTransport failed,cannot open session, error 0xffffffff
[ERROR] [Qnn ExecuTorch]:  <E> Error from rpc transport. transportStatus = -1
[ERROR] [Qnn ExecuTorch]:  <E> Failed to retrieve skel build id: err: 1003
[ERROR] [Qnn ExecuTorch]:  <E> Failed to create a new transport session for deviceId 0, coreId 0, pdId 2: err: 14002
[ERROR] [Qnn ExecuTorch]:  <E> Error in creating transport session for deviceId 0, coreId 0, pdId 2, err: 14002
[ERROR] [Qnn ExecuTorch]:  <E> Fail to create context from binary with err 14002

On some devices, the above error about failing to create pdId 2 can be related to the system, especially on SM8650. If possible, please try to upgrade the system/OS version.

I encountered this issue on a OnePlus 12 with the SM8650 chip. I have updated the OS but the problem is still there. Do you have other suggestions, @chiwwang? My QNN version is 2.22.6.240515.

@chiwwang
Contributor

chiwwang commented Aug 6, 2024

@leigao97, if you saw exactly pdId 2:

<E> Failed to create a new transport session for deviceId 0, coreId 0, pdId 2: err: 14002

Then it's related to the system, and it's hard to do anything on the application side.
I heard that this problem can be resolved on the OnePlus 12 by upgrading the OS. I'm not sure whether the OS upgrade rolls out by region, so some regions may not have gotten the fix yet.

I can only suggest

  1. Use adb root to work around it if possible.
  2. Report the bug to OnePlus, i.e., that multiple Hexagon PDs cannot be created in a single process.
  3. Try QNN 2.23, though this might not resolve the problem... mentioning this just because we're using QNN 2.23.

shewu-quic and others added 5 commits August 6, 2024 15:25
Summary:
- Fully delegate meta llama model in fp and quantized
- Add simple calibration
- Use custom fallback op to split graph
- Add model sharding argument
- Add spill-fill feature

Note that due to memory limitations on the device, you need to
specify num_sharding to run llama 7b, and it is recommended to
reboot the device before running to ensure it has enough memory.
Keeping int64 input tensors results in embedding op fallback;
if pos_ids is changed to int32, the model is fully delegated.
- Support GQA, repeating kv caches
- Support the Tiktoken tokenizer for llm/export/builder.py
- Support --embedding-quantize option for qualcomm lowering flow
@chunit-quic chunit-quic force-pushed the dev1/hutton/enable_story_llama_in_meta branch from d87d13d to e68e225 on August 6, 2024 08:59
@chunit-quic
Collaborator

chunit-quic commented Aug 6, 2024

Hi @cccclai,
We have updated this PR for the llama3 lowering flow.
The following change can be applied to build the llama runner via .ci/scripts/build_llama_android.sh:

@@ -25,6 +25,8 @@ install_executorch_and_backend_lib() {
    -DEXECUTORCH_BUILD_EXTENSION_MODULE=ON \
    -DEXECUTORCH_BUILD_EXTENSION_DATA_LOADER=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
+  -DEXECUTORCH_BUILD_QNN=ON \
+  -DQNN_SDK_ROOT=$QNN_SDK_ROOT \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
@@ -39,18 +41,20 @@ build_llama_runner() {
    ANDROID_ABI=arm64-v8a
    cmake -DBUCK2="${BUCK2}" \
    -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK"/build/cmake/android.toolchain.cmake  \
+  -DEXECUTORCH_USE_TIKTOKEN=ON \
    -DANDROID_ABI="${ANDROID_ABI}" \
    -DANDROID_PLATFORM=android-23 \
    -DCMAKE_INSTALL_PREFIX=cmake-android-out \
    -DCMAKE_BUILD_TYPE=Release -DPYTHON_EXECUTABLE=python \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
+  -DEXECUTORCH_BUILD_QNN=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -Bcmake-android-out/examples/models/llama2 examples/models/llama2

    cmake --build cmake-android-out/examples/models/llama2 -j4 --config Release

About the export commands, here is an example for you; you can also add -E "4,1024,32" to export a quantized-embedding version:

python -m examples.models.llama2.export_llama  -t ${tokenizer.model} -p ${params.json} -c ${consolidated.00.bf16.pth}  --use_kv_cache  --qnn --disable_dynamic_shape --num_sharding 1 --pt2e_quantize qnn_16a4w

Thank you

@leigao97

leigao97 commented Aug 7, 2024

@chiwwang Thank you for the help. Rooting the device works for me.
