
LLAMA runner with fp16 models #8465


Closed
SFrav opened this issue Feb 13, 2025 · 13 comments
Labels
module: examples (Issues related to demos under examples/)
module: xnnpack (Issues related to xnnpack delegation and the code under backends/xnnpack/)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

SFrav commented Feb 13, 2025

What edits to LLAMA runner are required to allow fp16 models to run? Is there an existing pull request that I can make use of?

cc @digantdesai @mcr229 @cbilgin @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng

jackzhxng (Contributor)

Which fp16 model are you trying to run? If it's a Llama model, you can specify the dtype in export_llama with -d.
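Something along these lines should do it (a minimal sketch; paths are placeholders and the remaining flags follow the llama example README):

python -m examples.models.llama.export_llama \
    --checkpoint <path/to/checkpoint.pth> \
    -p <path/to/params.json> \
    -kv \
    -X \
    -d fp16 \
    --output_name="llama_fp16.pte"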

jackzhxng added the module: examples and triaged labels Feb 13, 2025
github-project-automation bot moved this to To triage in ExecuTorch Core Feb 13, 2025
SFrav (Author) commented Feb 13, 2025

This is for running a model exported to .pte using export_llama with -d fp16.

However, when I try to use the exported model with the llama runner (as in the example below), it throws an error.

The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
cmake-out/examples/models/llama/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.model> --prompt=<prompt>

These commands work for fp32 llama models.

lucylq moved this from To triage to In progress in ExecuTorch Core Feb 13, 2025
tarun292 (Contributor)

@SFrav fp16 might not be fully supported on XNNPack. @mcr229 @digantdesai what do you guys think?

tarun292 added the module: xnnpack label Feb 13, 2025
mcr229 (Contributor) commented Feb 13, 2025

@SFrav what hardware are you trying to run your model on? it seems like the hardware isn't supported here.

SFrav (Author) commented Feb 13, 2025

> @SFrav what hardware are you trying to run your model on? it seems like the hardware isn't supported here.

Testing on a laptop PC CPU. This works with fp32 pte models.
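For what it's worth, a quick way to see whether the CPU advertises any half-precision features (a diagnostic sketch for Linux; flag names vary by vendor, and whether XNNPACK's fp16 kernels can use them depends on the XNNPACK build):

# Look for fp16-related CPU flags (f16c/avx512fp16 on x86, fphp/asimdhp on ARM)
grep -m1 -oiE 'f16c|avx512fp16|fphp|asimdhp' /proc/cpuinfo || echo "no fp16-related CPU flags found"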

kimishpatel (Contributor)

what is your export command? And what is the runtime error you are getting?

SFrav (Author) commented Feb 14, 2025

> what is your export command? And what is the runtime error you are getting?

Exported using the DeepSeek distill guide – linked here

As far as I can see, the only difference from the main LLAMA tutorial is in the data type, fp16 rather than fp32.

There's also this issue (#7981) with some talk of memory limitations. That would explain the choice of data type in the tutorial.

python -m examples.models.llama.export_llama \
    --checkpoint /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth \
    -p /tmp/params.json \
    -kv \
    --use_sdpa_with_kv_cache \
    -X \
    -qmode 8da4w \
    --group_size 128 \
    -d fp16 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --embedding-quantize 4,32 \
    --output_name="DeepSeek-R1-Distill-Llama-8B.pte"

The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

kimishpatel (Contributor)

Yeah so fp16 is not actually well supported today. I am a little surprised that we did not get any export time failure.

SFrav (Author) commented Feb 14, 2025

> Yeah so fp16 is not actually well supported today. I am a little surprised that we did not get any export time failure.

The export process produced a 4 GB .pte file, with one warning about the KV cache thrown repeatedly. It seemed to work, but of course is untested.

Perhaps just resolve this issue by changing the tutorial to use fp32, with an advisory that it needs more than 32 GB of memory to export the .pte (I'm not exactly sure how much memory is needed).

digantdesai (Contributor) commented Feb 14, 2025

> The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

We should look into this, and also aim to get an fp16 llama (whichever variant works) into CI, if it isn't there already. FWIW, the eventual goal is to support fp32, bf16, and fp16, with priority in that order.

jackzhxng (Contributor)

Closing - let's create an issue for this if there isn't one already

cccclai (Contributor) commented Feb 18, 2025

What device are you targeting? There are some other backends that support fp16, like Core ML, MPS, and Qualcomm.
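
If one of those is an option, a quick way to check which backend flags your export_llama checkout exposes (the flag names in the grep pattern are assumptions; confirm with --help):

# List backend-related options (e.g. --coreml, --mps, --qnn, if present in your version)
python -m examples.models.llama.export_llama --help | grep -iE 'coreml|mps|qnn'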

SFrav (Author) commented Feb 18, 2025

> What device are you targeting? There are some other backends that support fp16, like Core ML, MPS, and Qualcomm.

Using XNNPACK to target Android. I don't think any of those suit my test device.
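
For the Android target, a quick sanity check of whether the device CPU reports half-precision features (a diagnostic sketch; fphp/asimdhp are the ARMv8.2 fp16 feature bits):

# On the connected Android device, print the CPU feature list
adb shell cat /proc/cpuinfo | grep -i -m1 features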

Projects
Status: Done
Development

No branches or pull requests

7 participants