
LLAMA runner with fp16 models #8465


Closed
SFrav opened this issue Feb 13, 2025 · 13 comments
Labels
module: examples (Issues related to demos under examples/)
module: xnnpack (Issues related to xnnpack delegation and the code under backends/xnnpack/)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

SFrav commented Feb 13, 2025

What edits to LLAMA runner are required to allow fp16 models to run? Is there an existing pull request that I can make use of?

cc @digantdesai @mcr229 @cbilgin @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng

jackzhxng (Contributor)

Which fp16 model are you trying to run? If it's a Llama model, you can specify the dtype in export_llama with -d.
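Something along these lines should do it (a minimal sketch; paths are placeholders and the remaining flags follow the llama example README):

python -m examples.models.llama.export_llama \
    --checkpoint <path/to/checkpoint.pth> \
    -p <path/to/params.json> \
    -kv \
    -X \
    -d fp16 \
    --output_name="llama_fp16.pte"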

jackzhxng added the module: examples and triaged labels Feb 13, 2025
github-project-automation bot moved this to To triage in ExecuTorch Core Feb 13, 2025
SFrav (Author) commented Feb 13, 2025

This is for running a model exported to .pte using export_llama with -d fp16.

However, when I try to use the exported model with the llama runner (as in the example below), it throws an error.

The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

cmake -DPYTHON_EXECUTABLE=python \
    -DCMAKE_INSTALL_PREFIX=cmake-out \
    -DCMAKE_BUILD_TYPE=Release \
    -DEXECUTORCH_BUILD_KERNELS_CUSTOM=ON \
    -DEXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON \
    -DEXECUTORCH_BUILD_XNNPACK=ON \
    -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON \
    -Bcmake-out/examples/models/llama \
    examples/models/llama

cmake --build cmake-out/examples/models/llama -j16 --config Release
cmake-out/examples/models/llama/llama_main --model_path=<model pte file> --tokenizer_path=<tokenizer.model> --prompt=<prompt>

These commands work for fp32 llama models.

lucylq moved this from To triage to In progress in ExecuTorch Core Feb 13, 2025
tarun292 (Contributor)

@SFrav fp16 might not be fully supported on XNNPack. @mcr229 @digantdesai what do you guys think?

tarun292 added the module: xnnpack label Feb 13, 2025
mcr229 (Contributor) commented Feb 13, 2025

@SFrav what hardware are you trying to run your model on? it seems like the hardware isn't supported here.

SFrav (Author) commented Feb 13, 2025

> @SFrav what hardware are you trying to run your model on? it seems like the hardware isn't supported here.

Testing on a laptop PC CPU. This works with fp32 pte models.
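For what it's worth, a quick way to see whether the CPU advertises any half-precision features (a diagnostic sketch for Linux; flag names vary by vendor, and whether XNNPACK's fp16 kernels can use them depends on the XNNPACK build):

# Look for fp16-related CPU flags (f16c/avx512fp16 on x86, fphp/asimdhp on ARM)
grep -m1 -oiE 'f16c|avx512fp16|fphp|asimdhp' /proc/cpuinfo || echo "no fp16-related CPU flags found"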

kimishpatel (Contributor)

what is your export command? And what is the runtime error you are getting?

SFrav (Author) commented Feb 14, 2025

> what is your export command? And what is the runtime error you are getting?

Exported using the DeepSeek distill guide – linked here

As far as I can see, the only difference from the main LLAMA tutorial is in the data type, fp16 rather than fp32.

There's also this issue (#7981) with some talk of memory limitations. That would explain the choice of data type in the tutorial.

python -m examples.models.llama.export_llama \
    --checkpoint /tmp/deepseek-ai/DeepSeek-R1-Distill-Llama-8B/checkpoint.pth \
    -p /tmp/params.json \
    -kv \
    --use_sdpa_with_kv_cache \
    -X \
    -qmode 8da4w \
    --group_size 128 \
    -d fp16 \
    --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
    --embedding-quantize 4,32 \
    --output_name="DeepSeek-R1-Distill-Llama-8B.pte"

The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

kimishpatel (Contributor)

Yeah so fp16 is not actually well supported today. I am a little surprised that we did not get any export time failure.

SFrav (Author) commented Feb 14, 2025

> Yeah so fp16 is not actually well supported today. I am a little surprised that we did not get any export time failure.

The export process produced a 4 GB .pte file, with one warning about the KV cache thrown repeatedly. It seemed to work, but of course is untested.

Perhaps just resolve this issue by changing the tutorial to use fp32, with an advisory that it needs more than 32 GB of memory to export the .pte (I'm not exactly sure how much memory is needed).

digantdesai (Contributor) commented Feb 14, 2025

> The error is: XNN Runtime creation failed with code: xnn_status_unsupported_hardware ... XNN::compileModel failed: 0x1

We should look into this, and also aim to get an fp16 llama (whichever variant works) into CI, if it isn't there already. FWIW, the eventual goal is to support fp32, bf16, and fp16, with priority in that order.

jackzhxng (Contributor)

Closing - let's create an issue for this if there isn't one already

cccclai (Contributor) commented Feb 18, 2025

What device are you targeting? There are some other backends that support fp16, like Core ML, MPS, and Qualcomm.
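
If one of those is an option, a quick way to check which backend flags your export_llama checkout exposes (the flag names in the grep pattern are assumptions; confirm with --help):

# List backend-related options (e.g. --coreml, --mps, --qnn, if present in your version)
python -m examples.models.llama.export_llama --help | grep -iE 'coreml|mps|qnn'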

SFrav (Author) commented Feb 18, 2025

> What device are you targeting? There are some other backends that support fp16, like Core ML, MPS, and Qualcomm.

Using XNNPACK to target Android. I don't think any of those suit my test device.
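
For the Android target, a quick sanity check of whether the device CPU reports half-precision features (a diagnostic sketch; fphp/asimdhp are the ARMv8.2 fp16 feature bits):

# On the connected Android device, print the CPU feature list
adb shell cat /proc/cpuinfo | grep -i -m1 features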

Projects
Status: Done
Development

No branches or pull requests

7 participants