LLAMA runner with fp16 models #8465
Which fp16 model are you trying to run? If it's a Llama model you can specify the dtype when exporting.
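Something along these lines (a sketch only; the paths are placeholders and the exact flag names depend on your ExecuTorch version):

```
# Sketch: dtype is chosen at export time via the dtype-override flag.
# Paths are placeholders; verify flag names with
# `python -m examples.models.llama.export_llama --help` for your version.
python -m examples.models.llama.export_llama \
  --checkpoint /path/to/consolidated.00.pth \
  --params /path/to/params.json \
  -kv \
  -X \
  -d fp16 \
  --output_name "llama_fp16.pte"
```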
This is for running a model exported to .pte. However, when I try to use the exported model with the llama runner (as in the example below), it throws an error. The error is:
These commands work for fp32 Llama models.
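For reference, the invocation is roughly the following (a sketch with placeholder paths; the binary location assumes a standard cmake build of the example runner):

```
# Works with fp32 .pte files; fails with the fp16 export.
# Binary path assumes the default cmake-out build directory.
cmake-out/examples/models/llama/llama_main \
  --model_path=llama_fp16.pte \
  --tokenizer_path=tokenizer.model \
  --prompt="Once upon a time"
```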
@SFrav fp16 might not be fully supported on XNNPACK. @mcr229 @digantdesai what do you guys think?
@SFrav what hardware are you trying to run your model on? It seems like the hardware isn't supported here.
Testing on a laptop PC CPU. This works with fp32 .pte models.
What is your export command? And what is the runtime error you are getting?
Exported using the DeepSeek distill guide, linked here. As far as I can see, the only difference from the main LLAMA tutorial is the data type: fp16 rather than fp32. There's also this [issue where there is some talk of memory limitations](#7981); that would explain the choice of data type in the tutorial.
The error is:
Yeah, so fp16 is not actually well supported today. I am a little surprised that we did not get any export-time failure.
The export process produced a 4 GB .pte file, with one warning about the KV cache repeatedly thrown. It seemed to work, but of course is untested. Perhaps just resolve this issue by changing the tutorial to use fp32, with an advisory that it needs more than 32 GB of memory to export the .pte (not exactly sure how much memory is needed).
We should look into this, and also aim to get fp16 Llama (or whatever variant works) into CI, if it isn't there already. FWIW, the eventual goal is to support fp32, bf16, and fp16, in that order of priority.
Closing - let's create an issue for this if there isn't one already.
What device are you targeting? There are some other backends that support fp16, like CoreML, MPS, and Qualcomm.
Using XNNPACK to target Android. I don't think any of those suit my test device.
What edits to LLAMA runner are required to allow fp16 models to run? Is there an existing pull request that I can make use of?
cc @digantdesai @mcr229 @cbilgin @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng