Pull Request resolved: #3038
This patch includes a few changes:
- Support the bool tensor type.
- Support fp16 and fix 8a8w quantization.
- Add two unsupported ops (slice_scatter and index_put) to common_defs.py.
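The common_defs.py entries mark ops the QNN backend cannot lower, so the partitioner leaves them on the CPU. A minimal pure-Python sketch of that deny-list idea (the set and helper below are illustrative, not the actual ExecuTorch API):

```python
# Illustrative deny-list for ops the backend cannot lower; the real list
# lives in common_defs.py and uses exir op objects rather than strings.
NOT_SUPPORTED_OPS = {
    "aten.slice_scatter.default",
    "aten.index_put.default",
}

def is_node_supported(op_name: str) -> bool:
    """Return False for ops that must stay on the CPU fallback path."""
    return op_name not in NOT_SUPPORTED_OPS

print(is_node_supported("aten.slice_scatter.default"))  # False: falls back to CPU
```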
The stories model works end to end:
AOT:
fp16:
```
python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json
```
quantize:
```
python -m examples.models.llama2.export_llama -kv --qnn --pt2e_quantize qnn_8a8w -c stories110M.pt -p params.json
```
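For reference, 8a8w means both activations and weights are quantized to 8 bits. A minimal sketch of the per-tensor affine quantization math involved (plain Python for clarity; the actual flow goes through PT2E, not this helper):

```python
# Illustrative int8 affine quantization: map floats to [-128, 127] with a
# scale and zero point, as done per tensor for both activations and weights
# in an 8a8w scheme. Not the PT2E implementation, just the arithmetic.
def quantize(values, qmin=-128, qmax=127):
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must include zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

vals = [-1.0, 0.0, 0.25, 2.0]
q, scale, zp = quantize(vals)
roundtrip = dequantize(q, scale, zp)
print(q, scale, zp)
```

The round-trip error is bounded by the scale, which is why a very small model like stories110M can be noticeably degraded by it.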
Runtime:
```
/llama_main --model_path=llama2_fp16_qnn_2.21.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```
Output:
```
Once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she went on a walk with her mommy and they found a beautiful landscape with lots of trees and flowers.
Lily said, "Mommy, this place is so pretty! Can we take a picture?"
Mommy replied, "Of course, Lily! Let's take a picture to remember the original place we found."
After they took the picture, they continued their walk and saw a bird flying in the sky. Lily said, "MomPyTorchObserver {"prompt_tokens":2,"generated_tokens":125,"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:04.436699 executorch:runner.cpp:414] Prompt Tokens: 2 Generated Tokens: 125
I 00:00:04.436703 executorch:runner.cpp:420] Model Load Time: 0.973000 (seconds)
I 00:00:04.436732 executorch:runner.cpp:430] Total inference time: 3.454000 (seconds) Rate: 36.189925 (tokens/second)
I 00:00:04.436735 executorch:runner.cpp:438] Prompt evaluation: 0.057000 (seconds) Rate: 35.087719 (tokens/second)
I 00:00:04.436739 executorch:runner.cpp:449] Generated 125 tokens: 3.397000 (seconds) Rate: 36.797174 (tokens/second)
I 00:00:04.436742 executorch:runner.cpp:457] Time to first generated token: 0.085000 (seconds)
I 00:00:04.436744 executorch:runner.cpp:464] Sampling time over 127 tokens: 0.023000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
```
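The `PyTorchObserver` line in the log is JSON, so the reported rates can be recomputed from it directly. A quick sketch using the field names from the log above:

```python
import json

# Extract the JSON payload from the runner's "PyTorchObserver {...}" line
# and recompute the generation rate the runner reports.
line = ('PyTorchObserver {"prompt_tokens":2,"generated_tokens":125,'
        '"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,'
        '"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,'
        '"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,'
        '"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}')

stats = json.loads(line.split("PyTorchObserver ", 1)[1])
gen_seconds = (stats["inference_end_ms"] - stats["prompt_eval_end_ms"]) / 1000
tokens_per_second = stats["generated_tokens"] / gen_seconds
print(f"{tokens_per_second:.6f}")  # 36.797174, matching the runner log
```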
Note that the stories model is very small and therefore sensitive to quantization.
ghstack-source-id: 223152097
@exported-using-ghexport
Differential Revision: [D56119738](https://our.internmc.facebook.com/intern/diff/D56119738/)
The `--pt2e_quantize` flag's help text reads: "Use PT2E quantization. Comma separated options. e.g. xnnpack_dynamic (for per channel 8 bit weight), xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding."
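The flag takes comma-separated options, so several schemes can be combined in one invocation. A minimal argparse sketch of that parsing (the parser below is illustrative, not the actual export_llama argument setup):

```python
import argparse

# Hypothetical stand-in for the export_llama argument parser, showing how
# comma-separated --pt2e_quantize options can be split into a list.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--pt2e_quantize",
    default=None,
    help="Use PT2E quantization. Comma separated options. e.g. "
         "xnnpack_dynamic (for per channel 8 bit weight), "
         "xnnpack_dynamic_qc4 (for per channel 4 bit weight), embedding.",
)

args = parser.parse_args(["--pt2e_quantize", "qnn_8a8w,embedding"])
options = args.pt2e_quantize.split(",") if args.pt2e_quantize else []
print(options)  # ['qnn_8a8w', 'embedding']
```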