qnn end to end flow #3038


Closed · wants to merge 13 commits into from

Conversation

@cccclai (Contributor) commented Apr 14, 2024

Stack from ghstack (oldest at bottom):

Patch a few changes, including:

  • support the bool tensor type
  • support fp16 and fix the 8w8a quantization
  • add two unsupported ops (slice_scatter and index_put) to common_defs.py (see the sketch after this list)
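For context, here is a minimal sketch of the kind of common_defs.py change the last bullet describes; the list name and import path are assumptions, not verbatim from the patch:

```
# Hypothetical sketch of the common_defs.py addition; the list name and
# import path are assumed, not copied from the patch.
from executorch.exir.dialects._ops import ops as exir_ops

# Ops the QNN delegate cannot lower yet; the partitioner leaves them on the
# CPU so the rest of the graph can still be delegated.
not_supported_operator = [
    exir_ops.edge.aten.slice_scatter.default,
    exir_ops.edge.aten.index_put.default,
]
```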

The stories model now works end to end.

AOT:
fp16:
```
python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json
```

quantize:
```
python -m examples.models.llama2.export_llama -kv --qnn --pt2e_quantize qnn_8a8w -c stories110M.pt -p params.json
```
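Under the hood, --pt2e_quantize drives the standard PT2E quantization flow with a Qualcomm quantizer. A rough sketch follows; the QnnQuantizer import path and the capture API used by export_llama are assumptions about this codebase, not verbatim from the patch:

```
# Rough sketch of the PT2E flow behind --pt2e_quantize; import paths and the
# capture API are assumptions, not copied from export_llama.
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer  # assumed path

def quantize_8a8w(model: torch.nn.Module, example_inputs: tuple):
    quantizer = QnnQuantizer()  # defaults assumed to mean 8-bit activations/weights
    captured = capture_pre_autograd_graph(model, example_inputs)
    prepared = prepare_pt2e(captured, quantizer)
    prepared(*example_inputs)      # calibrate with at least one representative batch
    return convert_pt2e(prepared)  # folds observers into quantize/dequantize ops
```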

Runtime:
```
/llama_main --model_path=llama2_fp16_qnn_2.21.pte --tokenizer_path=tokenizer.bin --prompt="Once"
```

Output:
```
Once upon a time, there was a boy named Tim. Tim had a pet dog named Max. Max was a big, strong dog. They liked to play and run in the park.
One day, Tim and Max went to the park to play. They saw a cat. The cat was up in a tree. Max wanted to help the cat. He tried to climb the tree, but he could not.
Then, something unexpected happened. Max started to climb the tree! He was very strong. Max helped the cat come down. The cat was happy. Tim was so proud of his pet.
```
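Before pushing to a device, the exported .pte can also be smoke-tested from Python. A hedged sketch: the pybinding entry point is real, but loading a QNN-delegated program this way needs an ExecuTorch build with the Qualcomm backend registered, and the (token, position) input signature is an assumption about the -kv export:

```
# Hedged smoke test of the exported program from Python; input shapes, dtypes,
# and the token id below are assumptions about the -kv export, not verified
# against this PR.
import torch
from executorch.extension.pybindings.portable_lib import _load_for_executorch

program = _load_for_executorch("llama2_fp16_qnn_2.21.pte")
token = torch.tensor([[1]], dtype=torch.long)     # hypothetical BOS token id
input_pos = torch.tensor([0], dtype=torch.long)   # kv-cache write position
logits = program.forward((token, input_pos))[0]
print(logits.shape)
```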

The stories model is very small and therefore sensitive to quantization.

Differential Revision: D56119738


pytorch-bot bot commented Apr 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/3038

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 357e94e with merge base 2c467dd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Apr 14, 2024
cccclai added a commit that referenced this pull request Apr 14, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222465750
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D56119738


cccclai added a commit that referenced this pull request Apr 14, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222465994

cccclai added a commit that referenced this pull request Apr 14, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222466043

cccclai added a commit that referenced this pull request Apr 14, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222471499

cccclai added a commit that referenced this pull request Apr 14, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222473434

cccclai added a commit that referenced this pull request Apr 15, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222613601

cccclai added a commit that referenced this pull request Apr 16, 2024
Pull Request resolved: #3038 · ghstack-source-id: 222650468

cccclai added a commit that referenced this pull request Apr 18, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223091081

cccclai added a commit that referenced this pull request Apr 19, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223136109

cccclai added a commit that referenced this pull request Apr 19, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223152097

cccclai added a commit that referenced this pull request Apr 19, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223153222

cccclai added a commit that referenced this pull request Apr 19, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223160858

cccclai added a commit that referenced this pull request Apr 19, 2024
Pull Request resolved: #3038 · ghstack-source-id: 223199545
@facebook-github-bot
Contributor

This pull request has been merged in 3257c66.

cccclai added a commit that referenced this pull request Apr 19, 2024
Summary:
Pull Request resolved: #3038

Patch a few changes, including:
- support the bool tensor type
- support fp16 and fix the 8w8a quantization
- add two unsupported ops (slice_scatter and index_put) to common_defs.py

The stories model works end to end.
AOT:
fp16:
```
python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json
```
quantize:
```
python -m examples.models.llama2.export_llama -kv --qnn --pt2e_quantize qnn_8a8w -c stories110M.pt -p params.json
```

Runtime:
```
/llama_main --model_path=llama2_fp16_qnn_2.21.pte  --tokenizer_path=tokenizer.bin --prompt="Once"
```
Output:
```
Once upon a time, there was a little girl named Lily. She loved to play outside and explore the world around her. One day, she went on a walk with her mommy and they found a beautiful landscape with lots of trees and flowers.
Lily said, "Mommy, this place is so pretty! Can we take a picture?"
Mommy replied, "Of course, Lily! Let's take a picture to remember the original place we found."
After they took the picture, they continued their walk and saw a bird flying in the sky. Lily said, "MomPyTorchObserver {"prompt_tokens":2,"generated_tokens":125,"model_load_start_ms":1713226585936,"model_load_end_ms":1713226586909,"inference_start_ms":1713226586909,"inference_end_ms":1713226590363,"prompt_eval_end_ms":1713226586966,"first_token_ms":1713226586994,"aggregate_sampling_time_ms":23,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
I 00:00:04.436699 executorch:runner.cpp:414] 	Prompt Tokens: 2    Generated Tokens: 125
I 00:00:04.436703 executorch:runner.cpp:420] 	Model Load Time:		0.973000 (seconds)
I 00:00:04.436732 executorch:runner.cpp:430] 	Total inference time:		3.454000 (seconds)		 Rate: 	36.189925 (tokens/second)
I 00:00:04.436735 executorch:runner.cpp:438] 		Prompt evaluation:	0.057000 (seconds)		 Rate: 	35.087719 (tokens/second)
I 00:00:04.436739 executorch:runner.cpp:449] 		Generated 125 tokens:	3.397000 (seconds)		 Rate: 	36.797174 (tokens/second)
I 00:00:04.436742 executorch:runner.cpp:457] 	Time to first generated token:	0.085000 (seconds)
I 00:00:04.436744 executorch:runner.cpp:464] 	Sampling time over 127 tokens:	0.023000 (seconds)
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
```
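As a quick sanity check, the rates reported above can be recomputed from the PyTorchObserver timestamps; this script is a back-of-the-envelope sketch, not part of the PR:

```
# Recomputing the reported rates from the PyTorchObserver JSON above;
# numbers are copied from the log.
generated, prompt = 125, 2
inference_ms = 1713226590363 - 1713226586909    # 3454 ms total inference
prompt_eval_ms = 1713226586966 - 1713226586909  # 57 ms prompt evaluation

print(generated / (inference_ms / 1000.0))                     # ~36.19 tok/s overall
print(prompt / (prompt_eval_ms / 1000.0))                      # ~35.09 tok/s prompt eval
print(generated / ((inference_ms - prompt_eval_ms) / 1000.0))  # ~36.80 tok/s generation
```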

The stories model is very small and therefore sensitive to quantization.
ghstack-source-id: 223199545
exported-using-ghexport

Reviewed By: mergennachin, kirklandsign

Differential Revision: D56119738

fbshipit-source-id: daf5563fe51a677f302e09ae8a9fb80e6bda72c5
(cherry picked from commit 3257c66)
guangy10 pushed a commit that referenced this pull request Apr 20, 2024
(cherry picked from commit 3257c66)
Labels: CLA Signed, fb-exported, Merged