[ET-VK] Improve packing format for int4 linear operator + misc improvements #9949


Merged: 1 commit merged into main from gh/SS-JIA/206/orig on Apr 7, 2025

Conversation


@SS-JIA SS-JIA commented Apr 7, 2025

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #9883 by @SS-JIA
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/SS-JIA/206/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/206/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/SS-JIA/206/orig
@diff-train-skip-merge

…ements

Pull Request resolved: #9883

## Context

Improve the performance of the quantized int4 linear shader by packing the scales/zeros tensor, as well as the weight tensor, in a more optimal layout.

See the comments in the `pack_int4_linear_weight_transposed_interleave` shader for more details about how the new packing works.
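The shader comments are the authoritative description of the layout; as a loose, hypothetical sketch of the general idea only (two 4-bit values per byte, emitted in transposed order; the function name and traversal here are illustrative assumptions, not the shader's actual scheme), the packing could look like:

```python
# Hypothetical sketch of int4 weight packing: two 4-bit values per byte,
# emitted in transposed order. The function name and exact traversal are
# illustrative assumptions; the real layout is described in the comments of
# the pack_int4_linear_weight_transposed_interleave shader.

def pack_int4_transposed(weights):
    """weights: rows of uint4 values (ints in [0, 15])."""
    rows, cols = len(weights), len(weights[0])
    packed = []
    for c in range(cols):            # transpose: walk columns first
        for r in range(0, rows, 2):  # pair up adjacent rows
            lo = weights[r][c] & 0xF
            hi = weights[r + 1][c] & 0xF if r + 1 < rows else 0
            packed.append(lo | (hi << 4))  # two nibbles -> one byte
    return packed

print(pack_int4_transposed([[1, 2], [3, 4], [5, 6], [7, 8]]))  # [49, 117, 66, 134]
```

Packing like this halves the memory footprint of the weights and lets the consuming shader fetch co-used values contiguously.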

## Changes

* Split int8 quantized linear and int4 quantized linear into separate C++ files for better code organization
* Introduce packing shader for int4 weights
* Update int4 linear shader to account for packed weights

## Impact

This change massively improves the performance of the weight-only int4 quantized linear operator.

With this change, running LLaMa 3.2 1B on an Adreno 740 goes from 0.9 tok/s to 10 tok/s, roughly a 10x improvement!
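The headline speedup can be sanity-checked directly from the per-run stats in the logs below (113 generated tokens in each run):

```python
# Recompute the logged generation rates from the raw counts/times in the
# benchmark output below (113 generated tokens in both runs).
after_rate = 113 / 11.459    # tok/s after this change
before_rate = 113 / 115.581  # tok/s before this change
speedup = after_rate / before_rate
print(round(after_rate, 2), round(before_rate, 2), round(speedup, 1))  # 9.86 0.98 10.1
```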

With this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 332.3 MB/s (74692800 bytes in 0.214s)
I 00:00:00.003353 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003533 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003563 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003685 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003747 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003799 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003852 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003902 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003976 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004289 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:04.841690 executorch:runner.cpp:101] Reading metadata from model
I 00:00:04.841808 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:04.841830 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:04.841851 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:04.841874 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:04.841893 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:04.841909 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:04.841927 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:04.841945 executorch:runner.cpp:133] eos_id = 128009
I 00:00:04.841951 executorch:runner.cpp:133] eos_id = 128001
I 00:00:04.841963 executorch:runner.cpp:188] RSS after loading model: 2229.828125 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:06.239633 executorch:runner.cpp:258] RSS after prompt prefill: 2229.828125 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:00:17.699086 executorch:runner.cpp:272] RSS after finishing text generation: 2229.828125 MiB (0 if unsupported)
I 00:00:17.699155 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:00:17.699161 executorch:stats.h:114]       Model Load Time:                4.837000 (seconds)
I 00:00:17.699165 executorch:stats.h:124]       Total inference time:           12.857000 (seconds)              Rate:  8.788987 (tokens/second)
I 00:00:17.699168 executorch:stats.h:132]               Prompt evaluation:      1.398000 (seconds)               Rate:  10.014306 (tokens/second)
I 00:00:17.699171 executorch:stats.h:143]               Generated 113 tokens:   11.459000 (seconds)              Rate:  9.861244 (tokens/second)
I 00:00:17.699174 executorch:stats.h:151]       Time to first generated token:  1.398000 (seconds)
I 00:00:17.699177 executorch:stats.h:158]       Sampling time over 127 tokens:  549246500.843000 (seconds)
```

Before this change:

```
/home/ssjia/scratch/bin/app_bin: 1 file pushed, 0 skipped. 302.0 MB/s (74637464 bytes in 0.236s)
I 00:00:00.003050 executorch:cpuinfo_utils.cpp:62] Reading file /sys/devices/soc0/image_version
I 00:00:00.003200 executorch:cpuinfo_utils.cpp:78] Failed to open midr file /sys/devices/soc0/image_version
I 00:00:00.003226 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu0/regs/identification/midr_el1
I 00:00:00.003337 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu1/regs/identification/midr_el1
I 00:00:00.003396 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu2/regs/identification/midr_el1
I 00:00:00.003449 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu3/regs/identification/midr_el1
I 00:00:00.003502 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu4/regs/identification/midr_el1
I 00:00:00.003553 executorch:cpuinfo_utils.cpp:91] Reading file /sys/devices/system/cpu/cpu5/regs/identification/midr_el1
I 00:00:00.003629 executorch:main.cpp:69] Resetting threadpool with num threads = 6
I 00:00:00.004075 executorch:runner.cpp:68] Creating LLaMa runner: model_path=/data/local/tmp/llama3-1b/vk/llama3.pte, tokenizer_path=/data/local/tmp/tokenizer.model
I 00:00:05.417531 executorch:runner.cpp:101] Reading metadata from model
I 00:00:05.417647 executorch:runner.cpp:126] Metadata: get_vocab_size = 128256
I 00:00:05.417669 executorch:runner.cpp:126] Metadata: get_bos_id = 128000
I 00:00:05.417698 executorch:runner.cpp:126] Metadata: use_sdpa_with_kv_cache = 1
I 00:00:05.417716 executorch:runner.cpp:126] Metadata: use_kv_cache = 1
I 00:00:05.417735 executorch:runner.cpp:126] Metadata: get_max_context_len = 128
I 00:00:05.417751 executorch:runner.cpp:126] Metadata: get_max_seq_len = 128
I 00:00:05.417768 executorch:runner.cpp:126] Metadata: enable_dynamic_shape = 0
I 00:00:05.417787 executorch:runner.cpp:133] eos_id = 128009
I 00:00:05.417793 executorch:runner.cpp:133] eos_id = 128001
I 00:00:05.417808 executorch:runner.cpp:188] RSS after loading model: 2230.812500 MiB (0 if unsupported)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>Tell me a short story.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I 00:00:19.689616 executorch:runner.cpp:258] RSS after prompt prefill: 2230.812500 MiB (0 if unsupported)
Here's a short story for you:

**The Library of Lost Memories**

In a small, dusty town nestled between two great rivers, there was a library that held the secrets of the past. It was a place where memories were stored, not retrieved, and the librarians were the guardians of the past.

The library was called the Library of Lost Memories, and it was said that anyone who entered its doors would be given a glimpse into the memories of those who had come before. The librarians were wise and kind, and they would only allow those who wereI 00:02:15.269693 executorch:runner.cpp:272] RSS after finishing text generation: 2230.812500 MiB (0 if unsupported)
I 00:02:15.269810 executorch:stats.h:108]       Prompt Tokens: 14    Generated Tokens: 113
I 00:02:15.269825 executorch:stats.h:114]       Model Load Time:                5.414000 (seconds)
I 00:02:15.269832 executorch:stats.h:124]       Total inference time:           129.852000 (seconds)             Rate:  0.870221 (tokens/second)
I 00:02:15.269837 executorch:stats.h:132]               Prompt evaluation:      14.271000 (seconds)              Rate:  0.981010 (tokens/second)
I 00:02:15.269841 executorch:stats.h:143]               Generated 113 tokens:   115.581000 (seconds)             Rate:  0.977669 (tokens/second)
I 00:02:15.269844 executorch:stats.h:151]       Time to first generated token:  14.271000 (seconds)
I 00:02:15.269847 executorch:stats.h:158]       Sampling time over 127 tokens:  549711269.115000 (seconds)

PyTorchObserver {"prompt_tokens":14,"generated_tokens":113,"model_load_start_ms":1743712527974,"model_load_end_ms":1743712533388,"inference_start_ms":1743712533388,"inference_end_ms":1743712663240,"prompt_eval_end_ms":1743712547659,"first_token_ms":1743712547659,"aggregate_sampling_time_ms":549711269115,"SCALING_FACTOR_UNITS_PER_SECOND":1000}
```
ghstack-source-id: 276566116
@exported-using-ghexport

Differential Revision: [D72412950](https://our.internmc.facebook.com/intern/diff/D72412950/)

pytorch-bot bot commented Apr 7, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9949


@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2025
@kirklandsign kirklandsign added the release notes: vulkan Changes to the Vulkan backend delegate label Apr 7, 2025
@kirklandsign kirklandsign merged commit 031459f into main Apr 7, 2025
79 of 81 checks passed
@kirklandsign kirklandsign deleted the gh/SS-JIA/206/orig branch April 7, 2025 22:54
kirklandsign pushed a commit that referenced this pull request Apr 11, 2025
…ements (#9949)

keyprocedure pushed a commit to keyprocedure/executorch that referenced this pull request Apr 21, 2025
…ements (pytorch#9949)
