@@ -32,15 +32,19 @@ Note that groupsize less than 128 was not enabled, since such model were still t
## Performance

- Performance was measured on Samsung Galaxy S22, S23, S24 and One Plus 12. Measurement performance is in terms of tokens/second.
+ Performance was measured on Samsung Galaxy S22, S24, One Plus 12, and iPhone 15 Pro Max. Performance is reported in tokens/second.

| Device | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
|--------|----------------------------------|----------------------------------|
- | Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
- | Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
- | One plus 12 | 11.55 tokens/second | 11.6 tokens/second |
- | iPhone 15 pro | x | x |
+ | Galaxy S22* | 8.15 tokens/second | 8.3 tokens/second |
+ | Galaxy S24* | 10.66 tokens/second | 11.26 tokens/second |
+ | One Plus 12* | 11.55 tokens/second | 11.6 tokens/second |
+ | Galaxy S22** | 5.5 tokens/second | 5.9 tokens/second |
+ | iPhone 15 Pro** | ~6 tokens/second | ~6 tokens/second |

+ * : Measured via adb binary based [workflow](#Step-5:-Run-benchmark-on-Android-phone)
+
+ ** : Measured via app based [workflow](#Step-6:-Build-Mobile-apps)
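
For reference, the tokens/second figures above are computed the usual way: the number of generated tokens divided by the wall-clock time the decode loop takes. A minimal sketch of that calculation, assuming a hypothetical `generate_fn` wrapper around whichever runner is being benchmarked (adb binary or mobile app harness):

```python
import time

def tokens_per_second(generate_fn, prompt: str, max_new_tokens: int) -> float:
    """Return decode throughput as generated tokens / elapsed wall-clock seconds.

    `generate_fn` is a placeholder for the runner under test; it is assumed
    to block until `max_new_tokens` tokens have been produced for `prompt`.
    """
    start = time.perf_counter()
    generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return max_new_tokens / elapsed

# Hypothetical usage: ~8 tokens/second would match the Galaxy S22 (group size 128) row.
# print(f"{tokens_per_second(my_runner, 'Once upon a time', 128):.2f} tokens/second")
```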
# Instructions
@@ -241,7 +245,6 @@ Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-de
- Enabling LLama2 7b and other architectures via Vulkan
- Enabling performant execution of widely used quantization schemes.

- TODO

# Notes
This example tries to reuse the Python code, with minimal modifications to make it compatible with current ExecuTorch: