
Commit fdd266c

pytorchbot and lucylq authored
llama2 readme (#3315) (#3326)
Summary:
- add note for embedding quantize, for llama3
- re-order export args to be the same as llama2, `group_size` missing `--`

Pull Request resolved: #3315
Reviewed By: cccclai
Differential Revision: D56528535
Pulled By: lucylq
fbshipit-source-id: 4453070339ebdb3d782b45f96fe43d28c7006092
(cherry picked from commit 34f59ed)
Co-authored-by: Lucy Qiu <[email protected]>
1 parent 2d75a0b commit fdd266c

File tree

1 file changed: +12 −0 lines


examples/models/llama2/README.md

Lines changed: 12 additions & 0 deletions
@@ -100,6 +100,18 @@ If you want to deploy and run a smaller model for educational purposes. From `ex
     python -m examples.models.llama2.tokenizer.tokenizer -t tokenizer.model -o tokenizer.bin
     ```
+
+### Option C: Download and export Llama3 8B model
+
+You can export and run the original Llama3 8B model.
+
+1. Llama3 pretrained parameters can be downloaded from [Meta's official llama3 repository](https://github.com/meta-llama/llama3/).
+
+2. Export model and generate `.pte` file
+    ```
+    python -m examples.models.llama2.export_llama --checkpoint <consolidated.00.pth> -p <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32 --metadata '{"get_bos_id":128000, "get_eos_id":128001}' --embedding-quantize 4,32 --output_name="llama3_kv_sdpa_xnn_qe_4_32.pte"
+    ```
+
+Due to the larger vocabulary size of Llama3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` to further reduce the model size.
 
 ## (Optional) Finetuning
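The `--embedding-quantize 4,32` recommendation in the added section can be sanity-checked with a back-of-the-envelope size estimate. The sketch below is illustrative only and is not part of the export flow: it assumes a Llama3-like vocabulary of 128256 tokens and an embedding dimension of 4096 (neither stated in this diff), 4-bit weights packed two per byte, and one fp16 scale per group of 32 values.

```python
# Rough size estimate for a groupwise 4-bit quantized embedding table,
# mirroring the `--embedding-quantize 4,32` setting (4 bits, group size 32).
# Assumed shapes (hypothetical, not from the diff): vocab=128256, dim=4096.

def embedding_bytes_fp32(vocab: int, dim: int) -> int:
    # 4 bytes per fp32 weight
    return vocab * dim * 4

def embedding_bytes_4bit(vocab: int, dim: int, group_size: int = 32) -> int:
    packed = vocab * dim // 2                   # two 4-bit values per byte
    scales = vocab * (dim // group_size) * 2    # one fp16 scale per group
    return packed + scales

if __name__ == "__main__":
    fp32 = embedding_bytes_fp32(128256, 4096)
    q4 = embedding_bytes_4bit(128256, 4096)
    print(f"fp32 embeddings:  {fp32 / 2**30:.2f} GiB")
    print(f"4-bit g32 embeds: {q4 / 2**30:.2f} GiB ({fp32 / q4:.1f}x smaller)")
```

Under these assumptions the embedding table shrinks by roughly 7x, which is why the larger Llama3 vocabulary makes embedding quantization worthwhile even after the rest of the model is quantized with `8da4w`.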
