
Commit ed0bf32

readme : modernize (ggml-org#5379)
* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md
1 parent 9a697d8 commit ed0bf32

File tree

2 files changed: +36 −131 lines changed

README.md

+36 −91
@@ -33,17 +33,14 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
 <li><a href="#get-the-code">Get the Code</a></li>
 <li><a href="#build">Build</a></li>
 <li><a href="#blas-build">BLAS Build</a></li>
-<li><a href="#prepare-data--run">Prepare Data & Run</a></li>
+<li><a href="#prepare-and-quantize">Prepare and Quantize</a></li>
+<li><a href="#run-the-quantized-model">Run the quantized model</a></li>
 <li><a href="#memorydisk-requirements">Memory/Disk Requirements</a></li>
 <li><a href="#quantization">Quantization</a></li>
 <li><a href="#interactive-mode">Interactive mode</a></li>
 <li><a href="#constrained-output-with-grammars">Constrained output with grammars</a></li>
-<li><a href="#instruction-mode-with-alpaca">Instruction mode with Alpaca</a></li>
-<li><a href="#using-openllama">Using OpenLLaMA</a></li>
-<li><a href="#using-gpt4all">Using GPT4All</a></li>
-<li><a href="#using-pygmalion-7b--metharme-7b">Using Pygmalion 7B & Metharme 7B</a></li>
-<li><a href="#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data">Obtaining the Facebook LLaMA original model and Stanford Alpaca model data</a></li>
-<li><a href="#verifying-the-model-files">Verifying the model files</a></li>
+<li><a href="#instruct-mode">Instruct mode</a></li>
+<li><a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a></li>
 <li><a href="#seminal-papers-and-background-on-the-models">Seminal papers and background on the models</a></li>
 <li><a href="#perplexity-measuring-model-quality">Perplexity (measuring model quality)</a></li>
 <li><a href="#android">Android</a></li>
@@ -83,20 +80,16 @@ improved significantly thanks to many contributions. It is the main playground f

 **Supported models:**

+Typically finetunes of the base models below are supported as well.
+
 - [X] LLaMA 🦙
 - [x] LLaMA 2 🦙🦙
-- [X] [Mistral AI v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1)
+- [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
 - [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
 - [X] Falcon
-- [X] [Alpaca](https://github.com/ggerganov/llama.cpp#instruction-mode-with-alpaca)
-- [X] [GPT4All](https://github.com/ggerganov/llama.cpp#using-gpt4all)
 - [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
 - [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
-- [X] [Vicuna](https://github.com/ggerganov/llama.cpp/discussions/643#discussioncomment-5533894)
 - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
-- [X] [OpenBuddy 🐶 (Multilingual)](https://github.com/OpenBuddy/OpenBuddy)
-- [X] [Pygmalion/Metharme](#using-pygmalion-7b--metharme-7b)
-- [X] [WizardLM](https://github.com/nlpxucan/WizardLM)
 - [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
 - [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
 - [X] [Starcoder models](https://github.com/ggerganov/llama.cpp/pull/3187)
@@ -166,7 +159,7 @@ Unless otherwise noted these projects are open-source with permissive licensing:

 Here is a typical run using LLaMA v2 13B on M2 Ultra:

-```java
+```
 $ make -j && ./main -m models/llama-13b-v2/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
 I llama.cpp build info:
 I UNAME_S: Darwin
@@ -250,7 +243,7 @@ https://user-images.githubusercontent.com/1991296/224442907-7693d4be-acaa-4e01-8

 ## Usage

-Here are the end-to-end binary build and model conversion steps for the LLaMA-7B model.
+Here are the end-to-end binary build and model conversion steps for most supported models.

 ### Get the Code

@@ -635,7 +628,7 @@ Building the program with BLAS support may lead to some performance improvements

 **Without docker**:

-Firstly, you need to make sure you installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)
+Firstly, you need to make sure you have installed [Vulkan SDK](https://vulkan.lunarg.com/doc/view/latest/linux/getting_started_ubuntu.html)

 For example, on Ubuntu 22.04 (jammy), use the command below:

@@ -648,6 +641,8 @@ Building the program with BLAS support may lead to some performance improvements
 vulkaninfo
 ```

+Alternatively, your package manager might be able to provide the appropriate libraries. For example, for Ubuntu 22.04 you can install `libvulkan-dev` instead.
+
 Then, build llama.cpp using the cmake command below:

 ```bash
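
Not shown in the hunk above is the package-manager command that the added `libvulkan-dev` note implies. A minimal sketch for Ubuntu 22.04, assuming the stock `apt` repositories (this command is not part of the commit):

```bash
# install only the Vulkan development files instead of the full LunarG SDK
sudo apt-get update
sudo apt-get install libvulkan-dev
```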
@@ -662,34 +657,42 @@ Building the program with BLAS support may lead to some performance improvements
 # ggml_vulkan: Using Intel(R) Graphics (ADL GT2) | uma: 1 | fp16: 1 | warp size: 32
 ```

-### Prepare Data & Run
+### Prepare and Quantize
+
+To obtain the official LLaMA 2 weights please see the <a href="#obtaining-and-using-the-facebook-llama-2-model">Obtaining and using the Facebook LLaMA 2 model</a> section. There is also a large selection of pre-quantized `gguf` models available on Hugging Face.

 ```bash
-# obtain the original LLaMA model weights and place them in ./models
+# obtain the official LLaMA model weights and place them in ./models
 ls ./models
-65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model
+llama-2-7b tokenizer_checklist.chk tokenizer.model
 # [Optional] for models using BPE tokenizers
 ls ./models
-65B 30B 13B 7B vocab.json
+<folder containing weights and tokenizer json> vocab.json
+# [Optional] for PyTorch .bin models like Mistral-7B
+ls ./models
+<folder containing weights and tokenizer json>

 # install Python dependencies
 python3 -m pip install -r requirements.txt

-# convert the 7B model to ggml FP16 format
-python3 convert.py models/7B/
+# convert the model to ggml FP16 format
+python3 convert.py models/mymodel/

 # [Optional] for models using BPE tokenizers
-python convert.py models/7B/ --vocabtype bpe
+python convert.py models/mymodel/ --vocabtype bpe

-# quantize the model to 4-bits (using q4_0 method)
-./quantize ./models/7B/ggml-model-f16.gguf ./models/7B/ggml-model-q4_0.gguf q4_0
+# quantize the model to 4-bits (using Q4_K_M method)
+./quantize ./models/mymodel/ggml-model-f16.gguf ./models/mymodel/ggml-model-Q4_K_M.gguf Q4_K_M

-# update the gguf filetype to current if older version is unsupported by another application
-./quantize ./models/7B/ggml-model-q4_0.gguf ./models/7B/ggml-model-q4_0-v2.gguf COPY
+# update the gguf filetype to current version if older version is now unsupported
+./quantize ./models/mymodel/ggml-model-Q4_K_M.gguf ./models/mymodel/ggml-model-Q4_K_M-v2.gguf COPY
+```

+### Run the quantized model

-# run the inference
-./main -m ./models/7B/ggml-model-q4_0.gguf -n 128
+```bash
+# start inference on a gguf model
+./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf -n 128
 ```

 When running the larger models, make sure you have enough disk space to store all the intermediate files.
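
The new "Prepare and Quantize" text points to pre-quantized `gguf` files on Hugging Face but does not show a download step. A minimal sketch, assuming the `huggingface-cli` tool from the `huggingface_hub` package; the repository and file names are illustrative, not taken from the commit:

```bash
# install the Hugging Face CLI
python3 -m pip install huggingface_hub

# fetch an example pre-quantized model into ./models (illustrative repo/file names)
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir ./models
```

A file obtained this way can be passed straight to `./main -m`, skipping the convert and quantize steps shown in the hunk.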
@@ -710,7 +713,7 @@ From the unzipped folder, open a terminal/cmd window here and place a pre-conver

 As the models are currently fully loaded into memory, you will need adequate disk space to save them and sufficient RAM to load them. At the moment, memory and disk requirements are the same.

-| Model | Original size | Quantized size (4-bit) |
+| Model | Original size | Quantized size (Q4_0) |
 |------:|--------------:|-----------------------:|
 | 7B | 13 GB | 3.9 GB |
 | 13B | 24 GB | 7.8 GB |
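
As a rough sanity check on the table above (an editor's back-of-envelope estimate, not from the commit): LLaMA 7B has about 6.7 billion parameters, f16 stores 2 bytes per weight, and Q4_0 works out to roughly 4.5 bits per weight, which lines up approximately with the 13 GB and 3.9 GB entries:

```bash
# back-of-envelope size estimate (assumed: ~6.7e9 params, 16 bits/weight for f16, ~4.5 bits/weight for Q4_0)
python3 -c 'p = 6.7e9; print(f"f16 ~ {p*2/1e9:.1f} GB, Q4_0 ~ {p*4.5/8/1e9:.1f} GB")'
```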
@@ -826,9 +829,9 @@ The `grammars/` folder contains a handful of sample grammars. To write your own,

 For authoring more complex JSON grammars, you can also check out https://grammar.intrinsiclabs.ai/, a browser app that lets you write TypeScript interfaces which it compiles to GBNF grammars that you can save for local use. Note that the app is built and maintained by members of the community, please file any issues or FRs on [its repo](http://github.com/intrinsiclabsai/gbnfgen) and not this one.

-### Instruction mode with Alpaca
+### Instruct mode

-1. First, download the `ggml` Alpaca model into the `./models` folder
+1. First, download and place the `ggml` model into the `./models` folder
 2. Run the `main` tool like this:

 ```
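
The command for step 2 falls outside this hunk. A minimal sketch of an instruct-mode invocation, assuming `main` still accepts the `-ins`/`--instruct` flag at this commit; the flags and model path are my assumption, not quoted from the diff:

```bash
# run the model in instruct mode (flags assumed, not taken from the commit)
./main -m ./models/mymodel/ggml-model-Q4_K_M.gguf --color -ins
```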
@@ -854,50 +857,6 @@ cadaver, cauliflower, cabbage (vegetable), catalpa (tree) and Cailleach.
 >
 ```

-### Using [OpenLLaMA](https://github.com/openlm-research/open_llama)
-
-OpenLLaMA is an openly licensed reproduction of Meta's original LLaMA model. It uses the same architecture and is a drop-in replacement for the original LLaMA weights.
-
-- Download the [3B](https://huggingface.co/openlm-research/open_llama_3b), [7B](https://huggingface.co/openlm-research/open_llama_7b), or [13B](https://huggingface.co/openlm-research/open_llama_13b) model from Hugging Face.
-- Convert the model to ggml FP16 format using `python convert.py <path to OpenLLaMA directory>`
-
-### Using [GPT4All](https://github.com/nomic-ai/gpt4all)
-
-*Note: these instructions are likely obsoleted by the GGUF update*
-
-- Obtain the `tokenizer.model` file from LLaMA model and put it to `models`
-- Obtain the `added_tokens.json` file from Alpaca model and put it to `models`
-- Obtain the `gpt4all-lora-quantized.bin` file from GPT4All model and put it to `models/gpt4all-7B`
-- It is distributed in the old `ggml` format which is now obsoleted
-- You have to convert it to the new format using `convert.py`:
-
-```bash
-python3 convert.py models/gpt4all-7B/gpt4all-lora-quantized.bin
-```
-
-- You can now use the newly generated `models/gpt4all-7B/ggml-model-q4_0.bin` model in exactly the same way as all other models
-
-- The newer GPT4All-J model is not yet supported!
-
-### Using Pygmalion 7B & Metharme 7B
-
-- Obtain the [LLaMA weights](#obtaining-the-facebook-llama-original-model-and-stanford-alpaca-model-data)
-- Obtain the [Pygmalion 7B](https://huggingface.co/PygmalionAI/pygmalion-7b/) or [Metharme 7B](https://huggingface.co/PygmalionAI/metharme-7b) XOR encoded weights
-- Convert the LLaMA model with [the latest HF convert script](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py)
-- Merge the XOR files with the converted LLaMA weights by running the [xor_codec](https://huggingface.co/PygmalionAI/pygmalion-7b/blob/main/xor_codec.py) script
-- Convert to `ggml` format using the `convert.py` script in this repo:
-```bash
-python3 convert.py pygmalion-7b/ --outtype q4_1
-```
-> The Pygmalion 7B & Metharme 7B weights are saved in [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) precision. If you wish to convert to `ggml` without quantizating, please specify the `--outtype` as `f32` instead of `f16`.
-
-
-### Obtaining the Facebook LLaMA original model and Stanford Alpaca model data
-
-- **Under no circumstances should IPFS, magnet links, or any other links to model downloads be shared anywhere in this repository, including in issues, discussions, or pull requests. They will be immediately deleted.**
-- The LLaMA models are officially distributed by Facebook and will **never** be provided through this repository.
-- Refer to [Facebook's LLaMA repository](https://github.com/facebookresearch/llama/pull/73/files) if you need to request access to the model data.
-
 ### Obtaining and using the Facebook LLaMA 2 model

 - Refer to [Facebook's LLaMA download page](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) if you want to access the model data.
@@ -909,20 +868,6 @@ python3 convert.py pygmalion-7b/ --outtype q4_1
 - [LLaMA 2 13B chat](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF)
 - [LLaMA 2 70B chat](https://huggingface.co/TheBloke/Llama-2-70B-chat-GGUF)

-### Verifying the model files
-
-Please verify the [sha256 checksums](SHA256SUMS) of all downloaded model files to confirm that you have the correct model data files before creating an issue relating to your model files.
-- The following python script will verify if you have all possible latest files in your self-installed `./models` subdirectory:
-
-```bash
-# run the verification script
-./scripts/verify-checksum-models.py
-```
-
-- On linux or macOS it is also possible to run the following commands to verify if you have all possible latest files in your self-installed `./models` subdirectory:
-- On Linux: `sha256sum --ignore-missing -c SHA256SUMS`
-- on macOS: `shasum -a 256 --ignore-missing -c SHA256SUMS`
-
 ### Seminal papers and background on the models

 If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
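
With the SHA256SUMS file and its verification workflow removed by this commit, a downloaded model can still be checked by hand against the checksum published wherever the file was obtained. A minimal sketch using the same tools named in the removed lines; the file name is illustrative:

```bash
# compute the checksum and compare it against the value published alongside the download
sha256sum ./models/llama-2-7b.Q4_K_M.gguf        # Linux
shasum -a 256 ./models/llama-2-7b.Q4_K_M.gguf    # macOS
```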

SHA256SUMS

-40
This file was deleted.
