@@ -78,18 +78,18 @@ mistralai/Mixtral-8x7B-Instruct-v0.1
To run the jetstream-pytorch server with one model:
```
- jpt serve --model_id --model_id meta-llama/Meta-Llama-3-8B-Instruct
+ jpt serve --model_id meta-llama/Meta-Llama-3-8B-Instruct
```
- If it the first time you run this model, it will download weights from
+ If it's the first time you run this model, it will download weights from
HuggingFace.
HuggingFace's Llama3 weights are gated, so you need to either run
` huggingface-cli login ` to set your token, or pass your hf_token explicitly.
- To pass hf token, add ` --hf_token ` flag
+ To pass your HF token explicitly, add the ` --hf_token ` flag:
```
- jpt serve --model_id --model_id meta-llama/Meta-Llama-3-8B-Instruct --hf_token=...
+ jpt serve --model_id meta-llama/Meta-Llama-3-8B-Instruct --hf_token=...
```
To log in using the HuggingFace hub, run:
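Presumably the same ` huggingface-cli login ` command noted above:

```
huggingface-cli login
```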
@@ -109,6 +109,13 @@ Quantization will be done on the flight as the weight loads.
Weights downloaded from HuggingFace will be stored by default in the ` checkpoints ` folder
in the directory where ` jpt ` is executed.
+ You can change where the weights are stored with the ` --working_dir ` flag.
+
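+ For instance, a minimal sketch of pointing the weights at a different location (the directory path here is hypothetical):
+
+ ```
+ jpt serve --model_id meta-llama/Meta-Llama-3-8B-Instruct --working_dir /data/jpt-checkpoints
+ ```
+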
+ If you wish to use your own checkpoints, place them inside
+ the ` checkpoints/<org>/<model>/hf_original ` dir (or the corresponding subdir under ` --working_dir ` ). For example,
+ Llama2 checkpoints will be at ` checkpoints/meta-llama/Llama-2-7b-hf/hf_original/*.safetensors ` . You can replace these files with modified
+ weights in HuggingFace format.
+
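+ For instance, a sketch of dropping in your own fine-tuned weights (both paths here are hypothetical):
+
+ ```
+ cp /path/to/my-finetuned-llama3/*.safetensors checkpoints/meta-llama/Meta-Llama-3-8B-Instruct/hf_original/
+ ```
+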
# Run the server with Ray
Below are the steps to run the server with Ray: