Commit c625680

hsubramonyregisss authored and committed
Update text-gen README.md to add auto-gptq fork install steps (huggingface#1442)
1 parent df6b919

1 file changed: +5 −1 lines

examples/text-generation/README.md

@@ -282,7 +282,7 @@ You will also need to add `--torch_compile` and `--parallel_strategy="tp"` in yo
 Here is an example:
 ```bash
 PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py --world_size 8 run_generation.py \
---model_name_or_path meta-llama/Llama-2-70b-hf \
+--model_name_or_path meta-llama/Llama-2-7b-hf \
 --trim_logits \
 --use_kv_cache \
 --attn_softmax_bf16 \
@@ -593,6 +593,10 @@ For more details see [documentation](https://docs.habana.ai/en/latest/PyTorch/Mo
 Llama2-7b in UINT4 weight only quantization is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
 Currently, the support is for UINT4 inference of pre-quantized models only.
 
+```bash
+BUILD_CUDA_EXT=0 python -m pip install -vvv --no-build-isolation git+https://github.com/HabanaAI/AutoGPTQ.git
+```
+
 You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the following environment variables:
 `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
 and by adding the argument `--load_quantized_model_with_autogptq`.
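Putting the two halves of the change together, an end-to-end invocation might look like the sketch below. The install line is taken verbatim from the diff; the run command is an assumption that combines the environment variables and `--load_quantized_model_with_autogptq` flag named in the diff with the `run_generation.py` script shown in the README's other examples. The model path is a placeholder, and any extra generation flags would follow the README's usual options.

```shell
# Install the Habana AutoGPTQ fork; BUILD_CUDA_EXT=0 skips building
# CUDA extensions, which are not needed on Gaudi hardware.
BUILD_CUDA_EXT=0 python -m pip install -vvv --no-build-isolation \
    git+https://github.com/HabanaAI/AutoGPTQ.git

# Run inference on a pre-quantized UINT4 model (hypothetical invocation;
# replace the placeholder with an actual GPTQ-quantized checkpoint).
SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false \
ENABLE_EXPERIMENTAL_FLAGS=true \
python run_generation.py \
    --model_name_or_path <pre-quantized-uint4-model> \
    --use_kv_cache \
    --load_quantized_model_with_autogptq
```

This is a configuration/invocation sketch rather than a tested command: it requires Gaudi hardware and a pre-quantized checkpoint to actually run.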
