How to calibrate a w8a8 quantized model? #1002
Got it, thanks for the reply!
I used the following code to test the performance of w8a8:

```python
import time

import torch


@torch.no_grad()
def generate(model, tokenizer, device, prompt, max_new_tokens):
    # Tokenize the prompt and time a single generate() call.
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    start = time.time()
    outputs = model.generate(
        input_ids=inputs.input_ids.to(device),
        attention_mask=inputs.attention_mask.to(device),
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.9,
    )
    end = time.time()
    generated_text = tokenizer.decode(outputs[0])
    print(f"Generated '{generated_text}' in [{end - start:.2f} s]")
```

But I ran into a performance problem. I tested on an Intel CPU: with the same prompt, the Hugging Face FP16 model finishes in 3 seconds, while the quantized model takes 60 seconds. Am I missing any steps?
@chenghuaWang can you try running
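The exact call suggested above was not preserved in this extract. One common way to recover CPU speed with torchao's int8 dynamic quantization is to wrap the model with `torch.compile`; a minimal sketch, assuming that is the kind of step being suggested here:

```python
import torch

# Compile the forward pass so the quantized ops are lowered into fused, optimized kernels.
# mode="max-autotune" trades a longer first call (compilation) for faster steady-state runs.
model.forward = torch.compile(model.forward, mode="max-autotune")

# The first generate() call after this will be slow while compilation happens; time later calls.
```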
Unfortunately, after using
I will test it on some accelerators. Thank you for your answer.
I used the following code to quantize an LLM with a w8a8 quantization setting:
Everything runs smoothly, but the model's accuracy has dropped significantly. How can I calibrate the quantized model to recover its accuracy?
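The original quantization snippet is not shown above. For context, a rough sketch of what a w8a8 (int8 dynamic activation, int8 weight) setup with torchao typically looks like; the checkpoint name is a placeholder and this is not necessarily the code used in this issue:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from torchao.quantization import int8_dynamic_activation_int8_weight, quantize_

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint, not from the original post
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Replace every nn.Linear in place with an int8-weight, dynamically-quantized-activation version.
quantize_(model, int8_dynamic_activation_int8_weight())
```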
I have another question:
I printed out a parameter and noticed that the weights were quantized using per-channel quantization. What is the purpose of the fp16 AffineQuantizedTensor? Shouldn't the activation only require one scale parameter when using per-tensor quantization?
I'm not very familiar with the quantization mechanism in PyTorch, and I hope you can give me some tips.
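On the second question: per-channel weight quantization keeps one scale per output channel, while per-tensor quantization of an activation does indeed keep a single scale. A plain-PyTorch sketch of the difference, for intuition only (the shapes are hypothetical, and this is not how torchao computes its scales internally):

```python
import torch

weight = torch.randn(4096, 11008)  # [out_features, in_features], hypothetical linear layer

# Per-tensor symmetric int8 scale: a single scalar for the whole tensor.
per_tensor_scale = weight.abs().max() / 127.0          # shape: torch.Size([])

# Per-channel symmetric int8 scales: one scalar per output channel (row).
per_channel_scale = weight.abs().amax(dim=1) / 127.0   # shape: torch.Size([4096])

print(per_tensor_scale.shape, per_channel_scale.shape)
```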