int4wo can't use same packed weight for cpu and cuda #1117
Comments
cc @yanbing-j
int4 woq indeed cannot use the same packed weight for CPU (AVX2/AVX512) and CUDA because the packing methods differ across ISAs and devices. We raised a PR pytorch/pytorch#129940, which uses a common serialized layout.
@yanbing-j that does not help with solving the problem, I think; what we want is to do a device conversion after quantization (quantize on one device, then move the packed weight to another).
@jerryzh168 cc @mingfeima Mingfei, do you have any other comments about this issue?
If the packing is specific to a device, I feel the most natural way to structure it is to define them with different layouts. A packing format (or layout) in principle should not be device-specific, since it only describes how the tensor values are rearranged; people should be able to implement the same algorithm on different devices, even if that format is only used on one specific device.
@jerryzh168 I agree that the packing format is not specific to a device; that's exactly why we want to decouple it. The thing is, on CUDA, int4 is packed into a very special format. I assume that your concept of a non-device-specific packing layout refers to …
@mingfeima I think it could look like:

quantize: …

inference: …

Basically, conversion between these layouts should be explicit in the user code, or we can have a util for that, I think. Would this work for you?
I am afraid this won't work. This is actually the first version of my int4 weight-only quantization patch on CPU: I tried to use different output shapes for CPU and CUDA, see pytorch/pytorch@30befa5#diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3R3429. The problem comes from torch/_meta_registrations.py, which requires an op to have an identical output shape across devices (maybe I am not being precise here; I can't remember the details, it has been too long).
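For context, a minimal illustrative sketch of why a single op is constrained to one device-agnostic output shape, using torch.library's custom-op/fake-kernel machinery (PyTorch 2.4+) for demonstration; this is not the actual registration in torch/_meta_registrations.py:

```python
import torch

# Hypothetical demo op, not the real aten._convert_weight_to_int4pack.
@torch.library.custom_op("demo::convert_weight_to_int4pack", mutates_args=())
def convert_weight_to_int4pack(weight: torch.Tensor) -> torch.Tensor:
    # Placeholder "packing": real kernels rearrange values per backend.
    n, k = weight.shape
    return weight.to(torch.uint8)[:, : k // 2].contiguous()

@convert_weight_to_int4pack.register_fake
def _(weight):
    n, k = weight.shape
    # The fake/meta kernel must produce one shape for every backend; it cannot
    # special-case CPU vs CUDA, which is why different packed shapes per device
    # need different ops (or different layouts).
    return weight.new_empty((n, k // 2), dtype=torch.uint8)
```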
So it's reasonable that the same op should output the same shape, but I'm talking about different ops here. Would that work? Is "CPU and CUDA using the same packing op" a requirement?
If you think using different packing ops for CPU and CUDA is OK, it certainly works for us.
@mingfeima yeah I think so, could you or @yanbing-j help with refactoring CPU packing and int4mm into different ops?
Hi @jerryzh168, as for your request of separating CPU packing and CUDA packing into different ops, we can do this like the following figure. However, this behavior still cannot solve the initial repro of this issue, i.e. moving the packed weight between devices. cc @mingfeima
Yeah, the proposed format for the CPU packed weight looks good; also, the packing op should be talking about layout instead of device, I think.
Yeah, it's correct that we are never going to support just changing the packed weight for tinygemm from CUDA to CPU and expecting that to work on CPU. The proposed API is that we have two layouts: the existing tinygemm layout for the CUDA packed format, and a separate CPU layout (the `INT4CPULayout` referenced in the PR below) for the CPU packed format.
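A rough sketch of how that could surface in user code, assuming torchao's `quantize_` / `int4_weight_only` API; the layout class names and the `layout=` keyword are taken from this discussion and the PR below, not verified against a specific torchao release:

```python
# Sketch only: two explicit layouts, one per packed format. The commented-out
# layout class names and the `layout=` keyword are assumptions based on this
# discussion, not a verified torchao API surface.
import torch
from torchao.quantization import quantize_, int4_weight_only
# from torchao.dtypes import TensorCoreTiledLayout, Int4CPULayout  # assumed names

cpu_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)
cuda_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16).to("cuda")

# Each model is quantized with the layout matching the device it will run on;
# the resulting packed weights are NOT interchangeable between devices.
# quantize_(cpu_model, int4_weight_only(layout=Int4CPULayout()))
# quantize_(cuda_model, int4_weight_only(layout=TensorCoreTiledLayout(inner_k_tiles=8)))
quantize_(cuda_model, int4_weight_only())  # default layout targets the CUDA tinygemm path
```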
@jerryzh168 Thanks for the confirmation! I drafted a PR pytorch/pytorch#139611 to split int4wo weight packing. For convenience, I set the input `weight` of the new CPU op to [n, k] int32 (as described in the PR below). cc @mingfeima.
@yanbing-j thanks, why do you use [n, k] int32 as the input?
@jerryzh168 I suppose keeping the int32 input adds no extra cost. cc @mingfeima, do you have any other comments?
OK, if there is no extra cost then that's fine; just want to say that we don't strictly need the int32 input.
sounds good to me ~
@jerryzh168 @mingfeima Thanks for the comments! Please review pytorch/pytorch#139611; I will fix the CI failures in parallel, and will also include Nikita and Sanchit when the PR is ready.
Fixes pytorch/ao#1117. This PR separates int4wo weight packing between CPU and other devices, to help implement `INT4CPULayout` in torchao based on pytorch/ao#1117 (comment). Now, for CPU, the input `weight` of `_convert_weight_to_int4pack_for_cpu` is [n, k] int32 and the output is [n, k / 2] uint8. The input packed weight of `_weight_int4pack_mm_for_cpu` is [n, k / 2] uint8. Pull Request resolved: pytorch#139611 Approved by: https://github.com/jerryzh168
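To make the shape contract above concrete, a minimal standalone sketch of what "[n, k] int32 in, [n, k / 2] uint8 out" means; the nibble order is an assumption and the real CPU kernel may pack the pairs differently:

```python
# Minimal sketch of the CPU packed format described above: two int4 values per
# uint8 byte, so an [n, k] int32 weight becomes an [n, k/2] uint8 tensor.
# The high/low nibble order is an assumption, not the verified kernel behavior.
import torch

def pack_int4_rows(weight_int32: torch.Tensor) -> torch.Tensor:
    n, k = weight_int32.shape
    assert k % 2 == 0
    w = weight_int32.to(torch.uint8) & 0xF          # keep only the int4 value
    # Pair adjacent columns: even column -> high nibble, odd column -> low nibble.
    return (w[:, 0::2] << 4) | w[:, 1::2]

w = torch.randint(0, 16, (64, 256), dtype=torch.int32)
packed = pack_int4_rows(w)
assert packed.shape == (64, 128) and packed.dtype == torch.uint8
```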
This is mostly to keep track of a problem which has been around for a while: if you ever do something like 1) quantize a CPU model with int4, then 2) move it to CUDA, the output of the model will be nonsense.
E.g. if in https://github.com/pytorch/ao/blob/main/torchao/_models/llama/generate.py#L231 you quantize the model on CPU and then move it to CUDA, the output of the model is nonsensical:
Hello, my name is♠ zewnętrz zewnętrz@{ zewnętrz zewnętrz zewnętrz))] ord zewnętrzŻ zewnętrz zewnętrz zewnętrz zewnętrzŻ zewnętrz zewnętrz Хронологи
This happens because it simply moves the same packed weight from CPU to CUDA without addressing the fact that the packed format is numerically different for each backend:
https://github.com/pytorch/pytorch/blob/912ea5601bb3e7d360202927cb2de1ddc1d72cf6/aten/src/ATen/native/native_functions.yaml#L4144-L4148
Despite the different packing paths, there is no metadata to detect which backend's packing algorithm was actually used, so we can't even error out intelligently.
We could manually keep track of this in the affine quantized tensor and add code to unpack and repack when someone calls .to(device), but that doesn't fully solve the issue because, again, we can't detect the original format. Users can do things like serialize the model on CUDA and reload it on CPU, and we're in the same situation: when they then call .cuda() we would want to unpack and repack, but we would use the CPU unpacking, which won't work since the original packing was done on CUDA. You'd have to further add a field to track which device the packed weight was most recently packed on, and when someone calls .to(device), check what that original device was; if it differs from the current device, move the tensor back there first before the unpack/repack. We should either implement such a solution or identify whether this is going to be rectified in some other way.
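A rough sketch of the bookkeeping described above; `unpack_for` / `pack_for` are hypothetical helpers, not existing torchao functions:

```python
# Rough sketch of the proposed bookkeeping. `unpack_for` / `pack_for` are
# hypothetical per-device helpers; torchao does not necessarily expose these.
import torch

class PackedInt4Weight:
    def __init__(self, packed: torch.Tensor, packed_device: str):
        self.packed = packed
        self.packed_device = packed_device  # device whose packing format `packed` uses

    def to(self, device: str) -> "PackedInt4Weight":
        if device == self.packed_device:
            return self
        # Unpacking must run with the format of the device the weight was
        # originally packed for, so move it back there first if it has drifted
        # (e.g. after serializing on CUDA and reloading on CPU).
        packed = self.packed.to(self.packed_device)
        plain = unpack_for(self.packed_device, packed)   # hypothetical helper
        repacked = pack_for(device, plain.to(device))    # hypothetical helper
        return PackedInt4Weight(repacked, device)
```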
small repro:
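A minimal sketch of the kind of snippet that triggers this, assuming torchao's `quantize_` / `int4_weight_only` API; the exact original repro may differ:

```python
# Minimal sketch (assumes torchao's quantize_ / int4_weight_only API; the
# original repro in this issue may have differed in details).
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).to(torch.bfloat16)

quantize_(model, int4_weight_only())      # packs the weight with the CPU kernel's format
model = model.to("cuda")                  # moves the CPU-packed bytes unchanged

x = torch.randn(1, 1024, dtype=torch.bfloat16, device="cuda")
print(model(x))                           # CUDA tinygemm reads CPU-packed data -> garbage output
```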