Understanding 8da4w

Hi there,

I'm new to quantization. From my understanding, "8da4w" means that the weights are pre-quantized to 4 bits, and the activations are quantized to 8 bits at runtime. Following this, the GEMM (General Matrix Multiply) operation between weights and activations is computed in the `int8` data type. Do I have this correct?

However, I'm confused by the code for `Int8DynActInt4WeightQuantizer`. The `forward` method of `Int8DynActInt4WeightLinear` calls a method named `per_token_dynamic_quant`, which can be found [here](https://github.com/pytorch/ao/blob/fd9f95d614fa03f09d85d73a2c2740cc647d7b9b/torchao/quantization/utils.py#L436-L458). In this method, the input is first quantized to `int8` and then immediately converted back to its original data type without further processing. I don't understand the purpose of this function. Furthermore, I have launched a program using `Int8DynActInt4WeightQuantizer ` and observed the data types of `x` and `w_dq` in the method `linear_forward_8da4w`, which can be found [here](https://github.com/pytorch/ao/blob/fd9f95d614fa03f09d85d73a2c2740cc647d7b9b/torchao/quantization/GPTQ.py#L800), they both are `float32`. This seems to contradict my understanding of the computations involved in '8da4w'.

I realize that I'm likely missing some fundamental aspects of dynamic quantization. Could anyone kindly clarify this process for me?

Thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Understanding 8da4w #430

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Understanding 8da4w #430

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions