Description
Hi there,
I'm new to quantization. From my understanding, "8da4w" means that the weights are pre-quantized to 4 bits and the activations are quantized to 8 bits at runtime. Following that, the GEMM (General Matrix Multiply) between the weights and activations is computed in the `int8` data type. Do I have this correct?
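To make my mental model concrete, here is a toy sketch of what I pictured happening (my own simplification, not actual torchao code; the helper name, shapes, and scales below are all made up for illustration):

```python
import torch

def per_token_quantize_int8(x: torch.Tensor):
    # One scale per token (row), computed dynamically from the row's abs-max.
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 127.0
    x_int8 = torch.round(x / scales).clamp(-128, 127).to(torch.int8)
    return x_int8, scales

x = torch.randn(2, 8)                      # float activations
w_int4 = torch.randint(-8, 8, (4, 8))      # pretend 4-bit weights (values in [-8, 7])
w_scales = torch.full((4, 1), 0.05)        # per-channel weight scales

x_int8, x_scales = per_token_quantize_int8(x)
acc = x_int8.to(torch.int32) @ w_int4.to(torch.int32).t()  # integer matmul (CPU)
y = acc.to(torch.float32) * x_scales * w_scales.t()        # rescale back to float output
```

That is, I expected the accumulation itself to happen on integer values, with the scales applied afterwards.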
However, I'm confused by the code for `Int8DynActInt4WeightQuantizer`. The `forward` method of `Int8DynActInt4WeightLinear` calls a method named `per_token_dynamic_quant`, which can be found here. In this method, the input is first quantized to `int8` and then immediately converted back to its original data type without further processing, and I don't understand the purpose of this function. Furthermore, I launched a program using `Int8DynActInt4WeightQuantizer` and inspected the data types of `x` and `w_dq` in the method `linear_forward_8da4w`, which can be found here: both are `float32`. This seems to contradict my understanding of the computations involved in "8da4w".
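For reference, this is my own reconstruction of the quantize-then-dequantize pattern I think I'm seeing (a simplified sketch, not the actual torchao source). If the forward pass really reduces to this, the matmul itself would still run in `float32`:

```python
import torch

def per_token_quant_dequant(x: torch.Tensor) -> torch.Tensor:
    # My simplified reading of what per_token_dynamic_quant appears to do:
    # quantize each token to int8, then immediately dequantize back to the
    # original floating-point dtype before any matmul happens.
    scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-9) / 127.0
    x_int8 = torch.round(x / scales).clamp(-128, 127).to(torch.int8)
    return x_int8.to(x.dtype) * scales   # dequantize: back to the original dtype

x = torch.randn(2, 8)
print(per_token_quant_dequant(x).dtype)  # torch.float32, matching what I observed
```

If that reading is right, the int8 quantization only perturbs the values (rounding them to the int8 grid) rather than changing the dtype used for the GEMM, which is what I don't understand.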
I realize that I'm likely missing some fundamental aspects of dynamic quantization. Could anyone kindly clarify this process for me?
Thank you!