Hi there,
I'm new to quantization. From my understanding, "8da4w" means that the weights are pre-quantized to 4 bits, and the activations are quantized to 8 bits at runtime. Following this, the GEMM (General Matrix Multiply) operation between weights and activations is computed in the int8 data type. Do I have this correct?
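To make sure I'm describing my mental model correctly, here is a toy numpy sketch of how I imagine the 8da4w flow would work (this is my own simplification, not torchao's actual kernel; I accumulate in int32, since a pure-int8 accumulator would overflow):

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(4, 6).astype(np.float32)   # weight [out_features, in_features]
x = np.random.randn(2, 6).astype(np.float32)   # activation [batch, in_features]

# Offline: per-channel symmetric int4 weight quantization (range [-8, 7]).
w_scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
w_q = np.clip(np.round(w / w_scale), -8, 7).astype(np.int8)

# Runtime: per-token symmetric int8 activation quantization (range [-128, 127]).
x_scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
x_q = np.clip(np.round(x / x_scale), -128, 127).astype(np.int8)

# Integer GEMM accumulated in int32, then rescaled back to float.
acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
y = acc.astype(np.float32) * x_scale * w_scale.T

y_ref = x @ w.T
print(np.max(np.abs(y - y_ref)))  # small quantization error vs. the float reference
```

Is this roughly what the hardware-friendly path is supposed to look like?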
However, I'm confused by the code for Int8DynActInt4WeightQuantizer. The forward method of Int8DynActInt4WeightLinear calls a method named per_token_dynamic_quant, which can be found here. In this method, the input is first quantized to int8 and then immediately converted back to its original data type without further processing. I don't understand the purpose of this function. Furthermore, I ran a program using Int8DynActInt4WeightQuantizer and inspected the data types of x and w_dq in the method linear_forward_8da4w, which can be found here: both are float32. This seems to contradict my understanding of the computations involved in '8da4w'.
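For reference, here is a minimal numpy sketch of what the quantize-then-dequantize round trip in per_token_dynamic_quant appears to do (the function name per_token_dynamic_quant_sim and the exact scale computation are my own simplification, not torchao's code):

```python
import numpy as np

def per_token_dynamic_quant_sim(x: np.ndarray) -> np.ndarray:
    # Per-token (per-row) symmetric scale, computed dynamically from the input.
    qmin, qmax = -128, 127
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # avoid division by zero
    # Quantize to int8 ...
    q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)
    # ... then immediately dequantize back to the original float dtype.
    return (q.astype(np.float32) * scale).astype(x.dtype)

np.random.seed(0)
x = np.random.randn(2, 8).astype(np.float32)
x_fq = per_token_dynamic_quant_sim(x)
print(x_fq.dtype)                 # float32 — same dtype as the input
print(np.max(np.abs(x - x_fq)))  # only a small rounding error remains
```

So the output is still float32, just with int8 rounding error baked in, which matches what I observed in linear_forward_8da4w. Is this "fake quantization" intended to simulate the integer arithmetic rather than actually perform it?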
I realize that I'm likely missing some fundamental aspects of dynamic quantization. Could anyone kindly clarify this process for me?
Thank you!