
Understanding 8da4w #430

Closed
@DzAvril

Description


Hi there,

I'm new to quantization. My understanding of "8da4w" is that the weights are pre-quantized to 4 bits and the activations are quantized to 8 bits at runtime, so the GEMM (general matrix multiply) between activations and weights is then computed in the int8 data type. Is that correct?
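To make my mental model concrete, here is a rough sketch of what I imagined happens. These helpers are my own hypothetical illustrations (symmetric per-token int8 for activations, symmetric per-group int4 for weights), not the torchao code:

```python
import torch

def quantize_per_token_int8(x: torch.Tensor):
    # Hypothetical helper: symmetric per-token quantization, one scale per row.
    scales = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    x_int8 = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return x_int8, scales

def quantize_per_group_int4(w: torch.Tensor, group_size: int = 32):
    # Hypothetical helper: symmetric per-group quantization of the weight
    # to the signed 4-bit range [-8, 7].
    w_grouped = w.reshape(w.shape[0], -1, group_size)
    scales = w_grouped.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    w_int4 = torch.clamp(torch.round(w_grouped / scales), -8, 7).to(torch.int8)
    return w_int4, scales

x_int8, x_scales = quantize_per_token_int8(torch.randn(4, 64))
w_int4, w_scales = quantize_per_group_int4(torch.randn(128, 64))
# In this mental model, the matmul itself would run on the integer tensors
# (accumulating in int32) and the result would be rescaled back to float.
```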

However, I'm confused by the code for Int8DynActInt4WeightQuantizer. The forward method of Int8DynActInt4WeightLinear calls a method named per_token_dynamic_quant, which can be found here. In this method, the input is first quantized to int8 and then immediately converted back to its original data type without any further processing, and I don't understand the purpose of that. Furthermore, I ran a program that uses Int8DynActInt4WeightQuantizer and inspected the data types of x and w_dq in the method linear_forward_8da4w, which can be found here: both are float32. This seems to contradict my understanding of how the '8da4w' computation works.
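For reference, this is a minimal sketch of the quantize-then-dequantize pattern I'm describing, written in my own simplified form (assuming symmetric per-token quantization; this is not the actual torchao implementation):

```python
import torch

def per_token_quant_dequant(x: torch.Tensor) -> torch.Tensor:
    # Sketch of the observed pattern: quantize to int8 per token, then
    # immediately dequantize back to the original floating-point dtype.
    scales = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 127.0
    x_int8 = torch.clamp(torch.round(x / scales), -128, 127)
    return (x_int8 * scales).to(x.dtype)

x = torch.randn(4, 16)
x_qdq = per_token_quant_dequant(x)
print(x.dtype, x_qdq.dtype)  # both remain float32, matching what I observed
```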

I realize that I'm likely missing some fundamental aspects of dynamic quantization. Could anyone kindly clarify this process for me?

Thank you!
