Commit a79a36f

Fix doc format.
1 parent 40f6a18 commit a79a36f

File tree: 2 files changed (+24, −14 lines)


doc/fluid/design/quantization/fixed_point_quantization.md

Lines changed: 24 additions & 14 deletions
@@ -9,19 +9,19 @@ This document is to design a quantized training framework on Fluid. The first pa
 
 There are many ways to quantize the float value to a fixed-point value. For example:
 
-$$ r = min(max(x, a), b)$$
+$$ r = min(max(x, a), b)$$
 $$ s = \frac{b - a}{n - 1} $$
-$$ q = round(\frac{r - a}{s}}) $$
+$$ q = \left \lfloor \frac{r - a}{s} \right \rceil $$
 
-where, $x$ is the float value to be quantized, $[a, b]$ is the quantization range, $a$ is the minimum value and $b$ is the maximal value. `round` denotes rounding to the nearest integer. If the quantization level is $m$, $n$ is $2^m$, for example, $m$ is 8 and $n$ is 256. $q$ is the quantized integer.
+where $x$ is the float value to be quantized, $[a, b]$ is the quantization range, $a$ is the minimum value, and $b$ is the maximum value. $\left \lfloor \right \rceil$ denotes rounding to the nearest integer. If the quantization level is $k$, then $n$ is $2^k$; for example, when $k$ is 8, $n$ is 256. $q$ is the quantized integer.
 
 
 The quantization we applied is parameterized by the number of quantization levels and the maximum absolute value:
 
-$$ abs_max = max(abs(x)); $$
-$$ q = \left \lfloor \frac{x}{abs_max} * (n - 1) \right \rceil $$
+$$ M = max(abs(x)) $$
+$$ q = \left \lfloor \frac{x}{M} * (n - 1) \right \rceil $$
 
-where, $x$ is the float value to be quantized, $abs_max$ is maximum absolute value. $\left \lfloor \right \rceil$ denotes rounding to the nearest integer. For 8 bit quantization, $n=2^{8}=256$. $q$ is the quantized integer.
+where $x$ is the float value to be quantized and $M$ is the maximum absolute value. $\left \lfloor \right \rceil$ denotes rounding to the nearest integer. For 8-bit quantization, $n=2^{8}=256$. $q$ is the quantized integer.
 
 
 Whether *min-max* quantization or *max-abs* quantization is used, both can be represented as:
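An editor's note on the hunk above: the two quantization schemes it describes can be sketched in NumPy as follows. This is an illustrative sketch, not Fluid code; the helper names `minmax_quantize` and `maxabs_quantize` are made up for this example.

```python
import numpy as np

def minmax_quantize(x, a, b, k=8):
    """Min-max quantization: clamp x to [a, b], then round onto n = 2^k levels."""
    n = 2 ** k
    r = np.clip(x, a, b)               # r = min(max(x, a), b)
    s = (b - a) / (n - 1)              # step size between adjacent levels
    return np.rint((r - a) / s).astype(np.int64)

def maxabs_quantize(x, k=8):
    """Max-abs quantization: scale by the maximum absolute value M."""
    n = 2 ** k
    M = np.max(np.abs(x))
    q = np.rint(x / M * (n - 1)).astype(np.int64)
    return q, M

x = np.array([-0.9, -0.1, 0.0, 0.4, 0.9])
q, M = maxabs_quantize(x)              # q = [-255, -28, 0, 113, 255], M = 0.9
x_dq = q / (2 ** 8 - 1) * M            # dequantize: x_dq approximates x
```

Note that max-abs quantization keeps zero exactly representable (q = 0 maps back to 0.0), which is one reason the design prefers it for weights and activations.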
@@ -37,8 +37,13 @@ How to calculate the quantization range (or maximum absolute value) for inferenc
 ### Training Framework
 
 The training framework is shown in the following figure.
+
+<p align="center">
 <img src="quantization_training_framework.png" align="center"/><br/>
 
+Fig 1. Forward and backward in training.
+</p>
+
 #### Forward pass
 
 The forward pass is simulated quantization; see figure (a).
@@ -50,27 +55,32 @@ The forward pass is simulated quantization; see figure (a).
 
 For general matrix-matrix multiplication (GEMM), quantize $X$ and $W$:
 
-$$ Xq = \left \lfloor \frac{X}{Xm} * (n - 1) \right \rceil $$
-$$ Wq = \left \lfloor \frac{W}{Wm} * (n - 1) \right \rceil $$
+$$ X_q = \left \lfloor \frac{X}{X_m} * (n - 1) \right \rceil $$
+$$ W_q = \left \lfloor \frac{W}{W_m} * (n - 1) \right \rceil $$
 
 Do the GEMM:
 
-$$ Y = Xq * Wq $$
+$$ Y = X_q * W_q $$
 
 
 Dequantize $Y$:
 
 $$
-Ydq &= \frac{Y}{(n - 1) * (n - 1)} * Xm * Wm \\
-&= \frac{Xq * Wq}{(n - 1) * (n - 1)} * Xm * Wm \\
-&= \frac{Xq}{n - 1} * Xm * \frac{Wq}{n - 1} * Wm
+\begin{align}
+Y_{dq} &= \frac{Y}{(n - 1) * (n - 1)} * X_m * W_m \\\
+&= \frac{X_q * W_q}{(n - 1) * (n - 1)} * X_m * W_m \\\
+&= (\frac{X_q}{n - 1} * X_m) * (\frac{W_q}{n - 1} * W_m)
+\end{align}
 $$
 
 From these formulas, the dequantization can also be moved before the GEMM: first dequantize $X_q$ and $W_q$, then do the GEMM. The forward workflow in training is equivalent to the following framework.
 
-
-<img src="quantization_forward.png" align="center"/><br/>
+<p align="center">
+<img src="quantization_forward.png" width="300" height="300" /><br/>
+
+Fig 2. Equivalent forward in training.
 
+</p>
 
 We use this equivalent workflow in training. In our design, a quantization transpiler inserts the quantization operator and the de-quantization operator into the Fluid `ProgramDesc`.
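An editor's note on the derivation above: the claim that dequantization can be moved before the GEMM can be checked numerically. The following NumPy sketch (illustrative only, not Fluid code) computes both paths with max-abs quantization and confirms they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2 ** 8                      # 8-bit quantization, n = 256 levels

X = rng.uniform(-1, 1, size=(4, 5))
W = rng.uniform(-1, 1, size=(5, 3))

# Max-abs quantization of both operands.
Xm, Wm = np.max(np.abs(X)), np.max(np.abs(W))
Xq = np.rint(X / Xm * (n - 1))
Wq = np.rint(W / Wm * (n - 1))

# Path 1: GEMM on quantized values, then dequantize the product.
Y_dq = (Xq @ Wq) / ((n - 1) * (n - 1)) * Xm * Wm

# Path 2: dequantize the operands first, then GEMM in float.
Y_dq2 = (Xq / (n - 1) * Xm) @ (Wq / (n - 1) * Wm)

# The two paths agree up to float rounding, and both approximate X @ W.
assert np.allclose(Y_dq, Y_dq2)
```

The second path is what the equivalent forward workflow in Fig 2 uses: it keeps the inserted quantization/de-quantization operators adjacent, which is what lets the transpiler place them around existing operators in the `ProgramDesc`.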

Binary image file changed (30.8 KB).
