Changes to ingest-qonnx #461

Merged
merged 14 commits into ingest-qonnx from ingest-qonnx-thesps on Dec 2, 2021

Conversation

@thesps (Contributor) commented Nov 25, 2021

@jmitrevs some updates for you.

Firstly, I've pulled master branch to bring this up to date.

Secondly, the main thing I've changed is that Quant nodes don't get converted to BatchNormalization any more.

Now, a Quant node whose 0th input is a Constant node is replaced with a Constant. It's basically the same logic as what you did previously, but instead of the node going through transformations like Quant to BatchNormalization to Constant, it just goes Quant to Constant. I'm not 100% sure that scale and zeropt are handled properly, but I haven't changed that behaviour just yet.

A Quant node whose 0th input is not a Constant node is replaced with an Activation (linear). If one of these Quant nodes has a scale or zeropt, an ApplyAlpha (aka BatchNormalization) is inserted to take care of that. Again, some more verification is needed that we're handling those correctly.
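To illustrate the two cases, here's a minimal, self-contained sketch. This is schematic only: the `Node` class and `quantize` helper are stand-ins, not the actual hls4ml optimizer code or IR.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                        # e.g. 'Quant', 'Constant', 'Activation', 'ApplyAlpha'
    inputs: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)

def quantize(values, bits, scale, zeropt):
    # Placeholder only -- stands in for whatever the real scale/zeropt/bitwidth handling ends up being.
    return values

def rewrite_quant(node):
    """Rewrite one Quant node following the scheme in this PR (schematic)."""
    assert node.op == 'Quant'
    data_in = node.inputs[0]
    bits, scale, zeropt = node.attrs['bits'], node.attrs['scale'], node.attrs['zeropt']

    if data_in.op == 'Constant':
        # Quantized weights: fold into a new Constant at compile time.
        return Node('Constant',
                    attrs={'value': quantize(data_in.attrs['value'], bits, scale, zeropt),
                           'quantizer_bits': bits})

    # Quantized activation: explicit linear Activation carrying the quantizer...
    act = Node('Activation', inputs=[data_in],
               attrs={'activation': 'linear', 'quantizer_bits': bits})
    if scale != 1 or zeropt != 0:
        # ...plus an inserted ApplyAlpha (scale-and-shift) to take care of scale/zeropt.
        act = Node('ApplyAlpha', inputs=[act], attrs={'scale': scale, 'bias': zeropt})
    return act

# e.g. a weight quantizer: Constant -> Quant collapses to a single Constant
w = Node('Constant', attrs={'value': [0.5, -0.25]})
q = Node('Quant', inputs=[w], attrs={'bits': 4, 'scale': 1, 'zeropt': 0})
print(rewrite_quant(q).op)   # -> Constant
```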

I've also added a test_qonnx.py with a test of the TFC_2w2a model that works locally, but not yet in CI because I think I messed up the environment. I'll fix that...

jmduarte and others added 13 commits October 11, 2021 16:09
* yaml.safe_load instead of yaml.load

* Use yaml.safe_load in converters __init__.py
* Update `zcu102` and `pynq-z2` `axi-stream` driver
…443)

* fix 2 reshape issues: don't reshape streams for flatten and remove final reshape

* Add a test for a model with Reshape as the final layer

* swap

* only remove for io_parallel; warn for both io_parallel and io_stream

Co-authored-by: Sioni Summers <[email protected]>
… relevant test. Use 5000 MNIST samples rather than full dataset for faster testing
* Support softmax over multidimensional tensors

* Style cleanup

* Added axis part in keras_to_hls.py

* Added some extensions to test_softmax.py but multidimensional softmax is still getting bad performances (i.e. below the one set in the assertion)

* Clean up the softmax test

* Make sure io_parallel softmax is not used on multi-dim input

Co-authored-by: nicologhielmetti <[email protected]>
@jmitrevs (Contributor)

My reasoning for going to BatchNormalization was to make things simple, with few special cases, since you need to have a Constant + BN and BN + BN fusion regardless for other reasons. Quant -> BN in all cases and then make use of generic optimizations. What the Quant node really became was an annotation and a precision applied to the output (and a corresponding quantizer). It's a holiday here in the US today so I am not sure I'll get a chance to look at this carefully until next week.

@thesps force-pushed the ingest-qonnx-thesps branch from 7c79cdb to 95ed2e9 on November 26, 2021
@jmitrevs (Contributor)

Can you explain the reasoning for the Quant changes a bit better? How would it work in the "Reshape -> Mul -> Sub -> Quant" sequence at the beginning of TFC_2W2A_clean, for example? The old scheme proceeds through the following steps:

  1. Reshape -> Mul -> Sub -> Quant
  2. Reshape -> BN -> BN -> BN(w/ quant annotation)
  3. Reshape -> BN(w/ quant annotation)

The MatMul -> Dense optimization checks for a quant annotation of the input to determine the bit size. (Moving the output bitwidth calculation to a separate optimization step is planned but not yet implemented; currently it's all in the MatMul -> Dense optimization.)

The reason I went for Quant -> annotated BatchNormalization was simplicity. There are no special cases. That's what I liked. The real "Quant" part goes into the output annotation, and it can become a part of any node. BN is just for the scale and offset, and it should not add any extra operations after the fusing. (We discussed at a meeting how to handle quantized input and though we did not come to a conclusion, generally the idea of putting a quantizer at the beginning was not favored, so I did not worry about an initial quantizer adding a BN that would not be fused.)
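(The BN -> BN collapse in steps 2 -> 3 is just the composition of two affine scale-and-shift operations; a tiny numpy check, with made-up numbers:)

```python
import numpy as np

# Two scale-and-shift (BatchNormalization-like) layers back to back:
#   (x * s1 + b1) * s2 + b2  ==  x * (s1 * s2) + (b1 * s2 + b2)
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))
s1, b1 = rng.normal(size=4), rng.normal(size=4)
s2, b2 = rng.normal(size=4), rng.normal(size=4)

assert np.allclose((x * s1 + b1) * s2 + b2,
                   x * (s1 * s2) + (b1 * s2 + b2))
```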

@thesps (Contributor, Author) commented Nov 26, 2021

Essentially, I don't think it's strictly safe to use a layer type for a different purpose than the one it was designed for. We'd introduce a kind of maintenance overhead to the BatchNormalization layer to remember that it needs to be used for Quant nodes too (for example when designing optimizers), which I argue is unexpected. So BatchNormalization layers should be expected to scale-and-shift, and nothing else.

The BatchNormalization fusion works quite neatly in the TFC-2w2a example because there is always a BatchNormalization before a Quant (not counting the weight quantizers for the moment). In the general case there might not be a BatchNormalization layer, so for example the pattern MatMul -> Quant would transform to MatMul -> BatchNormalization (including a multiplication by 1 and an addition of 0). It's neater to go MatMul -> Activation (linear, quantized) to explicitly perform the Quant operation and nothing else.

For the TFC-2w2a case, this:

Reshape -> Mul -> Sub -> Quant
Reshape -> BN -> BN -> BN(w/ quant annotation)
Reshape -> BN(w/ quant annotation)

becomes:

Reshape -> Mul -> Sub -> Quant
Reshape -> BN -> BN -> Activation
Reshape -> BN -> Activation

The weight quantization section changes from

Constant -> Quant -> MatMul
Constant -> BN -> MatMul
BN -> MatMul
...

to

Constant -> Quant -> MatMul
Constant -> MatMul
...

I added a CI test with the TFC-2w2a model, so you can see the full HLS project here.

The reason I went for Quant -> annotated BatchNormalization was simplicity. There are no special cases.

So with my changes there are only two cases: quantized weights (or constants in general) and quantized activations. I think it's sensible to differentiate them anyway because the first is a compile-time operation, while the second is a run-time operation.

BN is just for the scale and offset, and it should not add any extra operations after the fusing.

We actually need to handle scale and offset a bit differently to be correct in the end, I think. We should look at an example with a real scale. But if I've understood properly, a pattern like Constant -> Quant -> MatMul (i.e. weight quantization) with a scale != 1 should eventually become, for example, Dense -> ScaleAndShift (BatchNormalization): the weights of the MatMul need to be scaled into the range representable by the number of bits specified in the Quant, and then the scales need to be 'reinserted' afterwards for correctness. So the scale needs to be handled a bit differently than just multiplying the Constant and then dropping it anyway.
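As a rough numpy sketch of what I mean (the names and numbers here are made up for illustration, not taken from the code):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1, 3))
w = rng.normal(size=(3, 4))               # float weights held by the Constant
s = np.abs(w).max(axis=0) / 7.0           # per-channel scale for a signed 4-bit range
w_int = np.round(w / s)                   # what the low-bitwidth Dense would store

dense_out = X @ w_int                     # 4-bit MatMul/Dense
rescaled = dense_out * s                  # the inserted ScaleAndShift (ApplyAlpha)
# Same result as using the dequantized weights directly:
assert np.allclose(rescaled, X @ (w_int * s))
```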

@jmitrevs (Contributor) commented Nov 26, 2021

Actually, Constant -> BN becomes an annotated constant in my case, not a BN, so the final result of Constant -> Quant is just a Constant in both cases.

I think the main difference is whether one thinks of a Quant as becoming an annotation that can be applied to any node (Constant, BN, Dense, ...) or as a special Activation node that we need to keep around. I treat Quant as an annotation to be added to a node.

In the examples we have, the Quant node is always at the inputs of the MatMul or Conv, not the output. This determines the quantization of the inputs to the MatMul or Conv, so you can set the bit widths of the operations. The form input -> Activation -> Dense seems a bit strange. But as an input quantization it makes sense, and then you can derive an output quantization by propagating, and annotate the Dense with that. Alternately, a Quant following the MatMul or Conv can explicitly quantize the results of the calculation, but it doesn't quantize the actual calculation.
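(For example, a simple worst-case way to propagate an output bit width through a Dense layer; this is just an illustration, not something implemented here:)

```python
import math

def dense_output_bits(n_in, input_bits, weight_bits):
    # Each product needs input_bits + weight_bits bits; summing n_in of them
    # adds at most ceil(log2(n_in)) bits of growth.
    return input_bits + weight_bits + math.ceil(math.log2(n_in))

print(dense_output_bits(784, 2, 2))   # e.g. a TFC-style 2-bit layer -> 14
```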

@jmitrevs (Contributor) commented Nov 26, 2021

By the way, concerning

Dense -> ScaleAndShift (BatchNormalization) because the weights of the MatMul need to be scaled within the range representable by the number of bits specified in the Quant, then the scales need to be 'reinserted' afterwards for correctness.

I always assumed that any scaling a Quant does that needs to be undone has to be explicit in the ONNX. The scale and shift are real. It would not be obvious when to scale back otherwise. A Quant node is local, taking inputs and producing modified outputs.

@jmitrevs (Contributor)

I could see the argument for a special NOP quantization layer if scale is 1 and offset is 0 if it makes merging easier, though. I wasn't sure if it simplifies or complicates things, so I didn't do it, but it is worth revisiting.

The main thing, though, is what is the final result of a quant node in our model? I thought of it as an output annotation specifying the precision.

@thesps (Contributor, Author) commented Nov 26, 2021

Actually, Constant -> BN becomes an annotated constant in my case, not a BN, so the final result of Constant -> Quant is just a Constant in both cases.

Yep, my bad, I'd already seen that in both cases it becomes a Constant.

In the examples we have the Quant node is always at the inputs of the MatMul or Conv, not the output.

It doesn't have to be like that, though; a Quant node can go anywhere. It just happens that the examples do Layer -> BatchNorm -> quantized Activation (linear), but other patterns are possible. If the quantized activation were a quantized ReLU, for example, the ONNX graph could look like Dense (e.g.) -> BatchNormalization -> ReLU -> Quant, and then you can't profit from the BatchNormalization merging anyway.

I could see the argument for a special NOP quantization layer if scale is 1 and offset is 0 if it makes merging easier, though

The idea in the PR is that the quantization of a "run-time tensor" (i.e. not a Constant) is an explicit operation, represented by its own layer.

I treat Quant as an annotation to be added to a node.

I think that's right for quantized weights, but not for activations (or rather non-constant-tensors).

For the scale and zero-point, the idea is that for a quantized-weight we need to do this in the compiler by modifying the Constant, and this in the FPGA. So the scale needs to be propagated as an attribute of the weights. I'm not doing that yet in this PR either, but that's how it needs to be handled.

A Quant node is local, taking inputs and producing modified outputs.

So actually this isn't totally true. Recall for example the conversations that we had with the FINN team about propagating scale factors through a model. The point is to separate the "real value" of the tensor into a part that can be represented with low bit precision, and a part that can be represented as a floating point scale factor that can be moved around (that we handle by inserting an ApplyAlpha aka BatchNormalization layer).

I'm getting started using this code to generate some simple models with a real (!= 1) scale (with a small modification to save the QONNX model).

There's another example of a QONNX model with scale != 1 and quantized ReLUs here.

@jmitrevs (Contributor)

Remember last Friday we were discussing whether it's MatMul->Quant->ReLU or MatMul->ReLU->Quant, and we decided that it could even be MatMul->Quant->ReLU->Quant. They mean different things:

  1. MatMul->Quant->ReLU: quantize the output of MatMul and then do ReLU
  2. MatMul->ReLU->Quant: do ReLU and quantize its output
  3. MatMul->Quant->ReLU->Quant: quantize the output of MatMul and then do ReLU, and quantize again.

(The actual quantization of the MatMul operation is specified upstream in all cases.) In my scheme, the result at the end for the three cases should be:

  1. MatMul (w/ annotated output)->ReLU: quantize the output of MatMul and then do ReLU
  2. MatMul->ReLU (w/ annotated output): do ReLU and quantize its output
  3. MatMul (w/ annotated output)->ReLU (w/ annotated output): quantize the output of MatMul and then do ReLU, and quantize again.

(If it doesn't end up like this, then unless there's a good reason, something should be modified so that it does.) In all cases, though, the Quant becomes an annotation, not a special operation (unless explicit scaling or shifting is required), and the annotation basically determines the quantization of the output. Our model requires a precision for the output in all cases, so it seemed natural to me to apply it there.

Is your proposal that the 3 options above be:

  1. MatMul->Linear->ReLU: quantize the output of MatMul and then do ReLU
  2. MatMul->ReLU->Linear: do ReLU and quantize its output
  3. MatMul->Linear->ReLU->Linear: quantize the output of MatMul and then do ReLU, and quantize again.
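(Purely to make the contrast concrete, here's a toy description of case 1 under each scheme; these dicts are only illustrative, not the hls4ml internal representation, and the precision shown is made up:)

```python
# (a) Quant becomes an annotation on the producing node's output:
annotated = [
    {'op': 'Dense', 'output_precision': 'ap_fixed<8,4>'},   # example precision only
    {'op': 'ReLU'},
]

# (b) Quant becomes an explicit linear Activation node carrying the quantizer:
explicit = [
    {'op': 'Dense'},
    {'op': 'Activation', 'activation': 'linear', 'quantizer_bits': 8},
    {'op': 'ReLU'},
]
```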

@jmitrevs (Contributor)

So if the proposal is to make a Quant a Linear, preceded only if necessary by a BN, I think that would be good. I don't think you necessarily need the Const special case, since I think the optimizations should handle it, though it's not a big deal.

The key question, though, is what should be the final form of the Quant. I still think an annotation and output precision is the way to go, but I could be convinced otherwise.

@jmitrevs (Contributor)

As for the rescaling, I really don't see how that can be inserted automatically. We should maybe discuss this more as a group. I was asking Nhan about it before, and my understanding from that discussion was that the scaling really is not undone automatically. Everything needs to be explicit in the graph.

@thesps (Contributor, Author) commented Nov 27, 2021

I put together a small example of how scaling works here. It's just a single MatMul with a (4 bit) Quant on the weights. Those weights are:

weights:
 [[-0.42833856  0.2461826   0.78714716 -0.7732045 ]
 [ 0.2447649  -0.86163914 -0.11244959 -0.44183114]
 [-0.06119122  0.4923652   0.11244959  0.55228895]]

And the scales are:

scales: [[0.06119122 0.1230913  0.11244959 0.11045779]]

The idea is that only weights / scales are integers:

weights / scales:
 [[-7.  2.  7. -7.]
 [ 4. -7. -1. -4.]
 [-1.  4.  1.  5.]]

Evaluating on some example data:

X: [[ 0.4732726  -0.66137505 -0.6119138 ]]
y_qonnx:            [[-0.3271585   0.385093    0.37809706 -0.41167364]]

To get the correct output, we can either do np.dot(X, w) or np.dot(X, w/s) * s. The point is that since w is float and only w/s is 4-bit integers, the weights of the Dense in the HLSModel should be w/s, and then we need to do the * s in an inserted layer (ApplyAlpha / BatchNormalization, or some better name) after the Dense.

np.dot(X, w/s) * s: [[-0.32715854  0.385093    0.3780971  -0.4116736 ]]
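(As a runnable check of the numbers above:)

```python
import numpy as np

w = np.array([[-0.42833856,  0.2461826 ,  0.78714716, -0.7732045 ],
              [ 0.2447649 , -0.86163914, -0.11244959, -0.44183114],
              [-0.06119122,  0.4923652 ,  0.11244959,  0.55228895]])
s = np.array([0.06119122, 0.1230913, 0.11244959, 0.11045779])
X = np.array([[0.4732726, -0.66137505, -0.6119138]])

w_int = w / s                               # the 4-bit integer weights the Dense should hold
assert np.allclose(w_int, np.round(w_int))  # they really are integers
assert np.allclose(X @ w, (X @ w_int) * s)  # Dense(w/s) followed by *s == Dense(w)
```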

For completeness, here's the hls4ml output (which doesn't work yet in either this ingest-qonnx-thesps branch or ingest-qonnx).

y_hls4ml:           [-5.3447266  3.1308594  3.3583984 -3.7216797]

Everything needs to be explicit in the graph

Hopefully the example shows how all the information is there in the Quant node; we just need to 'factorize' which operations happen where, in order to have both low-bitwidth Dense, Conv, etc. layers and correct results by using the scale factors.

I still think an annotation and output precision is the way to go, but I could be convinced otherwise.

I also think this is probably the way to go; my work here is incomplete in that the 'Activation' I'm inserting should get merged somewhere else later. But with the scale-factor complication, I think these activation quantizers need to be handled differently from weight quantizers, and later in the flow.

@jmitrevs (Contributor)

I will have to try to understand it. Let's talk more next week. I should see what FINN does. The example, though, is for quantized weights, which wouldn't create an Activation node.

@jmitrevs (Contributor) commented Dec 2, 2021

Based on the discussion yesterday I will merge this request.

@jmitrevs merged commit 2bf3afe into ingest-qonnx on Dec 2, 2021
@jmduarte deleted the ingest-qonnx-thesps branch on November 2, 2022