Changes to ingest-qonnx #461
Conversation
* Fix batched multiple inputs
* yaml.safe_load instead of yaml.load
  * Use yaml.safe_load in converters __init__.py
* Update `zcu102` and `pynq-z2` `axi-stream` driver
* …443)
  * fix 2 reshape issues: don't reshape streams for flatten and remove final reshape
  * Add a test for a model with Reshape as the final layer
  * swap
  * only remove for io_parallel; warn for both io_parallel and io_stream
  * Co-authored-by: Sioni Summers <[email protected]>
* … relevant test. Use 5000 MNIST samples rather than full dataset for faster testing
* Support softmax over multidimensional tensors
  * Style cleanup
  * Added axis part in keras_to_hls.py
  * Added some extensions to test_softmax.py, but multidimensional softmax is still getting bad performance (i.e. below the threshold set in the assertion)
  * Clean up the softmax test
  * Make sure io_parallel softmax is not used on multi-dim input
  * Co-authored-by: nicologhielmetti <[email protected]>
* …into ingest-qonnx-thesps
My reasoning for going to BatchNormalization was to make things simple, with few special cases, since you need Constant + BN and BN + BN fusion for other reasons anyway: Quant -> BN in all cases, and then make use of the generic optimizations. What the Quant node really became was an annotation and a precision applied to the output (and a corresponding quantizer). It's a holiday here in the US today, so I'm not sure I'll get a chance to look at this carefully until next week.
…nd activation-quant to Activation (7c79cdb to 95ed2e9)
Can you explain the reasoning for the Quant changes a bit better? How would it work in the "Reshape -> Mul -> Sub -> Quant" sequence at the beginning of TFC_2W2A_clean, for example? The old scheme proceeds in the following steps:
The MatMul -> Dense conversion checks the quant annotation of the input to determine the bit size. (Moving the output bitwidth calculation to a separate optimization step is planned but not yet implemented; currently it's all in the MatMul -> Dense optimization.)

The reason I went for Quant -> annotated BatchNormalization was simplicity: there are no special cases, and that's what I liked. The real "Quant" part goes into the output annotation, and it can become a part of any node. The BN is just for the scale and offset, and it should not add any extra operations after the fusing. (We discussed at a meeting how to handle quantized input, and though we did not come to a conclusion, the idea of putting a quantizer at the beginning was generally not favored, so I did not worry about an initial quantizer adding a BN that would not be fused.)
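As a rough illustration of that view (a sketch, not code from the PR, with the exact QONNX rounding conventions and the narrow-range option glossed over): the output of a Quant splits into a low-precision integer code, which only needs to survive as an output precision annotation plus a quantizer, and an affine scale/offset part that a BatchNormalization-like layer can absorb.

```python
import numpy as np

def split_quant(x, scale, zeropt, bits, signed=True):
    """Simplified QONNX-style Quant, split into the two pieces discussed above:
    q is the low-precision integer code (what an output-precision annotation
    describes), y is the real-valued result, which is only an affine function
    of q (what a BatchNormalization-like scale/offset layer can carry)."""
    qmin = -(2 ** (bits - 1)) if signed else 0
    qmax = (2 ** (bits - 1)) - 1 if signed else (2 ** bits) - 1
    q = np.clip(np.round(x / scale + zeropt), qmin, qmax)  # low-precision code
    y = scale * (q - zeropt)                               # affine "BN" part
    return q, y

q, y = split_quant(np.array([-1.3, -0.2, 0.4, 0.9]), scale=0.5, zeropt=0.0, bits=3)
print(q)  # [-3. -0.  1.  2.]
print(y)  # [-1.5 -0.   0.5  1. ]
```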
Essentially, I don't think it's strictly safe to use a layer type for a different purpose than the one it was designed for; we'd introduce a kind of maintenance overhead. For the TFC-2w2a case, this:
becomes:
The weight quantization section changes from
to
I added a CI test with the TFC-2w2a model, so you can see the full HLS project here.
So with my changes there are only two cases: quantized weights (or constants in general) and quantized activations. I think it's sensible to differentiate them anyway because the first is a compile-time operation, while the second is a run-time operation.
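To make that compile-time vs. run-time split concrete, here is a minimal sketch (the helper `fake_quant` and its convention are made up for illustration, not converter code): a Quant sitting on a Constant can be evaluated once during conversion and folded into a new constant, while a Quant on a non-constant tensor has to remain an operation that runs at every inference.

```python
import numpy as np

def fake_quant(x, scale=0.5, bits=2):
    # Toy quantizer with zero offset; the exact convention doesn't matter here.
    qmax = 2 ** (bits - 1) - 1
    return scale * np.clip(np.round(x / scale), -(qmax + 1), qmax)

# Case 1: quantized weights -- a compile-time operation.
# Quant(Constant) collapses into a plain Constant holding the quantized values.
W_float = np.random.randn(4, 3)
W_quantized = fake_quant(W_float)      # evaluated once, stored in the model

# Case 2: quantized activations -- a run-time operation.
# The Quant must stay in the graph (here, a function applied per inference).
def quantized_dense(x):
    return fake_quant(x) @ W_quantized

print(quantized_dense(np.random.randn(4)))
```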
We actually need to handle scale and offset a bit differently to be correct in the end, I think. We should look at an example with a real scale. But, if I've understood properly, a pattern like
Actually, I think the main difference is whether one thinks of a In the examples we have the
By the way, concerning
I always assumed that any scaling a quant does that needs to be undone has to be explicit in the ONNX. The scale and shift are real; it would not be obvious when to scale back otherwise. A
I could see the argument for a special NOP quantization layer when scale is 1 and offset is 0, if it makes merging easier. I wasn't sure whether it simplifies or complicates things, so I didn't do it, but it is worth revisiting. The main question, though, is what the final result of a Quant node in our model should be. I thought of it as an output annotation specifying the precision.
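For what it's worth, when scale is 1 and the offset is 0 the Quant carries nothing but a precision statement, which is the case where treating it as an output annotation (or a NOP layer) is most natural. A two-line check under the same simplified semantics as above:

```python
import numpy as np

x = np.array([-5.7, -0.3, 1.2, 9.4])
bits = 4
# With scale = 1 and zeropt = 0 the Quant reduces to round + clip into the
# 4-bit range, i.e. roughly what declaring a 4-bit output precision does
# (up to the rounding/overflow modes of the chosen fixed-point type).
y = np.clip(np.round(x), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
print(y)  # [-6. -0.  1.  7.]
```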
Yep, my bad, I'd already seen that in both cases it becomes a
It doesn't have to be like that though, a
The idea in the PR is that the quantization of a "run-time tensor" (i.e. not a Constant) is an explicit operation, represented by its own layer.
I think that's right for quantized weights, but not for activations (or rather non-constant tensors). For the scale and zero-point, the idea is that for a quantized weight we need to do this in the compiler by modifying the
So actually this isn't totally true. Recall, for example, the conversations we had with the FINN team about propagating scale factors through a model. The point is to separate the "real value" of the tensor into a part that can be represented with low bit precision and a part that can be represented as a floating point scale factor that can be moved around (that we handle by inserting an

I'm getting started using this code to generate some simple models with a real (!= 1) scale (with a small modification to save the QONNX model). There's another example of a QONNX model with scale != 1 and quantized ReLUs here.
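A small worked example of that decomposition (not taken from the linked notebooks): the "real value" of the tensor is a per-tensor scale times a low-precision integer part, and because the scale is a scalar it commutes with a linear layer, so it can be pushed past a MatMul and handled downstream instead of at the quantizer.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))

scale = 0.25                                        # per-tensor scale factor
x_int = rng.integers(-4, 4, size=4).astype(float)   # low-precision part
x_real = scale * x_int                              # "real value" of the tensor

# The scalar scale commutes with the linear layer, so it can be moved past
# the MatMul and applied to the output instead of the input.
y_direct = W @ x_real
y_moved = scale * (W @ x_int)

print(np.allclose(y_direct, y_moved))  # True
```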
Remember last Friday we were discussing whether it's
(The actual quantization of the MatMul operation is specified upstream in all cases.) In my scheme, the result at the end for the three cases should be:
(If it doesn't go to this, then unless there's a good reason, something should be modified so that it does.) In all cases, though, the
Is your proposal that the 3 options above be:
So if the proposal is to make a
The key question, though, is what the final form of the Quant should be. I still think an annotation and output precision is the way to go, but I could be convinced otherwise.
As for the rescaling, I really don't see how that can be inserted automatically. We should maybe discuss this more as a group. I was asking Nhan about it before, and my understanding after that discussion was that the scaling really is not undone automatically; everything needs to be explicit in the graph.
I put together a small example of how scaling works here. It's just a single
And the scales are:
The idea is that only
Evaluating on some example data:
To get the correct output, we can either do
For completeness, here's the
Hopefully the example shows how all the information is there in the
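In the same spirit, a made-up numeric sketch (not the notebook linked above) of the "we can either do …" choice for a quantized weight with a real scale: the Quant output is the scale times an integer weight tensor, so the scale can either be folded into the stored weights up front or kept aside and applied to the accumulator afterwards; both give the same result.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4)

W_int = rng.integers(-2, 2, size=(3, 4)).astype(float)  # integer weight codes
w_scale = 0.125                                          # per-tensor weight scale

# Option 1: fold the scale into the weights at compile time.
y1 = (w_scale * W_int) @ x

# Option 2: keep integer weights and rescale the output at run time.
y2 = w_scale * (W_int @ x)

print(np.allclose(y1, y2))  # True
```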
I also think this is probably the way; my work here is incomplete in that the
I will have to try to understand it. Let's talk more next week. I should see what FINN does. The example, though, is for quantized weights, which wouldn't create an Activation node.
Based on the discussion yesterday, I will merge this request.
@jmitrevs some updates for you.

Firstly, I've pulled the master branch to bring this up to date.

Secondly, the main thing I've changed is that `Quant` nodes don't get converted to `BatchNormalization` any more.

Now, a `Quant` node with a `Constant` node as its 0th input is replaced with a `Constant`. It's basically the same logic as what you previously did, but instead of the node going through transformations like `Quant` to `BatchNormalization` to `Constant`, it just goes `Quant` to `Constant`. I'm not 100% sure that `scale` and `zeropt` are handled properly, but I haven't changed that behaviour just yet.

A `Quant` node with something that is not a `Constant` node as its 0th input is replaced with an `Activation` (linear). If one of these `Quant` nodes has a `scale` or `zeropt`, an `ApplyAlpha` (aka `BatchNormalization`) is inserted to take care of that. Again, some more verification is needed that we're handling those correctly.

I've also added a `test_qonnx.py` with a test of the `TFC_2w2a` model that works locally, but not yet in CI because I think I messed up the environment. I'll fix that...
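A minimal sketch of those two rules on a toy dict-based graph, just to pin down the shape of the logic; this is not the actual converter, and `get_input_node`, `fake_quantize`, and the attribute names are placeholders.

```python
def convert_quant_node(node, get_input_node, fake_quantize):
    """Toy version of the two cases described above. `get_input_node` and
    `fake_quantize` are assumed helpers, not hls4ml API, and the exact
    scale/zeropt handling is illustrative (the part still being verified)."""
    producer = get_input_node(node, index=0)
    new_nodes = []

    if producer["class"] == "Constant":
        # Quant with a Constant 0th input -> Constant holding quantized values.
        new_nodes.append({
            "class": "Constant",
            "name": node["name"],
            "value": fake_quantize(producer["value"], node["scale"],
                                   node["zeropt"], node["bitwidth"]),
        })
    else:
        # Quant on a non-constant tensor -> linear Activation carrying the precision.
        new_nodes.append({
            "class": "Activation",
            "activation": "linear",
            "name": node["name"],
            "precision_bits": node["bitwidth"],
        })
        # Plus an ApplyAlpha (aka BatchNormalization) if there is a
        # non-trivial scale or zero point to take care of.
        if node["scale"] != 1 or node["zeropt"] != 0:
            new_nodes.append({
                "class": "ApplyAlpha",
                "name": node["name"] + "_alpha",
                "scale": node["scale"],
                "bias": -node["zeropt"] * node["scale"],
            })
    return new_nodes
```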