Bug Fix for Operand Shape Mismatch in BatchNorm Fusion (PyTorch) #1045
Conversation
Hi! Thanks for bringing this to our attention. The channels_last conversion is actually run for pytorch (and only for pytorch, in fact). It's also active in the case of your model. However, you have stumbled upon a case we overlooked, one that can't be addressed by this conversion, since the conversion is applied after the model is created and transposes the tensors in the model. That isn't possible in this case, since we just have a 1D tensor with the wrong size, which can't be fixed in that way. I do think your solution is good. However, I would ask for 3 things: that the PR description follow the template, that the test case be moved from the standalone file into the existing pytests, and that pre-commit be run on the changed files.

Thanks!
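To make the transposition point concrete, here is a toy numpy sketch (the 2-channel, 1024-sample shapes are taken from the test case discussed below; the variable names are made up):

```python
import numpy as np

# A 2D weight tensor can be moved between layouts by transposing:
w_cf = np.random.rand(2, 1024)  # channels_first: (channels, width)
w_cl = w_cf.transpose(1, 0)     # channels_last:  (width, channels)

# A 1D bias built with the wrong length has no axes to permute;
# transposing a 1D array is a no-op, so a transpose pass cannot repair it.
bad_bias = np.zeros(1024)       # should have length 2, one entry per channel
assert bad_bias.transpose().shape == (1024,)
```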
Hi JanFSchulte, thanks for getting back to me so quickly and clarifying the source of my issue. If you could point me towards where specifically the channels_last conversion takes place, I'd greatly appreciate it (something tells me I'll need to know that in the future, haha). That being said, I'm glad the fix is still suitable. As for the PR itself, I made the changes you requested. The original PR has been updated to match the desired format (with my original submission tacked on at the bottom), the test case has been moved from a standalone script to the file you linked, and I ran pre-commit. I also added a RaiseException for unsupported data_formats. Please let me know if there are any other changes you'd like me to make!

A note about the pytests: I noticed that your original pytest for batchnorm takes the model through compilation and compares inference results (which is further than I had pushed the model in the original test file), so I tried to replicate that in my pytest. The model successfully compiles, but aborts when the inference results are compared.
To reiterate: the model gets through compilation without error, and as far as I can tell this inaccuracy seems unrelated to the proposed bug fix. I just wanted to include the erroring steps in the pytest and bring it to your attention for transparency's sake. Happy to make a separate issue for this inaccuracy once the fix is merged, or to keep working on it beforehand (whichever you prefer).
Thanks for addressing my comments! For the channels_last conversion, this is implemented as one of the optimizers that hls4ml runs after the initial parsing of every model; you can see it here in the list of optimizers that are applied: https://github.com/fastmachinelearning/hls4ml/blob/main/hls4ml/model/optimizer/__init__.py#L37. So you will not see an explicit call to it anywhere, but the code in https://github.com/fastmachinelearning/hls4ml/blob/main/hls4ml/model/optimizer/passes/convert_to_channels_last.py is applied to every node in the model graph. As for the other issues, it looks to me like something is going wrong that I would like to understand before merging this, as we can't merge in code with a failing pytest. I will have a look.
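For orientation, passes like the linked converter follow hls4ml's OptimizerPass interface, with a match method deciding where the pass applies and a transform method doing the rewrite. A schematic sketch (ExamplePass and its body are illustrative, not the actual converter):

```python
from hls4ml.model.optimizer import OptimizerPass


class ExamplePass(OptimizerPass):
    """Skeleton of an hls4ml graph-rewriting pass."""

    def match(self, node):
        # Decide per node whether this pass applies.
        return node.get_attr('data_format') is not None

    def transform(self, model, node):
        # Rewrite the node/model here; return True if the graph changed,
        # so the optimizer loop revisits it.
        return False
```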
Hi Jan, thanks again for the additional information about the channels-last conversion and the reuse factor. I have reduced the reuse factor as you recommended, and the pytests are no longer crashing. As for the deeper investigation into the batchnorm layer behavior, please let me know if there's anything I should look for or try in order to help. Thanks again for looking into this!
Hi Jan, I've continued working on my model since we last spoke. I'm encountering similar channels-last-related issues when trying to synthesize other PyTorch layers (AvgPool2d and shortcut connections). In all of these cases the source code seems to assume the model is in channels_last format, but for PyTorch channels_first is the default, and it goes unchanged regardless of whether I set `channels_last_conversion="full"`.

Is there a larger issue with channels-last conversion for pytorch models? I remember you mentioned that the conversion is applied to every node in the model graph after the initial parsing. If you'd like me to make a separate git issue for these layer-specific errors please let me know, and if there is a larger issue you think may be worth looking into, please let me know where. Thanks!
Hi! Thanks for reporting further problems. I was away last week and haven't looked at this in more detail yet; I will do so this week. In general hls4ml does indeed assume that all models are in `channels_last` format internally, which is why the conversion is supposed to transpose PyTorch models on ingestion.
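For reference, this is roughly how the conversion is requested at config-creation time. A hedged sketch: the keyword placement between `config_from_pytorch_model` and `convert_from_pytorch_model` varies across hls4ml versions, and the input shape, backend, and output directory here are assumptions:

```python
import hls4ml

config = hls4ml.utils.config_from_pytorch_model(
    model,
    input_shape=(2, 1024),            # assumed channels_first input (no batch dim)
    channels_last_conversion='full',  # transpose the full model graph
    default_reuse_factor=1,           # a small reuse factor avoided the crash above
)
hls_model = hls4ml.converters.convert_from_pytorch_model(
    model,
    hls_config=config,
    output_dir='hls_prj',
    backend='Vivado',
)
```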
I have opened another PR to fix this issue, as it seemed easier to just merge my changes: #1050. The BatchNorm works fine; the issue was that the pytorch model has to be switched to `eval()` mode so that the BatchNorm parameters are set correctly when we compare with the hls4ml output. I will close here; for your other issues, can you open an issue and provide an example to reproduce? Thanks!
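The switch itself is a one-liner in PyTorch. A minimal example (the toy model mirrors the conv → batchnorm → relu block from the test case):

```python
import torch.nn as nn

model = nn.Sequential(nn.Conv1d(1, 2, kernel_size=3), nn.BatchNorm1d(2), nn.ReLU())
model.eval()  # BatchNorm now uses running_mean/running_var instead of batch statistics
```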
On second thought, this way we are losing track of your contribution to the fix. So I am reopening this so you can incorporate the further changes I made in https://github.com/fastmachinelearning/hls4ml/pull/1050/files, and we can merge your branch instead :)
Hi Jan, thanks so much for working with me on this! I just merged your PR into mine. Please reach out if there's anything else I need to do before this can be merged.
@@ -13,15 +13,24 @@
atol = 5e-3

@pytest.fixture(scope='module')
Looks like I accidentally committed this change. Can you undo it, @sei-rquartiano?
@pytest.mark.parametrize('io_type', ['io_parallel', 'io_stream'])
@pytest.mark.parametrize('backend', ['Vivado', 'Vitis', 'Quartus', 'Catapult'])
Looks like I accidentally committed this change. Can you undo it, @sei-rquartiano?
Sure, no worries! To clarify, the two accidental changes I should revert are the `atol` value and the backend parametrization?
Those are the two changes I made accidentally, yes. Thanks a lot! Once that is done, I think this is ready to merge. And I'm looking forward to your issue on the AvgPool2D so we can make sure the framework works for you.
Done! Thanks again for all your help. I'll look more into what's going on with AvgPool2D and shortcut connections and will make an issue when I have more information (or hopefully a solution, haha).
Bug fix verified to work; the additional tests are effective in ensuring this type of issue will be caught next time.
Looks good! Thanks @sei-rquartiano and @JanFSchulte
Thank you @JanFSchulte and @vloncar for your help! I'm happy the bug fix has been useful and could be merged! @JanFSchulte, thanks for finding the cause of those pytest failures; I'm glad a solution to that was found as well. I'm excited to keep working on this and to make an issue/PR for AvgPool2d soon. I will be out of town next week but will continue working on this right when I return, so you can expect a submission in early-to-mid September. Cheers
Main PR Information:
Description of Change
This is a bug fix to prevent an error during batchnorm fusion of two layers in a PyTorch model (in the test case provided, a Conv1d followed by a BatchNorm1d). The fix adds a condition that checks the 'data_format' attribute of the parent node in order to properly index its `self.get_output_variable().shape` array. There is supposed to be one bias term per output channel of the previous layer, so if the data format is 'channels_first' the shape tuple is accessed at index 0, and if it is 'channels_last' it is accessed at index -1, as illustrated below.
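A toy illustration of the indexing rule (the shapes are assumed from the test case; this is not code from the PR):

```python
import numpy as np

out_cf = np.zeros((2, 1024))  # channels_first: (channels, width)
out_cl = np.zeros((1024, 2))  # channels_last:  (width, channels)

assert out_cf.shape[0] == 2   # channel count lives at index 0
assert out_cl.shape[-1] == 2  # channel count lives at index -1
```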
Type of change

Tests
Test Configuration: test_batchnorm_pytorch.py
Checklist
I have run `pre-commit` on the files I edited or added.

Additional PR Information:
Bug:
When synthesizing a simple CNN model with one convolution->batchnorm->relu block written in PyTorch (see test file above), I get the following error:
Cause:
There should be one bias term per output channel of the convolution layer, which is why `bn_scale.data` and `bn_bias.data` are vectors of length 2. However, `parent_bias.data` is of length 1024, which causes an error when it is multiplied by `bn_scale.data` (a minimal reproduction is sketched after this section). This is because the dimensions of the parent node are assumed to be in channels_last format when indexing `self.get_output_variable().shape[-1]` (-1 referring to the last dimension of the shape output) in model/layers.py.

While I do set `channels_last_conversion="full"` during config creation, it doesn't seem to actually do anything. This is confirmed by the hardcoding of the `layer['data_format'] = 'channels_first'` attribute in converters/convolution.py:14. I was actually able to find a channels-last converter in model/optimizer/passes/convert_to_channels_last.py, but as far as I can tell it isn't called anywhere in the PyTorch conversion process, even if I do set `channels_last_conversion="full"` when making the config. Also, according to the comment on `converters/convolution.py:14`, this isn't changeable for PyTorch anyway. Maybe this is an in-progress hls4ml feature?
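A minimal numpy reproduction of the operand mismatch, with the vector lengths taken from the description above (this is a sketch, not the actual hls4ml error output):

```python
import numpy as np

bn_scale = np.ones(2)         # one scale term per conv output channel
parent_bias = np.zeros(1024)  # wrongly sized: built from the width, not the channels

parent_bias * bn_scale        # ValueError: operands could not be broadcast together
```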
Fix:

I added a condition checking the 'data_format' attribute of the parent node. If it is 'channels_last', the shape of the parent node is indexed at -1, just like before; but if it's 'channels_first' (which, according to the converter hardcoding, it always will be for PyTorch), it is indexed at 0. Now the `add_bias()` method in model/layers.py looks like this:
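In sketch form, reconstructed from the description above (the surrounding attribute plumbing follows hls4ml's model/layers.py, where `add_bias()` is a method of the layer base class, but details may differ from the merged code):

```python
import numpy as np

from hls4ml.model.types import IntegerPrecisionType


def add_bias(self, quantizer=None):
    data = self.get_attr('bias_data', None)
    precision = None
    type_name = None
    if data is None:
        # One bias term per output channel of the parent layer, so pick
        # the channel axis according to the layer's data format.
        data_format = self.get_attr('data_format', 'channels_last')
        if data_format == 'channels_first':
            data = np.zeros(self.get_output_variable().shape[0])
        elif data_format == 'channels_last':
            data = np.zeros(self.get_output_variable().shape[-1])
        else:
            raise RuntimeError(f'Unsupported data_format: {data_format}')
        precision = IntegerPrecisionType(width=1, signed=False)
        type_name = 'bias{index}_t'
        quantizer = None

    self.add_weights_variable(
        name='bias', var_name='b{index}', type_name=type_name,
        precision=precision, data=data, quantizer=quantizer,
    )
```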
Whenever the channels_last converter is finished and called in the PyTorch conversion, I would assume some conditional logic about whether it was actually applied will be required anyway, hence my adding it now. After this fix, the test file runs without error (see below).