
Parallel CNNs, Pooling & Image Layers for Quartus Backend #561


Merged

Merged 10 commits into fastmachinelearning:main on Sep 20, 2022

Conversation

@bo3z (Contributor) commented Jun 2, 2022

Description

📝 Convolutional, Pooling & Image Layers for Quartus Backend

  • 9219a0e Adds support for Conv1D & Conv2D Layers using im2col, in a similar manner to Vivado.
  • For fs = 3 and fs = 1, optimized convolutions are implemented using Winograd's minimal filtering algorithm (f7365ba) and pointwise im2col (29bd026), respectively
  • Introduces the idea of a parallelisation factor, allowing for fully unrolled convolution executing in as few as 8 clock cycles
  • aa66247 Support for Max & Avg Pooling as well as Global Pooling layers
  • 8cb9f21 Support for Zero Padding, Transpose and Upsampling layers
  • 3952854 Corresponding PyTests and HLS resource/latency analysis for all the above layers
  • aa66247 Adds support for Vivado 2D Global Pooling (new feature on Vivado)

Type of change

  • New feature (non-breaking change which adds functionality)
    It is recommended to review this PR commit by commit (rather than as a side-by-side diff), as each commit adds a specific feature and this is a fairly extensive PR. Each commit is self-contained and can be checked out and the project compiled.

Implementation details

  1. As a base, convolutional layers are implemented in a similar way to Vivado, using the im2col algorithm. im2col transforms the input matrix into a larger patch matrix suitable for matrix multiplication with the kernel. This way, the computationally more complex convolution is replaced with a dense matrix multiplication (a minimal reference sketch of im2col is given after this list). Loops traversing the number of filters and channels are fully unrolled, as the total number of filters in io_parallel is usually low (less than 16), allowing for a constant latency with respect to the number of filters. The loops traversing the rows and columns of the input image are pipelined with an initiation interval determined by the reuse factor. A larger reuse factor will reduce resource usage and allow for larger architectures, at the expense of latency.
  2. An optimized convolution for 3x3 kernels is implemented using the Winograd minimal filtering algorithm. For a more detailed description, see:
  • Lavin & Gray (2015). Fast Algorithms for Convolutional Neural Networks
  • Xygkis et al. (2018). Efficient Winograd-based Convolution Kernel Implementation on Edge Devices

The Winograd minimal filtering algorithm relies on a series of input and kernel matrix transformations, replacing convolution with an elementwise product (a sketch of the transforms is given after the list below). Kernels can be transformed offline, prior to FPGA inference. The input matrix transformation can be explicitly written out - when done in such a way, the transformed matrix is obtained through additions and subtractions of elements of the original matrix, so the transformation is implemented in combinational logic, reducing latency. Winograd's algorithm offers the lowest computational complexity for convolution, by considering multiple output pixels at once. This way, the stride over the input image is larger than one: for 3x3 kernels, the loops iterating over the height (H) and width (W) of the input image are invoked H/2 and W/2 times, respectively, compared to im2col, which invokes them H and W times. Each loop iteration has a higher latency, but the instructions within an iteration can usually be executed through combinational logic and register reads/writes. Winograd's algorithm has several disadvantages, including:

  • Cannot be used for stride != 1 without significant modifications and latency penalties.
  • Different implementations (read: matrix transformations) are needed for different kernel sizes. This PR implements Winograd's algorithm for the most commonly used kernel sizes, 3x3 and 3x1.
  • Numerical instability - Winograd's algorithm is built on top of Lagrange interpolation, which is known for its poor numerical properties. Therefore, as the kernel size increases, Winograd's algorithm incurs a non-negligible error. The error is also noticeable with aggressive quantization, whether through QAT or PTQ. Both problems have been researched, but addressing them would come at a latency cost, due to additional transformations and more complex transformation matrices. For 3x3 kernels, the error is negligible and the algorithm can be used without any loss in accuracy.
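For illustration, below is a minimal numpy sketch of the F(2x2, 3x3) Winograd transforms referenced above, following Lavin & Gray (2015); it models the arithmetic only and is not the actual HLS C++ implementation (function and variable names are illustrative):

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray, 2015)
G  = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def winograd_tile_2x2_3x3(tile, kernel):
    """Compute a 2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    U = G @ kernel @ G.T        # kernel transform: done offline, prior to FPGA inference
    V = Bt @ tile @ Bt.T        # input transform: additions/subtractions only
    return At @ (U * V) @ At.T  # elementwise product, then output transform

# Consecutive 4x4 input tiles overlap by two pixels, so the image loops
# advance with stride 2 - H/2 x W/2 iterations instead of H x W for im2col.
```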
  3. Pointwise im2col - similar to PR Pointwise conv1d/2d resource #471, an optimised im2col implementation for 1x1 kernels is added.

  4. This PR introduces the idea of a parallelisation factor, in addition to the previously used reuse factor. The reuse factor controls the initiation interval of the loops traversing the input image: a large reuse factor will increase the initiation interval and latency, and reduce resource usage. The parallelisation factor, on the other hand, determines the unroll factor of the loops traversing the input image: a larger parallelisation factor will create multiple copies of the loop body, lowering latency. The outer loop (input height) is only unrolled if the inner loop (input width) is fully unrolled. Using this approach, it is possible to compute a full convolutional layer in 8 clock cycles (at a large resource utilisation). Both values should therefore be tweaked when designing an architecture: for larger inputs and models, the reuse factor should be increased to fit the available device resources, while keeping the parallelisation factor at one; for individual layers with small inputs (deeper in the network), the parallelisation factor can be increased, allowing faster inference (an example configuration is sketched below). Results with respect to changing the reuse and parallelisation factor are given in the next section. Both variables are available for both im2col and Winograd.
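As an example of how the two factors interact, here is a sketch of a per-layer hls4ml configuration; the layer names ('conv2d', 'conv2d_2') and the factor values are hypothetical and depend on the model being converted:

```python
import hls4ml

# 'model' is an existing Keras model; layer names below are hypothetical
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Large early layer: increase the reuse factor to fit the device,
# keeping the parallelisation factor at one
config['LayerName']['conv2d']['ReuseFactor'] = 16
config['LayerName']['conv2d']['ParallelizationFactor'] = 1

# Small layer deeper in the network: unroll the image loops for lower latency
config['LayerName']['conv2d_2']['ReuseFactor'] = 1
config['LayerName']['conv2d_2']['ParallelizationFactor'] = 4

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Quartus', io_type='io_parallel'
)
```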

  5. Support for Average and Max 1D & 2D Pooling layers, as well as Global Pooling. Through experiments (see results below), it was observed that a fully unrolled implementation is optimal - it minimises both resource usage and latency, although the reason for this is not fully understood. Finally, support for Vivado 2D Global Pooling was added, for completeness' sake.
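To make point 1 concrete, below is a minimal numpy reference sketch of convolution via im2col; it models the arithmetic only, not the loop pipelining, unrolling, or the actual HLS C++ code:

```python
import numpy as np

def im2col_conv2d(x, w):
    """Valid-padding, stride-1 convolution via im2col.

    x: (H, W, C) input image; w: (fh, fw, C, n_filt) kernel.
    """
    H, W, C = x.shape
    fh, fw, _, n_filt = w.shape
    out_h, out_w = H - fh + 1, W - fw + 1

    # Patch matrix: one flattened fh*fw*C window per output pixel.
    # On the FPGA these two loops are pipelined with II = reuse factor
    # and unrolled by the parallelisation factor.
    patches = np.empty((out_h * out_w, fh * fw * C))
    for i in range(out_h):
        for j in range(out_w):
            patches[i * out_w + j] = x[i:i + fh, j:j + fw, :].ravel()

    # Convolution becomes a dense matrix multiplication with the flattened
    # kernel; for 1x1 kernels the patch matrix is just the input reshaped,
    # which is the pointwise optimisation of point 3.
    return (patches @ w.reshape(-1, n_filt)).reshape(out_h, out_w, n_filt)
```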

Latency and resource usage

As expected from theory, Winograd has a lower latency as well as lower resource usage when compared to im2col. All tests targeted an Agilex F14, with 10 data points and full Quartus synthesis. Results for different RF & PF will be added once the scan is complete.

[Screenshot from 2022-08-15 13-15-41: latency and resource comparison of im2col and Winograd convolution]

As stated above, a fully unrolled pooling layer is optimal. While a pooling layer has no notion of a reuse factor, increasing the overall reuse factor should help reduce resource usage and fit the desired architecture, as the reuse factor also dictates the component initiation interval.
[Screenshot from 2022-08-15 15-04-36: pooling layer latency and resource usage]

Tests

  • test_cnn_mnist.py - a new unit test, testing the accuracy of a Keras CNN network and its hls4ml counterpart in classifying MNIST digits
  • test_cnn_mnist_qkeras.py - renamed from test_cnn_mnist.py and included Quartus backend
  • test_conv1d.py - included Quartus backend
  • test_global_pooling.py - renamed from test_global_pooling1d.py; includes 2D Global Pooling and adds Quartus as a backend, in addition to Vivado
  • test_keras_api.py - adds a basic threshold check, verifying that the output of a Conv/Pooling layer from hls4ml is approximately equal to the Keras output
  • test_pointwiseconv.py - included Quartus backend
  • test_upsampling.py - included Quartus backend
  • test_zeropadding.py - included Quartus backend
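For reference, the accuracy checks in these tests broadly follow the pattern sketched below; the model, shapes and tolerance are illustrative assumptions, not the exact test code:

```python
import numpy as np
import pytest
from tensorflow import keras
import hls4ml

@pytest.mark.parametrize('backend', ['Vivado', 'Quartus'])
def test_conv2d_matches_keras(backend):
    # Small hypothetical model, for illustration only
    model = keras.Sequential([keras.layers.Conv2D(4, (3, 3), input_shape=(8, 8, 1))])
    x = np.random.rand(10, 8, 8, 1)

    config = hls4ml.utils.config_from_keras_model(model, granularity='name')
    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, backend=backend,
        output_dir=f'hls4ml_prj_{backend}',
    )
    hls_model.compile()

    # hls4ml output should approximately match Keras, up to fixed-point
    # quantisation error (the tolerance here is an illustrative choice)
    np.testing.assert_allclose(
        hls_model.predict(x).flatten(), model.predict(x).flatten(),
        rtol=0, atol=0.05,
    )
```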

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

@bo3z bo3z requested a review from vloncar June 2, 2022 15:14
@bo3z bo3z marked this pull request as draft July 8, 2022 10:38
@bo3z bo3z changed the title from "Parallel CNNs & Pooling on Quartus" to "Parallel CNNs, Pooling & Image Layers for Quartus Backend" Aug 11, 2022
@bo3z bo3z marked this pull request as ready for review August 15, 2022 14:15
@bo3z bo3z requested a review from jmitrevs August 31, 2022 14:45
@jmitrevs (Contributor)

Because this mainly adds functionality, I think I would generally approve it fairly quickly after:

  1. It is rebased to not have merge conflicts.
  2. The tests all succeed or fail for known/explainable reasons.

@bo3z (Contributor, Author) commented Sep 19, 2022

> Because this mainly adds functionality, I think I would generally approve it fairly quickly after:
>
>   1. It is rebased to not have merge conflicts.
>   2. The tests all succeed or fail for known/explainable reasons.

Rebased and pushed for testing.

@jmitrevs (Contributor)

Can you look at the conv1 pytest that's failing with a type error?

@bo3z (Contributor, Author) commented Sep 19, 2022

> Can you look at the conv1 pytest that's failing with a type error?

All tests covered now, with the exception of:

  1. Softsign LUT optimization - this failure is caused by another bug, addressed in PR Quartus Streaming Softsign (PR #585 contd.) #655
  2. A recurring QKeras test failure due to the random seed (this has been present for a few PRs now)

@vloncar (Contributor) left a comment

LGTM

@vloncar vloncar merged commit 40ae7f9 into fastmachinelearning:main Sep 20, 2022
calad0i pushed a commit to calad0i/hls4ml that referenced this pull request Jul 1, 2023
Parallel CNNs, Pooling & Image Layers for Quartus Backend