
Parallel CNNs, Pooling & Image Layers for Quartus Backend #561


Merged

Merged 10 commits into fastmachinelearning:main on Sep 20, 2022

Conversation

@bo3z (Contributor) commented Jun 2, 2022

Description

📝 Convolutional, Pooling & Image Layers for Quartus Backend

  • 9219a0e Adds support for Conv1D & Conv2D Layers using im2col, in a similar manner to Vivado.
  • For fs = 3 and fs = 1, optimized convolutions are implemented using Winograd's minimal filtering algorithm (f7365ba) and pointwise im2col (29bd026), respectively
  • Introduces the idea of a parallelisation factor, allowing for fully unrolled convolution executing in as few as 8 clock cycles
  • aa66247 Support for Max & Avg Pooling as well as Global Pooling layers
  • 8cb9f21 Support for Zero Padding, Transpose and Upsampling layers
  • 3952854 Corresponding PyTests and HLS resource/latency analysis for all the above layers
  • aa66247 Adds support for Vivado 2D Global Pooling (new feature on Vivado)

Type of change

  • New feature (non-breaking change which adds functionality)
    It is recommended to review this PR commit by commit (rather than as a side-by-side diff), as each commit adds a specific feature and this is a fairly extensive PR. Each commit is self-contained and can be checked out and the project compiled.

Implementation details

  1. As a base, convolutional layers are implemented in a similar way to Vivado, using the im2col algorithm. im2col transforms the input matrix into a larger patch matrix suitable for matrix multiplication with the kernel. This way, the computationally more complex convolution is replaced with a dense matrix multiplication (a minimal reference sketch of im2col is given after this list). Loops traversing the number of filters and channels are fully unrolled, as the total number of filters in io_parallel is usually low (less than 16), allowing for a constant latency with respect to the number of filters. The loops traversing the rows and columns of the input image are pipelined with an initiation interval determined by the reuse factor. A larger reuse factor will reduce resource usage and allow for larger architectures, at the expense of latency.
  2. An optimized convolution for 3x3 kernels is implemented using the Winograd minimal filtering algorithm. For a more detailed description, see:
  • Lavin & Gray (2015). Fast Algorithms for Convolutional Neural Networks
  • Xygkis et al. (2018). Efficient Winograd-based Convolution Kernel Implementation on Edge Devices

The Winograd minimal filtering algorithm relies on a series of input and kernel matrix transformations, replacing convolution with an elementwise product (a sketch of the transforms is given after the list below). Kernels can be transformed offline, prior to FPGA inference. The input matrix transformation can be explicitly written out - when done in such a way, the transformed matrix is obtained through additions and subtractions of elements of the original matrix, so the transformation is implemented in combinational logic, reducing latency. Winograd's algorithm offers the lowest computational complexity for convolution, by considering multiple output pixels at once. This way, the stride over the input image is larger than one: for 3x3 kernels, the loops iterating over the height (H) and width (W) of the input image are invoked H/2 and W/2 times, respectively, compared to im2col, which invokes them H and W times. Each loop iteration has a higher latency, but the instructions within an iteration can usually be executed through combinational logic and register reads/writes. Winograd's algorithm has several disadvantages, including:

  • Cannot be used for stride != 1 without significant modifications and latency penalties.
  • Different implementations (read: matrix transformations) are needed for different kernel sizes. This PR implements Winograd's algorithm for the most commonly used kernel sizes, 3x3 and 3x1.
  • Numerical instability - Winograd's algorithm is built on top of Lagrange interpolation, which is known for its poor numerical properties. Therefore, as the kernel size increases, Winograd's algorithm incurs a non-negligible error. The error is also noticeable with aggressive quantization, whether through QAT or PTQ. Both problems have been researched, but addressing them would come at a latency cost, due to additional transformations and more complex transformation matrices. For 3x3 kernels, the error is negligible and the algorithm can be used without any loss in accuracy.
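For illustration, below is a minimal numpy sketch of the F(2x2, 3x3) Winograd transforms referenced above, following Lavin & Gray (2015); it models the arithmetic only and is not the actual HLS C++ implementation (function and variable names are illustrative):

```python
import numpy as np

# Winograd F(2x2, 3x3) transform matrices (Lavin & Gray, 2015)
G  = np.array([[1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0]])
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]])
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]])

def winograd_tile_2x2_3x3(tile, kernel):
    """Compute a 2x2 output tile from a 4x4 input tile and a 3x3 kernel."""
    U = G @ kernel @ G.T        # kernel transform: done offline, prior to FPGA inference
    V = Bt @ tile @ Bt.T        # input transform: additions/subtractions only
    return At @ (U * V) @ At.T  # elementwise product, then output transform

# Consecutive 4x4 input tiles overlap by two pixels, so the image loops
# advance with stride 2 - H/2 x W/2 iterations instead of H x W for im2col.
```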
  3. Pointwise im2col - similar to PR Pointwise conv1d/2d resource #471, an optimised im2col implementation for 1x1 kernels is added.

  4. This PR introduces the idea of a parallelisation factor, in addition to the previously used reuse factor. The reuse factor controls the initiation interval of the loops traversing the input image: a large reuse factor will increase the initiation interval and latency, and reduce resource usage. The parallelisation factor, on the other hand, determines the unroll factor of the loops traversing the input image: a larger parallelisation factor will create multiple copies of the loop body, lowering latency. The outer loop (input height) is only unrolled if the inner loop (input width) is fully unrolled. Using this approach, it is possible to compute a full convolutional layer in 8 clock cycles (at a large resource utilisation). Both values should therefore be tweaked when designing an architecture: for larger inputs and models, the reuse factor should be increased to fit the available device resources, while keeping the parallelisation factor at one; for individual layers with small inputs (deeper in the network), the parallelisation factor can be increased, allowing faster inference (an example configuration is sketched below). Results with respect to changing the reuse and parallelisation factor are given in the next section. Both variables are available for both im2col and Winograd.
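As an example of how the two factors interact, here is a sketch of a per-layer hls4ml configuration; the layer names ('conv2d', 'conv2d_2') and the factor values are hypothetical and depend on the model being converted:

```python
import hls4ml

# 'model' is an existing Keras model; layer names below are hypothetical
config = hls4ml.utils.config_from_keras_model(model, granularity='name')

# Large early layer: increase the reuse factor to fit the device,
# keeping the parallelisation factor at one
config['LayerName']['conv2d']['ReuseFactor'] = 16
config['LayerName']['conv2d']['ParallelizationFactor'] = 1

# Small layer deeper in the network: unroll the image loops for lower latency
config['LayerName']['conv2d_2']['ReuseFactor'] = 1
config['LayerName']['conv2d_2']['ParallelizationFactor'] = 4

hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, backend='Quartus', io_type='io_parallel'
)
```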

  5. Support for Average and Max 1D & 2D Pooling layers, as well as Global Pooling. Through experiments (see results below), it was observed that a fully unrolled implementation is optimal - it minimises both resource usage and latency, although the reason for this is not fully understood. Finally, support for Vivado 2D Global Pooling was added, for completeness' sake.
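To make point 1 concrete, below is a minimal numpy reference sketch of convolution via im2col; it models the arithmetic only, not the loop pipelining, unrolling, or the actual HLS C++ code:

```python
import numpy as np

def im2col_conv2d(x, w):
    """Valid-padding, stride-1 convolution via im2col.

    x: (H, W, C) input image; w: (fh, fw, C, n_filt) kernel.
    """
    H, W, C = x.shape
    fh, fw, _, n_filt = w.shape
    out_h, out_w = H - fh + 1, W - fw + 1

    # Patch matrix: one flattened fh*fw*C window per output pixel.
    # On the FPGA these two loops are pipelined with II = reuse factor
    # and unrolled by the parallelisation factor.
    patches = np.empty((out_h * out_w, fh * fw * C))
    for i in range(out_h):
        for j in range(out_w):
            patches[i * out_w + j] = x[i:i + fh, j:j + fw, :].ravel()

    # Convolution becomes a dense matrix multiplication with the flattened
    # kernel; for 1x1 kernels the patch matrix is just the input reshaped,
    # which is the pointwise optimisation of point 3.
    return (patches @ w.reshape(-1, n_filt)).reshape(out_h, out_w, n_filt)
```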

Latency and resource usage

As expected from theory, Winograd has a lower latency as well as lower resource usage when compared to im2col. All tests targeted an Agilex F14, with 10 data points and full Quartus synthesis. Results for different RF & PF will be added once the scan is complete.

[Screenshot from 2022-08-15 13-15-41: latency and resource comparison of im2col and Winograd convolution]

As stated above, a fully unrolled pooling layer is optimal. While a pooling layer has no notion of a reuse factor, increasing the overall reuse factor should help reduce resource usage and fit the desired architecture, as the reuse factor also dictates the component initiation interval.
[Screenshot from 2022-08-15 15-04-36: pooling layer latency and resource usage]

Tests

  • test_cnn_mnist.py - a new unit test, testing the accuracy of a Keras CNN network and its hls4ml counterpart in classifying MNIST digits
  • test_cnn_mnist_qkeras.py - renamed from test_cnn_mnist.py and included Quartus backend
  • test_conv1d.py - included Quartus backend
  • test_global_pooling.py - renamed from test_global_pooling1d.py; includes 2D Global Pooling and adds Quartus as a backend, in addition to Vivado
  • test_keras_api.py - adds a basic threshold check, verifying that the output of a Conv/Pooling layer from hls4ml is approximately equal to the Keras output
  • test_pointwiseconv.py - included Quartus backend
  • test_upsampling.py - included Quartus backend
  • test_zeropadding.py - included Quartus backend
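For reference, the accuracy checks in these tests broadly follow the pattern sketched below; the model, shapes and tolerance are illustrative assumptions, not the exact test code:

```python
import numpy as np
import pytest
from tensorflow import keras
import hls4ml

@pytest.mark.parametrize('backend', ['Vivado', 'Quartus'])
def test_conv2d_matches_keras(backend):
    # Small hypothetical model, for illustration only
    model = keras.Sequential([keras.layers.Conv2D(4, (3, 3), input_shape=(8, 8, 1))])
    x = np.random.rand(10, 8, 8, 1)

    config = hls4ml.utils.config_from_keras_model(model, granularity='name')
    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, backend=backend,
        output_dir=f'hls4ml_prj_{backend}',
    )
    hls_model.compile()

    # hls4ml output should approximately match Keras, up to fixed-point
    # quantisation error (the tolerance here is an illustrative choice)
    np.testing.assert_allclose(
        hls_model.predict(x).flatten(), model.predict(x).flatten(),
        rtol=0, atol=0.05,
    )
```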

Checklist

  • I have read the guidelines for contributing.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have added tests that prove my fix is effective or that my feature works.

@bo3z bo3z requested a review from vloncar June 2, 2022 15:14
@bo3z bo3z marked this pull request as draft July 8, 2022 10:38
@bo3z bo3z changed the title from "Parallel CNNs & Pooling on Quartus" to "Parallel CNNs, Pooling & Image Layers for Quartus Backend" Aug 11, 2022
@bo3z bo3z marked this pull request as ready for review August 15, 2022 14:15
@bo3z bo3z requested a review from jmitrevs August 31, 2022 14:45
@jmitrevs (Contributor)

Because this mainly adds functionality, I think I would generally approve it fairly quickly after:

  1. It is rebased to not have merge conflicts.
  2. The tests all succeed or fail for known/explainable reasons.

@bo3z (Contributor, Author) commented Sep 19, 2022

> Because this mainly adds functionality, I think I would generally approve it fairly quickly after:
>
>   1. It is rebased to not have merge conflicts.
>   2. The tests all succeed or fail for known/explainable reasons.

Rebased and pushed for testing.

@jmitrevs (Contributor)

Can you look at the conv1 pytest that's failing with a type error?

@bo3z (Contributor, Author) commented Sep 19, 2022

> Can you look at the conv1 pytest that's failing with a type error?

All tests covered now, with the exception of:

  1. Softsign LUT optimization - this failure is caused by another bug, addressed in PR Quartus Streaming Softsign (PR #585 contd.) #655
  2. A recurring QKeras test failure due to the random seed (this has been present for a few PRs now)

@vloncar (Contributor) left a comment

LGTM

@vloncar vloncar merged commit 40ae7f9 into fastmachinelearning:main Sep 20, 2022
calad0i pushed a commit to calad0i/hls4ml that referenced this pull request Jul 1, 2023
Parallel CNNs, Pooling & Image Layers for Quartus Backend