Parallel CNNs, Pooling & Image Layers for Quartus Backend #561
Conversation
…ling between Winograd and im2col
Because this mainly adds functionality, I think I would generally approve it fairly quickly after:
Rebased and pushed for testing.

Can you look at the conv1 pytest that's failing with a type error?

All tests covered now, with the exception of:
LGTM
Description
Type of change
It is recommended to review this PR commit by commit (rather than as a side-by-side diff), as each commit adds a specific feature and this is a fairly extensive PR. Each commit is self-contained: it can be checked out and the project compiled.
Implementation details
io_parallel

im2col - The number of filters is usually low (less than 16), allowing for a constant latency with respect to the number of filters. The loops traversing the rows and columns of the input image are pipelined, with an initiation interval determined by the reuse factor. A larger reuse factor reduces resource usage and allows for larger architectures, at the expense of latency.

Winograd - The Winograd minimal filtering algorithm relies on a series of input and kernel matrix transformations, replacing convolution with an elementwise product. Kernels can be transformed offline, prior to FPGA inference. The input matrix transformation can be written out explicitly; when done this way, the transformed matrix is obtained through additions and subtractions of elements of the original matrix, so the transformation is implemented in combinational logic, reducing latency. Winograd's algorithm offers the lowest computational complexity for convolution by considering multiple output pixels at once, so the stride over the input image is larger than one. For example, for 3x3 kernels, the loops iterating over the height (H) and width (W) of the input image are invoked H/2 and W/2 times, respectively, compared to im2col, which invokes them H and W times. Each loop iteration does more work, but the instructions within a loop can usually be executed through combinational logic and register reads/writes. Winograd's algorithm has several disadvantages, including:
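As a small illustration of the transform-then-elementwise-product idea, here is a numpy sketch of the 1D Winograd case F(2,3) (two outputs, 3-tap kernel); the matrices are the standard minimal-filtering transforms, not code from this PR:

```python
import numpy as np

# Standard transform matrices for Winograd F(2, 3).
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap convolution over a 4-element input tile,
    using 4 multiplies (the elementwise product) instead of 6."""
    U = G @ g    # kernel transform - can be done offline, before inference
    V = BT @ d   # input transform - additions/subtractions only
    return AT @ (U * V)

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)  # both [-2. -2.]
```

Note that the input transform `BT @ d` contains only ±1 entries, which is why it maps to pure combinational adders/subtractors in hardware.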
Pointwise im2col - Similar to PR Pointwise conv1d/2d resource #471, an optimised im2col implementation for 1x1 kernels is added.
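To see why 1x1 kernels admit a simpler path, note that im2col degenerates for them: each pixel's "patch" is just its channel vector, so the whole convolution collapses to one matrix product with no data duplication. A numpy sketch (illustrative shapes, not the PR's HLS code):

```python
import numpy as np

# 1x1 convolution as a single (H*W, C_in) x (C_in, C_out) matrix product.
H, W, C_in, C_out = 4, 4, 3, 8
x = np.random.rand(H, W, C_in)
w = np.random.rand(C_in, C_out)  # 1x1 kernel weights

y = (x.reshape(H * W, C_in) @ w).reshape(H, W, C_out)

# Reference: explicit per-pixel dot products over the channel axis.
ref = np.einsum('hwc,co->hwo', x, w)
print(np.allclose(y, ref))  # True
```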
This PR introduces the idea of a parallelisation factor, in addition to the previously used reuse factor. The reuse factor controls the initiation interval of the loops traversing the input image: a larger reuse factor increases the initiation interval and latency while reducing resource usage. The parallelisation factor, on the other hand, determines the unroll factor of the loops traversing the input image: a larger parallelisation factor creates multiple copies of the loop body, lowering latency. The outer loop (input height) is only unrolled if the inner loop (input width) is fully unrolled. Using this approach, it is possible to compute a full convolutional layer in 8 clock cycles (at a large resource utilisation). Both values should therefore be tweaked when designing an architecture: for larger inputs and models, the reuse factor should be increased to fit the available device resources, keeping the parallelisation factor at one; for individual layers with a small input (deeper in the network), the parallelisation factor can be increased, allowing faster inference. Below are some results with respect to changing the reuse and parallelisation factors. Both variables are available for both im2col and Winograd.
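The latency trade-off between the two knobs can be sketched with a toy cycle-count model (purely illustrative; this is not hls4ml's actual scheduler, and the hardware pipeline also has fill/drain cycles that are ignored here):

```python
# Toy model: pipelined loops over an H x W input, where the initiation
# interval equals the reuse factor (RF) and the width loop is unrolled
# by the parallelisation factor (PF).
def pipeline_cycles(H, W, rf, pf):
    trips = H * (W // pf)  # unrolling replicates the loop body PF times
    return trips * rf      # each trip starts RF cycles after the previous

# Larger RF -> longer latency, fewer resources; larger PF -> shorter latency.
print(pipeline_cycles(28, 28, rf=1, pf=1))   # 784
print(pipeline_cycles(28, 28, rf=4, pf=1))   # 3136
print(pipeline_cycles(28, 28, rf=1, pf=28))  # 28
```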
Support for Average and Max 1D & 2D Pool layers, as well as Global Pooling, is added. Experiments (see results below) showed that a fully unrolled implementation is optimal: it minimises both resource usage and latency (the reason for this is not entirely clear). Finally, support for Vivado 2D Global Pooling was added for completeness' sake.
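For reference, the window reduction being unrolled is simple; in the fully unrolled hardware version every window comparison happens in parallel combinational logic. A numpy sketch of 2x2 max pooling with stride 2 (illustrative, not the PR's HLS code):

```python
import numpy as np

# 2x2 max pooling, stride 2: group pixels into non-overlapping 2x2 blocks
# and take the maximum of each block.
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.]
                        #  [13. 15.]]
```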
Latency and resource usage
As expected from theory, Winograd has a lower latency as well as lower resource usage when compared to im2col. All tests were targeting an Agilex F14, with 10 data points and full Quartus synthesis. Results for different RF & PF will be added once the scan is complete.
As stated above, a fully unrolled pooling layer is optimal. While a pooling layer has no notion of a reuse factor, increasing the model-wide reuse factor should help reduce resource usage and fit the desired architecture, since the reuse factor also dictates the component's initiation interval.

Tests
Checklist