[RFC] torchvision performance optimization on CPU

## 🚀 The feature

This RFC aims to improve the performance of torchvision operators on CPU.


## Motivation, pitch

Generally, performance improvements can be made in three ways:
* **channels last memory format support**: as of torch 1.12, the majority of commonly used CV operators support channels last. Enabling channels last support for native kernels in torchvision, such as `RoiAlign` pooling, could be beneficial because: a) `RoiAlign` can be vectorized on NHWC (on NCHW, i.e. the channels first memory format, it can only use scalar logic); b) `Conv2d` can skip the memory format reorders between PyTorch's plain format and mkldnn's blocked formats.
* **parallelization on multicore CPUs**: the current native kernels in torchvision are sequential and cannot utilize all the resources of a multicore CPU.
* **BFloat16 support**: `BFloat16` takes half the memory footprint of `float32`.
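
To make the first point concrete, here is a minimal pure-Python sketch of how the same logical element `(n, c, h, w)` maps to a flat memory offset in the two layouts. The helper names `nchw_offset` and `nhwc_offset` are hypothetical and only illustrate the indexing; they are not torch or torchvision APIs.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in contiguous NCHW storage."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat index of the same logical element in channels last (NHWC) storage."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 4, 3, 5
# Offsets of all channels at one fixed pixel (n=0, h=1, w=2).
nchw = [nchw_offset(0, c, 1, 2, C, H, W) for c in range(C)]
nhwc = [nhwc_offset(0, c, 1, 2, C, H, W) for c in range(C)]
print(nchw)  # consecutive channels are H * W elements apart
print(nhwc)  # consecutive channels are adjacent in memory
```

In NHWC all channels of one pixel sit next to each other, which is what lets a kernel that loops over `channels` use contiguous vector loads.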

The plan is to cover both inference and training optimizations at the same time.

## Affected Operators
The optimization scope will cover the native kernels from [csrc/ops/cpu](https://github.com/pytorch/vision/tree/main/torchvision/csrc/ops/cpu), including:
* roi_align_kernel
* roi_pool_kernel
* ps_roi_align_kernel
* ps_roi_pool_kernel
* nms_kernel
* deform_conv2d_kernel

These operators affect models such as `FasterRCNN`, `MaskRCNN`, etc.

**[Discussion Needed]**: need to sort out the priorities of these kernels.

## API and Behavior Change

Since all the optimizations will be done on the kernel level, no API change will be required.

Users will be able to run models in `channels last` as recommended in the [memory_format_tutorial](https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html):

```python
### convert input and model from NCHW to NHWC
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
```

To run the model in bfloat16, use either explicit data type conversion or AMP:
```python
### explicit data type conversion
input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)

### with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)
```
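
For intuition about what the `bfloat16` conversion does to each value, the sketch below emulates it with the standard library only. It relies on the fact that `bfloat16` keeps the sign, the 8 exponent bits and the top 7 mantissa bits of `float32`. The function names are hypothetical, and NaN handling is left out for brevity.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Round a Python float to bfloat16 and return its 16-bit pattern.

    Uses round-to-nearest-even on the 16 discarded low bits of the float32
    representation. NaN inputs are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


for value in (1.0, 3.14159265, 1e-3):
    b = float32_to_bfloat16_bits(value)
    print(value, hex(b), bfloat16_bits_to_float(b))
```

Each value occupies 2 bytes instead of 4, at the cost of keeping only about 2 to 3 significant decimal digits.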

### Non-Batch Mode Input

Some models take input in non-batch mode, e.g. CHW (N = 1). Such a 3-dimensional tensor cannot currently be converted to channels last in torch:
```python
### when input is a 3-dimensional tensor, the following line raises a runtime error:
input = input.to(memory_format=torch.channels_last)
```
`torch.nn.Conv2d` checks the memory format of `input` and `weight`; if either one is channels last, the convolution will use the channels last path. Therefore, for non-batch mode input, we can convert only the `model` and the channels last path will still be used.

This part requires special attention and validation effort.

## Parallelization on Multi Core CPUs

We propose to follow the same parallelization scheme as torch, i.e. using the wrapper `at::parallel_for`. It can be backed by **OpenMP** or **TBB** depending on the build option (OpenMP is used by default).

This [commit](https://github.com/pytorch/vision/commit/1fa27d0f07d3451384ec698d4cd7ee5f4575982b) is an example of parallelizing `roi_align` over the first dimension of the input tensor, i.e. `n_rois`, with the help of `at::parallel_for`.

```C
at::parallel_for(0, n_rois, 1, [&](int64_t begin, int64_t end) {
  // Each thread processes the RoIs in [begin, end).
  for (int64_t n = begin; n < end; n++) {
    int64_t index_n = n * channels * pooled_width * pooled_height;

    const T* offset_rois = rois + n * 5;
    int roi_batch_ind = offset_rois[0];

    /* rest of the function is identical to original kernel */
  }
});
```
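
The partitioning idea behind this can be sketched in pure Python. The `parallel_for` below is a hypothetical, simplified stand-in, not the ATen implementation: it splits `[begin, end)` into disjoint chunks and hands each chunk to a worker. The real chunking policy in `at::parallel_for` may differ, and Python threads will not give a real speedup here; the point is only that each RoI writes to its own output slice, so chunks never conflict.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(begin, end, grain_size, fn, num_threads=4):
    """Call fn(chunk_begin, chunk_end) on disjoint chunks covering [begin, end).

    Simplified model: at most num_threads chunks, and no more chunks than
    the range can fill with roughly grain_size items each.
    """
    total = end - begin
    if total <= 0:
        return
    grain_size = max(grain_size, 1)
    max_chunks = max(1, total // grain_size)
    num_chunks = min(num_threads, max_chunks)
    chunk = -(-total // num_chunks)  # ceiling division
    bounds = [(b, min(b + chunk, end)) for b in range(begin, end, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list() waits for completion and re-raises any worker exception.
        list(pool.map(lambda be: fn(*be), bounds))


n_rois, channels = 10, 3
output = [0] * (n_rois * channels)


def roi_kernel(begin, end):
    # RoI n only touches output[n * channels : (n + 1) * channels].
    for n in range(begin, end):
        for c in range(channels):
            output[n * channels + c] = n * 100 + c


parallel_for(0, n_rois, 1, roi_kernel)
print(output[:6])
```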

## Vectorization on x86 CPUs

Vectorization can be done in multiple ways, namely:

### Auto Vectorization

Let the compiler vectorize automatically with `#pragma omp simd`. This [commit](https://github.com/pytorch/vision/commit/e50cd530a79f32ade7243ff68ba8d3adbbd6274d) adds channels last support for `roi_align` and vectorizes over the last dimension, i.e. `channels`:

```C
  for (int iy = 0; iy < roi_bin_grid_h; iy++) {
    for (int ix = 0; ix < roi_bin_grid_w; ix++) {
      detail::PreCalc<T> pc = pre_calc[pre_calc_index];
      const T* in1 = input + pc.pos1 * channels;
      const T* in2 = input + pc.pos2 * channels;
      const T* in3 = input + pc.pos3 * channels;
      const T* in4 = input + pc.pos4 * channels;

      #pragma omp simd
      for (int c = 0; c < channels; c++) {
        out[c] += pc.w1 * in1[c] + pc.w2 * in2[c] + pc.w3 * in3[c] + pc.w4 * in4[c];
      }
      pre_calc_index += 1;
    }
  }
```
Note that on NCHW, this kernel cannot be vectorized.

* **pros**: easy to implement.
* **cons**: the compiler cannot properly vectorize `BFloat16`, which means that if we choose this approach, `RoiAlign` won't have BFloat16 support and will be put on the AMP fallback list.
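
As a reading aid for the inner loop above, the following pure-Python sketch shows where `pos1..pos4` and `w1..w4` come from (standard bilinear interpolation) and how they are applied to a channels last buffer. It assumes the sample point lies strictly inside the feature map, so the border clamping done by the real kernel is omitted; `precalc` and `accumulate_sample` are hypothetical names chosen to mirror `PreCalc`.

```python
import math


def precalc(y, x, width):
    """Pixel positions and bilinear weights of the four neighbours of (y, x).

    Assumes (y, x) is non-negative and strictly inside the feature map,
    so no border clamping is needed.
    """
    y_low, x_low = math.floor(y), math.floor(x)
    y_high, x_high = y_low + 1, x_low + 1
    ly, lx = y - y_low, x - x_low
    hy, hx = 1.0 - ly, 1.0 - lx
    pos = (y_low * width + x_low, y_low * width + x_high,
           y_high * width + x_low, y_high * width + x_high)
    weights = (hy * hx, hy * lx, ly * hx, ly * lx)
    return pos, weights


def accumulate_sample(out, feature, channels, pos, weights):
    """out[c] += sum over k of w_k * feature[pos_k * channels + c]."""
    for p, w in zip(pos, weights):
        base = p * channels  # channels of one pixel are contiguous in NHWC
        for c in range(channels):
            out[c] += w * feature[base + c]


height, width, channels = 3, 3, 2
# Channels last feature map: channel 0 holds the pixel index, channel 1 its double.
feature = []
for p in range(height * width):
    feature += [float(p), 2.0 * p]

out = [0.0] * channels
pos, weights = precalc(0.5, 0.5, width)
accumulate_sample(out, feature, channels, pos, weights)
print(pos, weights, out)
```

The loop over `c` reads `channels` consecutive elements from each of the four source pixels, which is the access pattern `#pragma omp simd` can vectorize.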

### Manual Vectorization

Vectorize the code via the `at::vec::Vectorized<>` struct, which is compiled to different assembly depending on the architecture: **avx2/avx512** or **neon**.

```C
  using Vec = at::vec::Vectorized<T>;
  for (int iy = 0; iy < roi_bin_grid_h; iy++) {
    for (int ix = 0; ix < roi_bin_grid_w; ix++) {
      detail::PreCalc<T> pc = pre_calc[pre_calc_index];
      const T* in1 = input + pc.pos1 * channels;
      const T* in2 = input + pc.pos2 * channels;
      const T* in3 = input + pc.pos3 * channels;
      const T* in4 = input + pc.pos4 * channels;

      int64_t d = 0;
      for (; d < channels - (channels % Vec::size()); d += Vec::size()) {
        // Accumulate into `out`, matching the `+=` of the auto-vectorized loop.
        Vec out_vec = Vec::loadu(out + d) +
            Vec(pc.w1) * Vec::loadu(in1 + d) +
            Vec(pc.w2) * Vec::loadu(in2 + d) +
            Vec(pc.w3) * Vec::loadu(in3 + d) +
            Vec(pc.w4) * Vec::loadu(in4 + d);
        out_vec.store(out + d);
      }
      /* handle the remainder here ... */
      pre_calc_index += 1;
    }
  }
```

* **pros**: supports `BFloat16` vectorization; cross-platform support.
* **cons**: more effort is needed to map the build options from torch to torchvision.
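
The main-loop-plus-remainder structure of the manual approach can be modelled in plain Python. This is only a sketch of the control flow: `VEC_SIZE` is an assumed lane count, list slices stand in for vector registers, and the scalar tail shown here is one possible way to handle the remainder, not necessarily what the final kernel would do.

```python
VEC_SIZE = 8  # assumed lane count, e.g. 8 float32 lanes in a 256-bit register


def weighted_sum_blocked(out, inputs, weights, channels):
    """Accumulate the sum over k of weights[k] * inputs[k][c] into out[c].

    The main loop handles VEC_SIZE channels per step (one vector
    load/multiply/add/store); the tail loop handles leftover channels.
    """
    d = 0
    limit = channels - (channels % VEC_SIZE)
    while d < limit:
        block = out[d:d + VEC_SIZE]  # "load" the running sums
        for w, src in zip(weights, inputs):
            block = [acc + w * v for acc, v in zip(block, src[d:d + VEC_SIZE])]
        out[d:d + VEC_SIZE] = block  # "store" them back
        d += VEC_SIZE
    for c in range(d, channels):  # scalar remainder
        for w, src in zip(weights, inputs):
            out[c] += w * src[c]


channels = 19  # deliberately not a multiple of VEC_SIZE
inputs = [[float(k + c) for c in range(channels)] for k in range(4)]
weights = [0.4, 0.3, 0.2, 0.1]
out = [0.0] * channels
weighted_sum_blocked(out, inputs, weights, channels)
print(out[:4], out[-3:])
```

With 19 channels, the main loop covers channels 0 to 15 in two steps and the tail covers the last three.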

From a performance point of view, the two approaches should give similar results.

**[Discussion Needed]**: need to decide which way to go.

## Experiment Results

A demo shows the performance improvement from `channels last` support on the `fast_rcnn_R_50_FPN_1x` model from `detectron2`:

```bash
export DETECTRON2_DATASETS=../datasets
python benchmark.py --config-file ../configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml --task eval
```

* **torch**: 1.13.0a0
* **torchvision**: 0.14.0a0
* **detectron2**: 0.6
* **cpu**: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

Time for 300 iterations (s) | NCHW (before) | NCHW (after) | NHWC (after) | Speedup (NCHW before / NHWC after)
-- | -- | -- | -- | --
single core (C=1) | 638.21 | 639.01 | 503.04 | 126.87%
single socket (C=20) | 212.10 | 141.06 | 102.54 | 206.84%
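
The speedup column can be reproduced from the timings in the table. The short script below does the arithmetic and, as an additional derived figure that is not in the original table, also reports the ratio for the NCHW (after) column.

```python
timings = {
    # configuration: (NCHW before, NCHW after, NHWC after), seconds per 300 iterations
    "single core (C=1)": (638.21, 639.01, 503.04),
    "single socket (C=20)": (212.10, 141.06, 102.54),
}

speedups = {}
for name, (nchw_before, nchw_after, nhwc_after) in timings.items():
    speedups[name] = {
        "nchw_after": nchw_before / nchw_after,
        "nhwc_after": nchw_before / nhwc_after,
    }
    print(f"{name}: NCHW (after) {speedups[name]['nchw_after']:.2f}x, "
          f"NHWC (after) {speedups[name]['nhwc_after']:.2f}x")
```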

### Breakdown

Here is the performance breakdown of NCHW (before) vs. NHWC (after):

* NCHW (before)
```
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     aten::conv2d         0.32%     676.582ms        41.71%       88.386s       4.830ms         18300
                aten::convolution         0.05%     109.323ms        41.67%       88.300s       4.825ms         18300
               aten::_convolution         0.09%     183.509ms        41.61%       88.168s       4.818ms         18300
         aten::mkldnn_convolution        41.48%       87.890s        41.54%       88.018s       4.810ms         18300
           torchvision::roi_align        38.33%       81.228s        38.99%       82.621s      68.850ms          1200
                     aten::linear         0.00%       7.534ms         5.33%       11.291s       9.410ms          1200
                      aten::addmm         5.11%       10.821s         5.32%       11.272s       9.393ms          1200
                 aten::batch_norm         0.03%      64.973ms         4.51%        9.552s     600.729us         15900
     aten::_batch_norm_impl_index         0.05%     110.204ms         4.48%        9.502s     597.630us         15900
          aten::native_batch_norm         4.40%        9.314s         4.43%        9.396s     590.974us         15900
                       aten::add_         2.06%        4.372s         2.06%        4.372s     910.892us          4800
                      aten::relu_         0.04%      74.794ms         1.76%        3.733s     253.958us         14700
                 aten::clamp_min_         1.73%        3.669s         1.73%        3.669s     249.608us         14700
```

* NHWC (after)
```
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                     aten::conv2d         0.95%     970.076ms        61.52%       63.082s       3.447ms         18300
                aten::convolution         0.12%     121.816ms        61.45%       63.008s       3.443ms         18300
               aten::_convolution         0.14%     140.402ms        61.33%       62.890s       3.437ms         18300
         aten::mkldnn_convolution        60.99%       62.543s        61.08%       62.634s       3.423ms         18300
                 aten::batch_norm         0.06%      57.762ms        10.56%       10.826s     680.901us         15900
     aten::_batch_norm_impl_index         0.12%     126.712ms        10.51%       10.775s     677.660us         15900
          aten::native_batch_norm        10.29%       10.547s        10.38%       10.648s     669.700us         15900
                     aten::linear         0.01%       6.772ms         8.98%        9.205s       7.671ms          1200
                      aten::addmm         8.77%        8.994s         8.96%        9.185s       7.654ms          1200
                       aten::add_         4.60%        4.718s         4.60%        4.718s     982.928us          4800
                      aten::relu_         0.07%      69.159ms         3.80%        3.900s     265.290us         14700
                 aten::clamp_min_         3.75%        3.841s         3.75%        3.841s     261.263us         14700
           torchvision::roi_align         1.61%        1.655s         2.25%        2.304s       1.920ms          1200
```

We can see that the performance improvement primarily comes from:
* `torchvision::roi_align` time is reduced from 82.6s to 2.3s, due to parallelization and vectorization.
* `aten::conv2d` time is reduced from 88.4s to 63.1s: with channels last, the mkldnn reorders on activations are avoided.
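
For a quick sense of scale, the per-operator gains can be computed from the "CPU total" column of the two profiler tables. The values below are copied from those tables; the script only does the arithmetic.

```python
# CPU total time in seconds, copied from the two profiler tables.
before = {"aten::conv2d": 88.386, "torchvision::roi_align": 82.621}
after = {"aten::conv2d": 63.082, "torchvision::roi_align": 2.304}

saved = {op: before[op] - after[op] for op in before}
ratios = {op: before[op] / after[op] for op in before}

for op in before:
    print(f"{op}: saved {saved[op]:.1f}s, {ratios[op]:.1f}x faster")
```

Most of the time saved comes from `roi_align`, with the convolution gain contributing roughly a quarter of the total.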

## Additional

**[Discussion Needed]**: need to decide the details of performance benchmarking, such as:

* models: use `benchmark.py` from detectron2 or torch-bench?
* configs: single core and multi-core? Which CPU types?

**[Discussion Needed]**: test cases: we will add new test cases to the corresponding modules in [vision/test](https://github.com/pytorch/vision/tree/main/test) when making pull requests. What else is needed?

