🚀 The feature
This RFC targets improving the performance of torchvision operators on CPU.
Motivation, pitch
Generally, performance improvements can be made in three ways:
- channels last memory format support: as of torch 1.12, the majority of commonly used CV operators support channels last. Enabling channels last for native torchvision kernels such as RoiAlign pooling could be beneficial because: a) RoiAlign can be vectorized on NHWC (on NCHW, i.e. the channels first memory format, it can only use scalar logic); b) Conv2d can avoid memory format reorders between PyTorch's plain format and mkldnn's blocked formats.
- parallelization on multicore CPUs: the current native kernels in torchvision are sequential, so they cannot utilize all the resources of a multicore CPU.
- BFloat16 support: BFloat16 takes half the memory footprint of float32.
The plan is to cover both inference and training optimizations at the same time.
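To make the vectorization argument in the first bullet concrete, here is a minimal sketch in plain Python (no torch) of how the same element is addressed in the two layouts. The helper names `nchw_offset` and `nhwc_offset` and the example shape are illustrative assumptions, not torchvision code.

```python
def nchw_offset(n, c, h, w, C, H, W):
    """Flat offset of element (n, c, h, w) in a contiguous NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w, C, H, W):
    """Flat offset of the same element in a channels last (NHWC) buffer."""
    return ((n * H + h) * W + w) * C + c


# Illustrative shape: 1 image, 4 channels, 2 x 3 pixels.
C, H, W = 4, 2, 3

# Offsets of every channel at the single pixel (h=1, w=2).
nchw = [nchw_offset(0, c, 1, 2, C, H, W) for c in range(C)]
nhwc = [nhwc_offset(0, c, 1, 2, C, H, W) for c in range(C)]

print("NCHW offsets:", nchw)  # consecutive channels are H * W elements apart
print("NHWC offsets:", nhwc)  # consecutive channels are adjacent in memory
```

In NHWC all channels of one pixel sit next to each other, so a loop over channels reads contiguous memory and can be mapped onto SIMD loads. In NCHW the same loop strides by H * W elements per step.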
Affected Operators
The optimization scope will cover the native kernels from csrc/ops/cpu, including:
- roi_align_kernel
- roi_pool_kernel
- ps_roi_align_kernel
- ps_roi_pool_kernel
- nms_kernel
- deform_conv2d_kernel
These operators will affect models such as FasterRCNN, MaskRCNN, etc.
[Discussion Needed]: need to sort out the priorities of these kernels.
API and Behavior Change
Since all the optimizations will be done at the kernel level, no API change will be required.
Users will be able to run models in channels last as recommended in memory_format_tutorial:
### convert input and model from NCHW to NHWC
input = input.to(memory_format=torch.channels_last)
model = model.to(memory_format=torch.channels_last)
To run the model in bfloat16, use either explicit data type conversion or AMP:
### explicit data type conversion
input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)
### with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)
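As background for the bfloat16 path, the sketch below shows in plain Python why a bfloat16 value takes half the storage of a float32: it keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits, that is, the upper 16 bits of the float32 pattern. The two helper names are hypothetical, and the conversion truncates for simplicity, whereas real implementations usually round to nearest even.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x, truncating (no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def bfloat16_bits_to_float32(bits16):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


x = 3.14159
bits = float32_to_bfloat16_bits(x)
approx = bfloat16_bits_to_float32(bits)

# The 16-bit pattern keeps the float32 exponent range but far fewer mantissa bits.
print(hex(bits), approx)
```

The round trip loses precision but not range, which is why bfloat16 is typically used without loss scaling.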
Non-Batch Mode Input
Some models take their input in non-batch mode, e.g. CHW (N = 1). Such an input cannot currently be converted to channels last in torch:
### when input is a 3-dimensional tensor, the following line raises a runtime error:
input = input.to(memory_format=torch.channels_last)
torch.nn.Conv2d checks the memory format of both input and weight; if either one is channels last, the convolution uses the channels last path. Therefore, for non-batch mode input, it is enough to convert only the model, and the channels last path will still be used.
This part requires special attention and validation effort.
Parallelization on Multi Core CPUs
We propose to follow the same parallelization scheme as torch, i.e. using the wrapper at::parallel_for. It can be backed by OpenMP or TBB depending on the build option (OpenMP is used by default).
This commit is an example of parallelizing roi_align over the first dimension of the input tensor, i.e. n_rois, with the help of at::parallel_for.
at::parallel_for(0, n_rois, 1, [&](int64_t begin, int64_t end) {
  // each task handles the RoIs in [begin, end)
  for (int64_t n = begin; n < end; n++) {
    int index_n = n * channels * pooled_width * pooled_height;
    const T* offset_rois = rois + n * 5;
    int roi_batch_ind = offset_rois[0];
    /* rest of the function is identical to the original kernel */
  }
});
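The partitioning idea behind this can be sketched in plain Python. The `parallel_for` helper below is a hypothetical stand-in written for illustration, not the ATen implementation: it splits the range into chunks of at least `grain_size` items and hands each chunk to a worker. The `num_threads` parameter and the chunking rule are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_for(begin, end, grain_size, fn, num_threads=4):
    """Split [begin, end) into chunks and call fn(chunk_begin, chunk_end) on each.

    Hypothetical helper for illustration only; it is not the ATen implementation.
    """
    if end <= begin:
        return
    total = end - begin
    # Each chunk holds at least grain_size items; aim for at most num_threads chunks.
    chunk = max(grain_size, -(-total // num_threads))  # ceil division
    ranges = [(s, min(s + chunk, end)) for s in range(begin, end, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # list(...) waits for completion and re-raises any worker exception.
        list(pool.map(lambda r: fn(*r), ranges))


n_rois = 10
output = [0] * n_rois


def kernel(chunk_begin, chunk_end):
    # Each RoI index n is written by exactly one chunk, so no locking is needed.
    for n in range(chunk_begin, chunk_end):
        output[n] = n * n  # stand-in for the per-RoI pooling work


parallel_for(0, n_rois, 1, kernel)
print(output)
```

Python threads will not speed up this toy workload. The point is only that each RoI writes a disjoint slice of the output, which is what makes the first dimension a natural one to parallelize over.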
Vectorization on x86 CPUs
Vectorization can be done in multiple ways, namely:
Auto Vectorization
Let the compiler vectorize automatically with #pragma omp simd. This commit adds channels last support for roi_align and vectorizes over the last dimension, i.e. channels:
for (int iy = 0; iy < roi_bin_grid_h; iy++) {
  for (int ix = 0; ix < roi_bin_grid_w; ix++) {
    detail::PreCalc<T> pc = pre_calc[pre_calc_index];
    const T* in1 = input + pc.pos1 * channels;
    const T* in2 = input + pc.pos2 * channels;
    const T* in3 = input + pc.pos3 * channels;
    const T* in4 = input + pc.pos4 * channels;
    #pragma omp simd
    for (int c = 0; c < channels; c++) {
      out[c] += pc.w1 * in1[c] + pc.w2 * in2[c] + pc.w3 * in3[c] + pc.w4 * in4[c];
    }
    pre_calc_index += 1;
  }
}
Note that on NCHW, this kernel cannot be vectorized.
- pros: easy to implement;
- cons: BFloat16 cannot be vectorized properly by the compiler, which means that with this approach RoiAlign won't have BFloat16 support and will be put on the AMP fallback list.
Manual Vectorization
Vectorize the code via the at::vec::Vectorized<> struct, which compiles to different assembly depending on the architecture: avx2/avx512 or neon.
using Vec = at::vec::Vectorized<T>;
for (int iy = 0; iy < roi_bin_grid_h; iy++) {
  for (int ix = 0; ix < roi_bin_grid_w; ix++) {
    detail::PreCalc<T> pc = pre_calc[pre_calc_index];
    const T* in1 = input + pc.pos1 * channels;
    const T* in2 = input + pc.pos2 * channels;
    const T* in3 = input + pc.pos3 * channels;
    const T* in4 = input + pc.pos4 * channels;
    int64_t d = 0;
    for (; d < channels - (channels % Vec::size()); d += Vec::size()) {
      // accumulate into out, matching the `+=` of the scalar version
      Vec out_vec =
          Vec::loadu(out + d) +
          Vec(pc.w1) * Vec::loadu(in1 + d) +
          Vec(pc.w2) * Vec::loadu(in2 + d) +
          Vec(pc.w3) * Vec::loadu(in3 + d) +
          Vec(pc.w4) * Vec::loadu(in4 + d);
      out_vec.store(out + d);
    }
    /* handle the remainder here ... */
    pre_calc_index += 1;
  }
}
- pros: supports BFloat16 vectorization; cross-platform support.
- cons: more effort will be needed to map the build options from torch to torchvision.
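The manual vectorization snippet above has a main loop followed by a remainder. That structure can be sketched in plain Python as follows, where a list slice of `VEC_SIZE` channels stands in for one SIMD register. `VEC_SIZE`, the function names and the test data are illustrative assumptions; the blocked version is checked against the scalar one.

```python
VEC_SIZE = 8  # assumed lane count, e.g. 8 float32 lanes in a 256-bit register


def bilinear_term(ins, w, c):
    """Weighted sum of the four neighbouring samples for channel c."""
    return w[0] * ins[0][c] + w[1] * ins[1][c] + w[2] * ins[2][c] + w[3] * ins[3][c]


def accumulate_scalar(out, ins, w):
    """Reference: accumulate one channel at a time."""
    for c in range(len(out)):
        out[c] += bilinear_term(ins, w, c)


def accumulate_blocked(out, ins, w):
    """Same computation in VEC_SIZE-wide blocks followed by a scalar tail."""
    channels = len(out)
    main_end = channels - (channels % VEC_SIZE)
    for d in range(0, main_end, VEC_SIZE):
        # One "vector" step: a whole block of contiguous channels at once.
        out[d:d + VEC_SIZE] = [
            out[d + i] + bilinear_term(ins, w, d + i) for i in range(VEC_SIZE)
        ]
    # Remainder: fewer than VEC_SIZE channels are left.
    for c in range(main_end, channels):
        out[c] += bilinear_term(ins, w, c)


channels = 19  # deliberately not a multiple of VEC_SIZE
ins = [[float(k * 100 + c) for c in range(channels)] for k in range(4)]
w = (0.1, 0.2, 0.3, 0.4)

out_scalar = [0.0] * channels
out_blocked = [0.0] * channels
for _ in range(2):  # two sampling points accumulate into the same output
    accumulate_scalar(out_scalar, ins, w)
    accumulate_blocked(out_blocked, ins, w)

print(out_scalar == out_blocked)
```

The blocked loop only works because the channels of one sampling position are contiguous, which is the channels last layout. The tail loop covers channel counts that are not a multiple of the vector width.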
From a performance point of view, the two approaches should give similar results.
[Discussion Needed]: need to decide which way to go.
Experiment Results
A demo shows the performance improvement from channels last support on the model fast_rcnn_R_50_FPN_1x from detectron2:
export DETECTRON2_DATASETS=../datasets
python benchmark.py --config-file ../configs/COCO-Detection/fast_rcnn_R_50_FPN_1x.yaml --task eval
torch: 1.13.0a0
torchvision: 0.14.0a0
detectron2: 0.6
cpu: Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
| time of 300 iters (unit: s) | NCHW (before) | NCHW (after) | NHWC (after) | SpeedUp |
| --- | --- | --- | --- | --- |
| single core (C=1) | 638.21 | 639.01 | 503.04 | 126.87% |
| single socket (C=20) | 212.10 | 141.06 | 102.54 | 206.84% |
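The SpeedUp column appears to be the NCHW (before) time divided by the NHWC (after) time. That reading is an assumption, but it reproduces the reported percentages to within the last digit. A small check using the timings from the table:

```python
# Timings copied from the table above (seconds for 300 iterations).
timings = {
    "single core (C=1)": {"nchw_before": 638.21, "nchw_after": 639.01, "nhwc_after": 503.04},
    "single socket (C=20)": {"nchw_before": 212.10, "nchw_after": 141.06, "nhwc_after": 102.54},
}


def speedup_percent(before, after):
    """Speedup expressed as before / after, in percent."""
    return 100.0 * before / after


for name, t in timings.items():
    total = speedup_percent(t["nchw_before"], t["nhwc_after"])
    # NCHW (after) vs NCHW (before) isolates the gain from parallelization alone.
    parallel_only = speedup_percent(t["nchw_before"], t["nchw_after"])
    print(f"{name}: total {total:.2f}%, NCHW-only {parallel_only:.2f}%")
```

Comparing NCHW (before) with NCHW (after) shows that on a single core the NCHW path is essentially unchanged, while on 20 cores parallelization alone already gives a noticeable gain before the layout change is applied.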
Breakdown
Here is the performance breakdown of NCHW (before) vs. NHWC (after); the first table is NCHW (before) and the second is NHWC (after):
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::conv2d 0.32% 676.582ms 41.71% 88.386s 4.830ms 18300
aten::convolution 0.05% 109.323ms 41.67% 88.300s 4.825ms 18300
aten::_convolution 0.09% 183.509ms 41.61% 88.168s 4.818ms 18300
aten::mkldnn_convolution 41.48% 87.890s 41.54% 88.018s 4.810ms 18300
torchvision::roi_align 38.33% 81.228s 38.99% 82.621s 68.850ms 1200
aten::linear 0.00% 7.534ms 5.33% 11.291s 9.410ms 1200
aten::addmm 5.11% 10.821s 5.32% 11.272s 9.393ms 1200
aten::batch_norm 0.03% 64.973ms 4.51% 9.552s 600.729us 15900
aten::_batch_norm_impl_index 0.05% 110.204ms 4.48% 9.502s 597.630us 15900
aten::native_batch_norm 4.40% 9.314s 4.43% 9.396s 590.974us 15900
aten::add_ 2.06% 4.372s 2.06% 4.372s 910.892us 4800
aten::relu_ 0.04% 74.794ms 1.76% 3.733s 253.958us 14700
aten::clamp_min_ 1.73% 3.669s 1.73% 3.669s 249.608us 14700
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::conv2d 0.95% 970.076ms 61.52% 63.082s 3.447ms 18300
aten::convolution 0.12% 121.816ms 61.45% 63.008s 3.443ms 18300
aten::_convolution 0.14% 140.402ms 61.33% 62.890s 3.437ms 18300
aten::mkldnn_convolution 60.99% 62.543s 61.08% 62.634s 3.423ms 18300
aten::batch_norm 0.06% 57.762ms 10.56% 10.826s 680.901us 15900
aten::_batch_norm_impl_index 0.12% 126.712ms 10.51% 10.775s 677.660us 15900
aten::native_batch_norm 10.29% 10.547s 10.38% 10.648s 669.700us 15900
aten::linear 0.01% 6.772ms 8.98% 9.205s 7.671ms 1200
aten::addmm 8.77% 8.994s 8.96% 9.185s 7.654ms 1200
aten::add_ 4.60% 4.718s 4.60% 4.718s 982.928us 4800
aten::relu_ 0.07% 69.159ms 3.80% 3.900s 265.290us 14700
aten::clamp_min_ 3.75% 3.841s 3.75% 3.841s 261.263us 14700
torchvision::roi_align 1.61% 1.655s 2.25% 2.304s 1.920ms 1200
We can see that the performance improvement primarily comes from:
- torchvision::roi_align: time reduced from 82.6s to 2.3s, due to parallelization and vectorization.
- aten::conv2d: time reduced from 88.4s to 63.1s, because on channels last the mkldnn reorders on activations are saved.
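For reference, the absolute savings and speedup factors behind these two items can be recomputed from the "CPU total" column of the profiler tables. This is a small arithmetic sketch; the `summarize` helper is hypothetical.

```python
# "CPU total" values in seconds, copied from the two profiler tables above
# as (NCHW before, NHWC after).
cpu_total = {
    "torchvision::roi_align": (82.621, 2.304),
    "aten::conv2d": (88.386, 63.082),
}


def summarize(before, after):
    """Return (seconds saved, speedup factor) for one operator."""
    return before - after, before / after


for op, (before, after) in cpu_total.items():
    saved, ratio = summarize(before, after)
    print(f"{op}: saved {saved:.1f}s, {ratio:.1f}x faster")
```

By these numbers roi_align accounts for the larger share of the saved time, with conv2d contributing most of the rest.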
Additional
[Discussion Needed]: need to decide the details of performance benchmarking, such as:
- models: use benchmark.py from detectron2 or use torch-bench?
- configs: single core and multi core? Which CPU type?
[Discussion Needed]: test cases: we will add new test cases to the corresponding modules under vision/test when making pull requests. What else is needed?