🚀 The feature
Looking at the implementation of roi_align_kernel, it seems that it could be further optimized with OpenMP parallelization. The kernel already carries a commented-out pragma:
// #pragma omp parallel for num_threads(32)
Here's what can be done to get a performance boost:
- Add #pragma omp parallel for to the kernel (line 27)
- Add -fopenmp to the CFLAGS used for compilation
- Call torch.set_num_threads() with the desired number of OMP threads (on the test/workload side).
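As a rough illustration of the first two steps, here is a minimal, self-contained sketch of a per-ROI loop parallelized with OpenMP. It is not the torchvision kernel: `Roi` and `pool_rois` are hypothetical stand-ins, and the sketch assumes (as seems to be the case for roi_align) that each ROI writes only to its own output slot, so iterations are independent.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one region of interest: a half-open index
// range [begin, end) into a flat input buffer.
struct Roi {
    std::size_t begin;  // first input index covered by the ROI (inclusive)
    std::size_t end;    // one past the last covered index (exclusive)
};

// Hypothetical stand-in for the per-ROI work: averages the input values
// covered by each ROI and stores one result per ROI.
std::vector<float> pool_rois(const std::vector<float>& input,
                             const std::vector<Roi>& rois) {
    std::vector<float> output(rois.size(), 0.0f);
    const long n_rois = static_cast<long>(rois.size());

    // Each iteration only reads shared data and writes output[n], so the
    // iterations can be split across threads without synchronization.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long n = 0; n < n_rois; ++n) {
        const Roi& roi = rois[static_cast<std::size_t>(n)];
        float sum = 0.0f;
        for (std::size_t i = roi.begin; i < roi.end; ++i) {
            sum += input[i];
        }
        const std::size_t count = roi.end - roi.begin;
        output[static_cast<std::size_t>(n)] =
            count > 0 ? sum / static_cast<float>(count) : 0.0f;
    }
    return output;
}
```

With GCC this would be built with something like g++ -O2 -fopenmp; the thread count is then chosen by the OpenMP runtime unless it is set explicitly.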
Motivation, pitch
I ran a local experiment in which I:
- Added this optimization
- Built a small test case that calls roi_align
- Profiled torchvision.ops.roi_align() and compared the run time of the current implementation against 18 threads on a simple CLX machine.
In my modest experiments, this showed a 10X performance boost!
Alternatives
Other libraries or tooling could also be used to optimize this CPU kernel, for example oneTBB or something similar.
Nevertheless, the current implementation is really naive and could easily be made much more performant.
Additional context
No response
Code reference: vision/torchvision/csrc/ops/cpu/roi_align_kernel.cpp, line 27 in 840ad8a