torchvision.roi_align performance optimization with OpenMP #4935

### 🚀 The feature

Looking at the implementation of `roi_align_kernel`, it looks like it could be further optimized with OpenMP parallelization:

https://github.com/pytorch/vision/blob/840ad8abd60b76d340ae0bde33e2230fad38e95a/torchvision/csrc/ops/cpu/roi_align_kernel.cpp#L27

Here's what can be done to get a performance boost:
1. Add `#pragma omp parallel for` to the kernel (line 27).
2. Add `-fopenmp` as a CFLAG to the compilation.
3. Set `torch.set_num_threads()` to the desired number of OMP threads (on the test/workload side).
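To illustrate step 1, here is a minimal sketch of the loop shape. It is not torchvision's actual `roi_align_kernel`: `Roi` and `pool_rois` are hypothetical names, and the per-ROI work is reduced to a mean over a 1-D range. The point is only that each ROI writes to its own output slot, so the outer loop over ROIs should be safe to split across threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for an ROI pooling kernel.
// Each ROI is a half-open range [begin, end) over a 1-D feature vector.
struct Roi {
    std::size_t begin;
    std::size_t end;
};

// Computes the mean of the features covered by each ROI.
std::vector<float> pool_rois(const std::vector<float>& features,
                             const std::vector<Roi>& rois) {
    std::vector<float> out(rois.size(), 0.0f);
    const long n_rois = static_cast<long>(rois.size());

    // Every iteration writes only out[n], so iterations are independent.
    // If the compiler is not given -fopenmp, the pragma is ignored and the
    // loop simply runs serially.
    #pragma omp parallel for
    for (long n = 0; n < n_rois; ++n) {
        const Roi& r = rois[static_cast<std::size_t>(n)];
        assert(r.end <= features.size());  // precondition: ROI is in bounds
        if (r.end <= r.begin) {
            continue;  // empty ROI: leave the output at 0
        }
        float sum = 0.0f;
        for (std::size_t i = r.begin; i < r.end; ++i) {
            sum += features[i];
        }
        out[static_cast<std::size_t>(n)] =
            sum / static_cast<float>(r.end - r.begin);
    }
    return out;
}
```

Built with something like `g++ -O2 -fopenmp`, the outer loop is distributed over the available OpenMP threads; without the flag, the same code compiles and gives the same results on a single thread. Whether the real kernel's per-ROI work is as cleanly independent as in this sketch would need to be checked against the actual source.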



### Motivation, pitch

I did some local experimentation, in which I:
- Added this optimization
- Built a small test case that calls `roi_align`
- Profiled `torchvision.ops.roi_align()` and measured its run time with the current implementation vs. the OpenMP version running with 18 threads on a simple CLX machine

In my humble experiments, this showed a 10X performance boost!


### Alternatives

Other libraries or tools could also be used to optimize this CPU kernel, for example oneTBB or something similar.
Nevertheless, the current implementation is really naive and could easily be made much more performant.

### Additional context

_No response_
