torchvision.roi_align performance optimization with openMP  #4935

@gal-star

🚀 The feature

Looking at the implementation of roi_align_kernel, it seems it could be further optimized with OpenMP parallelization:

// #pragma omp parallel for num_threads(32)

Here's what can be done to get a performance boost:

  1. Add #pragma omp parallel for to the kernel (line 27)
  2. Add -fopenmp to the compilation flags (CFLAGS)
  3. Set torch.set_num_threads() to the desired number of OMP threads (on the test/workload side).
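
To illustrate steps 1 and 2, here is a minimal sketch of the pattern. It is not the torchvision kernel: `Roi` and `pool_rois` are hypothetical names made up for this example, and the "pooling" is a plain mean over a 1-D range. The point is only that each iteration writes to its own output slot, which is the property that makes `#pragma omp parallel for` safe on the outer loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-ROI pooling loop; NOT the torchvision kernel.
// Each ROI is a [begin, end) range over a 1-D feature vector, and the output
// for ROI n is the mean of the features in that range.
struct Roi {
  std::size_t begin;
  std::size_t end;
};

std::vector<float> pool_rois(const std::vector<float>& features,
                             const std::vector<Roi>& rois) {
  std::vector<float> out(rois.size(), 0.0f);
  const long n_rois = static_cast<long>(rois.size());

  // Iterations are independent: iteration n only reads the shared input and
  // writes out[n], so no locking is needed. If the file is compiled without
  // -fopenmp, the pragma is ignored and the loop simply runs serially.
#pragma omp parallel for
  for (long n = 0; n < n_rois; ++n) {
    const Roi& r = rois[n];
    if (r.end <= r.begin) continue;  // empty ROI: leave the output at 0
    float sum = 0.0f;
    for (std::size_t i = r.begin; i < r.end; ++i) {
      sum += features[i];
    }
    out[n] = sum / static_cast<float>(r.end - r.begin);
  }
  return out;
}
```

With GCC or Clang, something like `g++ -O2 -fopenmp example.cpp` enables the pragma. Whether the real kernel's loop at line 27 has the same no-shared-writes property would need to be checked before applying this.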

Motivation, pitch

I did some local experimentation in which I:

  • Added this optimization
  • Built a small test case that calls roi_align
  • Profiled torchvision.ops.roi_align() and measured its run time with the current implementation vs. the OpenMP version using 18 threads on a simple CLX machine.

In my modest experiments, this showed a 10X performance boost!

Alternatives

There may be other libraries or tools that could optimize this CPU kernel, such as oneTBB or something similar.
Nevertheless, the current implementation is really naive and could easily be made much more performant.

Additional context

No response
