Reimplement torch::flip based on advanced indexing #56713

Closed
wants to merge 23 commits into master from andfoy:improve_flip

Conversation

@andfoy (Collaborator) commented Apr 22, 2021

Rationale

This PR improves the performance of torch::flip by using TensorIterator in the same fashion as AdvancedIndexing, which means that this implementation is semantically equivalent to indexing a tensor with reversed indices, A[dim0_size - 1:0 ..., dimN_size - 1:0, ...].
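
For a quick sanity check of that equivalence (plain PyTorch, independent of this PR's kernel):

import torch

x = torch.arange(24).reshape(2, 3, 4)
i1 = torch.arange(x.size(1) - 1, -1, -1)  # reversed indices for dim 1
i2 = torch.arange(x.size(2) - 1, -1, -1)  # reversed indices for dim 2
assert torch.equal(torch.flip(x, [1, 2]), x[:, i1][:, :, i2])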

Benchmark results

The following benchmark compares the runtime of this implementation of flip against the current implementation, AdvancedIndexing with reversed indices, and the OpenCV one. The comparison scenarios consider a 4D tensor [B, C, H, W], where the flipped dimensions correspond to H (vertical flip) and W (horizontal flip), under float32 and uint8 datatypes.

The benchmark implementation details can be found in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/benchmarks.py. Additionally, there are correctness tests against the current flip implementation in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/main.cpp, which cover different layouts, datatypes, and contiguous/non-contiguous tensors.

The following plots correspond to the mean runtime of each operator over 100 samples. As can be observed, the latest implementation of flip has a runtime similar to the indexing-based one, with performance gains of up to 6x in some scenarios.

Horizontal flip (float)

[benchmark plot]

Horizontal flip (uint8)

[benchmark plot]

Vertical flip (float)

[benchmark plot]

Vertical flip (uint8)

[benchmark plot]

cc @fmassa @vfdev-5

@facebook-github-bot (Contributor) commented Apr 22, 2021

💊 CI failures summary and remediations

As of commit ab66825 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



@ailzhang ailzhang requested a review from wenleix April 23, 2021 04:36
@ailzhang ailzhang added the triaged label Apr 23, 2021
@wenleix wenleix requested a review from ngimel April 27, 2021 06:08
@wenleix (Contributor) commented Apr 27, 2021

Thanks @andfoy, wondering why leveraging TensorIterator makes it fast?

@fmassa (Member) left a comment

This generally looks pretty good, thanks!

I've left a few comments.

Also, could you run the performance numbers once more but this time from the PyTorch version that you compiled, just to double-check that under the same compilation flags we get the same speed-up as reported?

Additionally, it might make sense to see if moving this code to the native/cpu/ folder would bring speed-ups, as it would be compiled with the -mavx and -mavx2 flags, potentially allowing for further compiler optimizations.

Comment on lines 29 to 31
int64_t offset = *(int64_t*)&indexers[0][idx * indexer_strides[0]];
for (int j = 1; j < num_indexers; j++) {
offset += *(int64_t*)&indexers[j][idx * indexer_strides[j]];

For the future: we could look into specializing this when num_indexers == 1; it could bring additional performance improvements.

andfoy added a commit to andfoy/flip-benchmarks that referenced this pull request Apr 29, 2021
@andfoy (Collaborator, Author) commented Apr 29, 2021

These are the benchmark results for commit 0b63646 against the current torch.flip implementation. The comparison was done by declaring both implementations in the same PyTorch build (master...andfoy:benchmark_flip). As can be observed, the results are similar to those presented initially.

Horizontal flip (float)

[benchmark plot]

Horizontal flip (uint8)

[benchmark plot]

Vertical flip (float)

[benchmark plot]

Vertical flip (uint8)

[benchmark plot]

@andfoy (Collaborator, Author) commented Apr 29, 2021

Wondering why leveraging TensorIterator makes it fast?

@wenleix My guess here would be that by precomputing the indices to flip, the per-element cost of this loop is removed:

for (int64_t d = 0; d < total_dims; d++) {
      int64_t temp = cur_indices;
      cur_indices = cur_indices / stride_contiguous_v[d];
      rem = temp - cur_indices * stride_contiguous_v[d];
      dst_offset += flip_dims_b[d] ? (sizes_v[d] - 1 - cur_indices) * strides_v[d] : cur_indices * strides_v[d];
      cur_indices = rem;
}

However, I could be wrong here.
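
A rough Python analogy of the two strategies (for intuition only; neither line below is the actual kernel):

import torch

x = torch.rand(4, 5, 6)

# Indexing/TensorIterator route: the reversed index for the flipped dim is built once...
rev = torch.arange(x.size(2) - 1, -1, -1)
# ...and the copy is then a plain gather along that dim, with no per-element div/mod bookkeeping.
out = x.index_select(2, rev)

assert torch.equal(out, torch.flip(x, [2]))

The current kernel instead runs the div/mod chain above for every output element to recover its multi-dimensional coordinate before mirroring it.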

@ngimel (Collaborator) commented Apr 29, 2021

@andfoy what do you think of this approach https://github.com/pytorch/pytorch/compare/master...ngimel:flip?expand=1, where TI is used directly, without indexing tensors?
The advantage is that it can very easily be extended to CUDA too (a short sketch of the stride-flipping idea follows the benchmarking script below). Perf benchmarks compared to the existing flip:
Before (time in us):

[-------------------- flip -------------------]
                       |   dim=1    |   dim=2  
1 threads: ------------------------------------
      (7, 112, 3)      |      80.4  |      73.2
      (28, 28, 3)      |      73.4  |      74.0
      (112, 7, 3)      |      71.6  |      74.5
      (8, 2048, 3)     |    1559.4  |    1501.0
      (128, 128, 3)    |    1487.5  |    1488.6
      (2048, 8, 3)     |    1489.0  |    1480.9
      (5, 102400, 3)   |   46704.5  |   46615.1
      (800, 640, 3)    |   46453.5  |   46702.9
      (128000, 4, 3)   |   46944.7  |   47343.2
      (4, 196608, 3)   |   72009.5  |   73965.6
      (1024, 768, 3)   |   72100.0  |   73281.6
      (262144, 3, 3)   |   70512.6  |   72408.6
      (16, 129600, 3)  |  189054.7  |  201812.7
      (1920, 1080, 3)  |  184979.7  |  224558.0
      (230400, 9, 3)   |  197751.2  |  195470.6

After

[------------------ flip -----------------]
                       |  dim=1   |  dim=2 
1 threads: --------------------------------
      (7, 112, 3)      |     3.7  |     4.0
      (28, 28, 3)      |     4.1  |     4.3
      (112, 7, 3)      |     4.8  |     5.0
      (8, 2048, 3)     |    20.1  |    27.6
      (128, 128, 3)    |    22.4  |    35.2
      (2048, 8, 3)     |    41.4  |    50.4
      (5, 102400, 3)   |   628.8  |   731.0
      (800, 640, 3)    |   651.5  |   939.0
      (128000, 4, 3)   |  1698.2  |  2766.2
      (4, 196608, 3)   |  1130.8  |  1145.4
      (1024, 768, 3)   |  1030.7  |  1643.5
      (262144, 3, 3)   |  3464.7  |  5177.9
      (16, 129600, 3)  |  3583.6  |  4012.7
      (1920, 1080, 3)  |  4049.7  |  4525.1
      (230400, 9, 3)   |  5824.7  |  6371.6

Benchmarking script:


import torch
from torch.utils.benchmark import Timer
from torch.utils.benchmark import Compare
sizes = [
    (7, 112, 3),
    (28, 28, 3),
    (112, 7, 3),

    (8, 2048, 3),
    (128, 128, 3),
    (2048, 8, 3),

    (5, 102400, 3),
    (800, 640, 3),
    (128000, 4, 3),

    (4, 196608, 3),
    (1024, 768, 3),
    (262144, 3, 3),

    (16, 129600, 3),
    (1920, 1080, 3),
    (230400, 9, 3),

    # (16, 518400, 3),
    # (3840, 2160, 3),
    # (921600, 9, 3),
]
results = []

for size in sizes:
    H, W, C = size
    inp = torch.rand(C, H, W)
    t1 = Timer(stmt="torch.flip(inp, [1])", sub_label=f"{size}", description="dim=1", label="flip", globals=globals())
    t2 = Timer(stmt="torch.flip(inp, [2])", sub_label=f"{size}", description="dim=2", label="flip", globals=globals())
    for t in (t1, t2):
        results.append(t.blocked_autorange())

comparison = Compare(results)
comparison.print()
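
For intuition on what "flipping strides" buys: flipping a dim amounts to negating its stride and shifting the base offset. NumPy exposes this as a view; PyTorch does not allow negative strides, which is why flip has to write out a copy and why the stride flip is applied inside TI while producing the output. A minimal NumPy sketch of the idea:

import numpy as np

a = np.arange(12, dtype=np.float32).reshape(3, 4)
v = a[:, ::-1]                                      # last dim flipped, as a view
assert v.strides == (a.strides[0], -a.strides[1])   # only the stride sign changes
assert np.shares_memory(a, v)                       # no data is copied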


@andfoy (Collaborator, Author) commented Apr 29, 2021

Since it can scale to CUDA easily, and the changes are much simpler than the ones proposed here, I think it is a good option. Following that idea, would it basically be a loop over the dimensions to flip, called before launching the actual kernel?

for(int64_t i = 0; i < total_dims; i++) {
   if(flip_dims_b[i]) {
      iter.flip_strides(0, i);
   }
}

@ngimel (Collaborator) commented Apr 29, 2021

No, flip_strides has to be called only once, and its implementation in TensorIterator flips all the necessary dimensions. I agree that if we could do it in a loop like you propose it would be conceptually cleaner, but the reason it has to be done all at once is that after TI is built, the dims being flipped are no longer the dims that were originally specified, because TensorIterator coalesces dimensions that it can view as one larger dim.
Imagine there's a 3D tensor where you want to flip the last dim. If the input is contiguous, TensorIterator will know that it can't collapse the last dimension because it will be flipped, but it will collapse the first 2 dimensions and view the tensor as a 2D tensor of size (size0*size1, size2). So, when flipping strides, you should no longer flip the 2nd (0-based) stride, you need to flip the 1st! Luckily, we are sending a dummy tensor to TensorIterator that tracks which dimensions actually have to be flipped even after coalescing.
The code I'm proposing is very sparsely tested, so I won't be surprised if there are bugs, don't hold it against me :-)
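
The effect of that coalescing can be checked from Python: for a contiguous 3D tensor, merging the two leading dims commutes with flipping the last dim, so in the coalesced 2D view the dim to flip is dim 1 rather than dim 2 (a small sanity check, not the TensorIterator internals):

import torch

x = torch.arange(2 * 3 * 4).reshape(2, 3, 4)                 # contiguous
via_3d = torch.flip(x, [2])
via_2d = torch.flip(x.reshape(6, 4), [1]).reshape(2, 3, 4)   # flip on the coalesced view
assert torch.equal(via_3d, via_2d)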

@andfoy (Collaborator, Author) commented Apr 29, 2021

Thanks for the clarification @ngimel! I'll do a run of your changes against the tests that I have on the other repo.

@fmassa (Member) commented Apr 29, 2021

@ngimel I like your approach of changing TensorIterator directly, but given how widely used it is, I wonder if it would be OK to extend its API for a single function to use it?

@andfoy (Collaborator, Author) commented May 4, 2021

I have a question regarding the quantized call for the new flip kernel: should I duplicate the code under quantized/cpu, or should I copy it back to TensorTransformations.cpp and remove FlipKernel.cpp?

@ngimel (Collaborator) commented May 4, 2021

Did you verify that duplicating code under native/cpu actually improves performance compared to keeping it in just native?

@andfoy (Collaborator, Author) commented May 4, 2021

@ngimel, let me check the performance comparison; if the performance is on par or the gains are marginal, then I'll keep the kernel under TensorTransformations.

@andfoy (Collaborator, Author) commented May 4, 2021

I checked the benchmark results, and the differences are not significant, which means that we can leave the kernel in TensorTransformations

@codecov bot commented May 4, 2021

Codecov Report

Merging #56713 (ab66825) into master (b587354) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #56713      +/-   ##
==========================================
+ Coverage   76.84%   76.86%   +0.02%     
==========================================
  Files        1986     1986              
  Lines      197354   197384      +30     
==========================================
+ Hits       151661   151728      +67     
+ Misses      45693    45656      -37     

@andfoy (Collaborator, Author) commented May 4, 2021

The error in ROCm seems to be unrelated to this PR

@andfoy (Collaborator, Author) commented May 5, 2021

@ngimel @fmassa @wenleix This one is ready for a final review

@facebook-github-bot (Contributor) commented:

@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

// numbers get more balanced work load and a better cache location. The grain
// size here is chosen by the op benchmark to overcome the thread launch
// overhead. This value was taken from the AdvancedIndexing kernel.
const int index_parallel_grain_size = 3000;

@andfoy any gains using this value vs the default one?

@andfoy (Collaborator, Author) replied:

Let me check!

@andfoy (Collaborator, Author) commented May 10, 2021

These are the benchmark results for commit 118d256, where the default GRAIN_SIZE (32768) is compared against the custom value used in this PR (3000). The comparison was done by exposing the grain_size as a parameter to flip (master...andfoy:benchmark_grain_size). As can be observed, the custom value seems to lower the runtime relative to the default. All the benchmark values were computed with parallelism enabled.

Horizontal flip (float)

[benchmark plot]

Horizontal flip (uint8)

[benchmark plot]

Vertical flip (float)

[benchmark plot]

Vertical flip (uint8)

[benchmark plot]

@facebook-github-bot (Contributor) commented:

@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@fmassa (Member) left a comment

This looks good to me, thanks!

I only have one comment which I think would be good to address; otherwise I think this is good for merge.

Let me know what you think

@@ -13,81 +13,145 @@ namespace native {

constexpr size_t dim_bitset_size = 64;

Tensor build_index(Tensor input, int64_t flip_dim) {

Can we add all these internal functions inside an anonymous namespace? Given that those names are very generic, there could potentially be conflicts with other files.

See for example how it's done in

So that build_index, build_indices_loop, make_index_iterator and Indexer all end up in the private namespace.

Thoughts?

@facebook-github-bot (Contributor) commented:

@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@fmassa (Member) left a comment

Thanks!

@facebook-github-bot (Contributor) commented:

@fmassa merged this pull request in 30f26c5.

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021

Pull Request resolved: pytorch#56713

Reviewed By: datumbox

Differential Revision: D28255088

Pulled By: fmassa

fbshipit-source-id: 5b8684812357c331e83a677b99cf0d78f0821678
@andfoy andfoy deleted the improve_flip branch May 24, 2021 23:12
Labels: cla signed, Merged, open source, triaged

8 participants