Reimplement torch::flip based on advanced indexing #56713
Conversation
💊 CI failures summary and remediations: as of commit ab66825 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI.
Thanks @andfoy, wondering why leveraging TensorIterator makes it fast?
This generally looks pretty good, thanks!
I've left a few comments.
Also, could you run the performance numbers once more but this time from the PyTorch version that you compiled, just to double-check that under the same compilation flags we get the same speed-up as reported?
Additionally, it might make sense to check whether moving this code to the native/cpu/ folder would bring further speed-ups, as it would be compiled with the -mavx and -mavx2 flags, potentially allowing for further compiler optimizations.
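As context, a rough hedged sketch of how code under native/cpu/ is typically wired up through ATen's DispatchStub, so that the kernel file can be compiled once per CPU capability. The stub and kernel names below are illustrative, not necessarily what this PR would use, and the two files are shown together rather than as a single compilable unit.

```cpp
// --- generic file, e.g. aten/src/ATen/native/SomeOp.cpp ---------------------
// (illustrative names; the real PR may differ)
#include <ATen/native/DispatchStub.h>
#include <ATen/native/TensorIterator.h>

namespace at { namespace native {
using flip_fn = void (*)(TensorIterator&);
DECLARE_DISPATCH(flip_fn, flip_stub);   // usually placed in a header
DEFINE_DISPATCH(flip_stub);
}} // namespace at::native

// --- per-capability file, e.g. aten/src/ATen/native/cpu/SomeOpKernel.cpp ----
// Compiled once per CPU capability (e.g. with -mavx / -mavx2); the best
// variant is selected at runtime.
namespace at { namespace native {
namespace {
void flip_kernel_impl(TensorIterator& iter) {
  // vectorizable inner loop would go here
}
} // anonymous namespace

REGISTER_DISPATCH(flip_stub, &flip_kernel_impl);
}} // namespace at::native
```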
```cpp
int64_t offset = *(int64_t*)&indexers[0][idx * indexer_strides[0]];
for (int j = 1; j < num_indexers; j++) {
  offset += *(int64_t*)&indexers[j][idx * indexer_strides[j]];
}
```
For the future: we could look into specializing this when num_indexers == 1; it could bring additional performance improvements.
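A hedged sketch of what such a specialization could look like (hypothetical helper, not code from this PR; `consume` stands in for whatever the kernel does with the computed offset):

```cpp
#include <cstdint>

// Specialized inner loop for the num_indexers == 1 case: the accumulation
// over indexers collapses to a single strided load per element.
template <typename Consume>
void offsets_single_indexer(const char* indexer, int64_t indexer_stride,
                            int64_t n, Consume consume) {
  for (int64_t idx = 0; idx < n; idx++) {
    const int64_t offset =
        *reinterpret_cast<const int64_t*>(indexer + idx * indexer_stride);
    consume(idx, offset);  // the general path instead sums over all indexers
  }
}
```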
These are the benchmark results for commit 0b63646 against the current implementation: plots for Horizontal flip (float), Horizontal flip (uint8), Vertical flip (float), and Vertical flip (uint8).
@wenleix My guess here would be that, by precomputing the indices to flip, the runtime cost of this per-element loop is removed:

```cpp
for (int64_t d = 0; d < total_dims; d++) {
  int64_t temp = cur_indices;
  cur_indices = cur_indices / stride_contiguous_v[d];
  rem = temp - cur_indices * stride_contiguous_v[d];
  dst_offset += flip_dims_b[d] ? (sizes_v[d] - 1 - cur_indices) * strides_v[d]
                               : cur_indices * strides_v[d];
  cur_indices = rem;
}
```

However, I could be wrong here.
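For reference, a hedged sketch of the precomputation idea (not the PR's actual `build_index`): the reversed index for a flipped dimension is materialized once, reshaped so it broadcasts against the input, and then consumed by the advanced-indexing kernel instead of decomposing every linear offset as above.

```cpp
#include <ATen/ATen.h>
#include <vector>

// Build the index [n-1, n-2, ..., 0] for the flipped dimension and reshape it
// so that it broadcasts against `input` in an advanced-indexing call.
at::Tensor build_reversed_index(const at::Tensor& input, int64_t flip_dim) {
  const int64_t n = input.size(flip_dim);
  at::Tensor idx = at::arange(n - 1, -1, -1, input.options().dtype(at::kLong));
  std::vector<int64_t> shape(input.dim(), 1);
  shape[flip_dim] = n;  // e.g. {1, 1, n, 1} for a 4D input flipped on dim 2
  return idx.view(shape);
}
```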
@andfoy what do you think of this approach https://github.com/pytorch/pytorch/compare/master...ngimel:flip?expand=1 where TI is used directly, without indexing tensors?
Benchmarking script: (collapsed)
Since it can scale to CUDA easily, and the changes are much simpler than the ones proposed here, I think it is a good option. Along those lines, would a loop over the dimensions to flip be called before invoking the actual kernel?

```cpp
for (int64_t i = 0; i < total_dims; i++) {
  if (flip_dims_b[i]) {
    iter.flip_strides(0, i);
  }
}
```
No,
Thanks for the clarification @ngimel! I'll do a run of your changes against the tests that I have on the other repo.
@ngimel I like your approach of changing TensorIterator directly, but given how widely used it is, I wonder if it would be OK to extend its API for a single function to use it?
I have a question regarding the quantized call for the new flip kernel: should I duplicate this code under `native/cpu`?
Did you verify that duplicating code under native/cpu actually improves performance compared to keeping it in just native?
@ngimel, let me check the performance comparison; if the performance is on par or the gains are marginal, then I'll keep the kernel under `native`.
I checked the benchmark results, and the differences are not significant, which means that we can leave the kernel in `native`.
Codecov Report

```
@@            Coverage Diff             @@
##           master   #56713      +/-   ##
==========================================
+ Coverage   76.84%   76.86%   +0.02%
==========================================
  Files        1986     1986
  Lines      197354   197384      +30
==========================================
+ Hits       151661   151728      +67
+ Misses      45693    45656      -37
```
The error in ROCm seems to be unrelated to this PR.
@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
// numbers get more balanced work load and a better cache location. The grain
// size here is chosen by the op benchmark to overcome the thread launch
// overhead. This value was taken from the AdvancedIndexing kernel.
const int index_parallel_grain_size = 3000;
```
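For context on the question below, a grain size in ATen controls when `at::parallel_for` splits a range across the intra-op thread pool; a minimal hedged illustration (not the PR's actual loop):

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// With fewer than grain-size elements of work the whole range runs on a
// single thread, avoiding thread-launch overhead; larger ranges are chunked
// across the intra-op thread pool.
void process_all(int64_t numel) {
  constexpr int64_t index_parallel_grain_size = 3000;
  at::parallel_for(0, numel, index_parallel_grain_size,
                   [&](int64_t begin, int64_t end) {
    for (int64_t idx = begin; idx < end; idx++) {
      // per-element work would go here
    }
  });
}
```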
@andfoy any gains using this value vs. the default one?
Let me check!
See #56713 (comment)
These are the benchmark results for commit 118d256, where the default grain size is used: plots for Horizontal flip (float), Horizontal flip (uint8), Vertical flip (float), and Vertical flip (uint8).
Co-authored-by: vfdev <[email protected]>
@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This looks good to me, thanks!
I only have one comment with a suggestion that I think would be good to have; otherwise I think this is good to merge.
Let me know what you think.
```diff
@@ -13,81 +13,145 @@ namespace native {
 
 constexpr size_t dim_bitset_size = 64;
 
 Tensor build_index(Tensor input, int64_t flip_dim) {
```
Can we add all these internal functions inside an anonymous namespace? Given that those names are very generic, there could potentially be conflicts with other files.
See for example how it's done elsewhere with `namespace {`, so that `build_index`, `build_indices_loop`, `make_index_iterator` and `Indexer` are all in the private namespace.
Thoughts?
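A minimal sketch of the suggested structure (helper names taken from this thread; the bodies are placeholders, not the PR's code):

```cpp
#include <ATen/ATen.h>

namespace at { namespace native {
namespace {

// Internal linkage: these generic names stay private to this translation
// unit, so they cannot collide with symbols from other .cpp files.
struct Indexer { /* fields elided */ };

Tensor build_index(const Tensor& input, int64_t /*flip_dim*/) {
  // placeholder body; the real helper is in the PR's diff
  return input;
}

} // anonymous namespace

// Only the public operator entry point (e.g. Tensor flip(const Tensor&,
// IntArrayRef)) would remain visible outside this file.

}} // namespace at::native
```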
@fmassa has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thanks!
Summary: see the PR description below.
Pull Request resolved: pytorch#56713
Reviewed By: datumbox
Differential Revision: D28255088
Pulled By: fmassa
fbshipit-source-id: 5b8684812357c331e83a677b99cf0d78f0821678
Rationale
This PR improves the performance of `torch::flip` by using `TensorIterator` in the same fashion as `AdvancedIndexing`. This means the implementation is semantically equivalent to indexing a tensor with reversed indices, `A[dim0_size - 1:0 ..., dimN_size - 1:0, ...]`.
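To make the equivalence concrete, a small hedged example using the C++ frontend (shapes and dtype here are arbitrary):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  auto a = torch::arange(24).reshape({2, 3, 4});

  // Flipping the last dimension ...
  auto flipped = torch::flip(a, {2});

  // ... matches advanced indexing with a reversed index tensor on that dim.
  auto rev = torch::arange(a.size(2) - 1, -1, -1, torch::kLong);
  auto indexed = a.index({torch::indexing::Slice(),
                          torch::indexing::Slice(), rev});

  std::cout << std::boolalpha << flipped.equal(indexed) << std::endl;  // true
}
```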
Benchmark results
The following benchmark compares the runtime of this implementation of `flip` against the current implementation, AdvancedIndexing with reversed indices, and the OpenCV one. The comparison scenarios consider a 4D tensor `[B, C, H, W]`, where the flipped dimensions correspond to `H` (vertical flip) and `W` (horizontal flip), under float32 and uint8 dtypes.
The benchmark implementation details can be found in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/benchmarks.py. Additionally, there are correctness tests against the current flip implementation in https://github.com/andfoy/flip-benchmarks/blob/main/5_Stable_implementation/main.cpp, which cover different layouts, dtypes, and contiguous/non-contiguous tensors.
The following plots show the mean runtime of each operator over 100 samples. The new flip implementation has a runtime similar to the indexing one, and the performance gains reach up to 6x in some scenarios.
Plots: Horizontal flip (float), Horizontal flip (uint8), Vertical flip (float), Vertical flip (uint8).
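For readers without the linked scripts, a minimal hedged sketch of this kind of timing setup (shapes, sizes, and the wall-clock timer here are arbitrary choices, not those of the linked benchmark):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

// Time torch::flip over a [B, C, H, W] tensor and report the mean over 100
// runs; dim 3 is a horizontal flip, dim 2 a vertical flip.
double mean_flip_ms(const torch::Tensor& t, int64_t dim, int runs = 100) {
  double total_ms = 0.0;
  for (int i = 0; i < runs; i++) {
    auto start = std::chrono::steady_clock::now();
    auto out = torch::flip(t, {dim});
    auto end = std::chrono::steady_clock::now();
    total_ms += std::chrono::duration<double, std::milli>(end - start).count();
  }
  return total_ms / runs;
}

int main() {
  auto img = torch::rand({8, 3, 1080, 1920});       // float32
  auto img_u8 = (img * 255).to(torch::kUInt8);      // uint8
  std::cout << "horizontal flip (float): " << mean_flip_ms(img, 3) << " ms\n";
  std::cout << "vertical flip   (uint8): " << mean_flip_ms(img_u8, 2) << " ms\n";
}
```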
cc @fmassa @vfdev-5