
replace tensor division with scalar division and tensor multiplication #6903


Merged

pmeier merged 3 commits into pytorch:main from perf-div on Nov 4, 2022

Conversation

@pmeier pmeier commented Nov 3, 2022

As discussed in #6830 (comment), dividing a tensor by a Python scalar is slower than inverting the Python scalar first and performing a tensor multiplication afterwards (see the sketch after the benchmarks below). The linked comment identified three places where we could use that optimization:

[------------------------------- posterize --------------------------------]
                                      |        main       |      perf-div   
1 threads: -----------------------------------------------------------------
      (3, 512, 512), float32, cpu     |   219 (+-  2) us  |   183 (+-  1) us
      (5, 3, 512, 512), float32, cpu  |  1208 (+- 66) us  |  1123 (+-180) us

Times are in microseconds (us).
[-------------------- convert_dtype (int -> float) --------------------]
                                    |     perf-div     |       main     
1 threads: -------------------------------------------------------------
      (3, 512, 512), uint8, cpu     |   95 (+-  2) us  |  132 (+-  3) us
      (5, 3, 512, 512), uint8, cpu  |  433 (+-  9) us  |  645 (+-  5) us

Times are in microseconds (us).
[----------------------------- adjust_hue -----------------------------]
                                    |     perf-div     |       main     
1 threads: -------------------------------------------------------------
      (3, 512, 512), uint8, cpu     |   15 (+-  1) ms  |   14 (+-  1) ms
      (5, 3, 512, 512), uint8, cpu  |   92 (+-  3) ms  |   90 (+-  4) ms

Times are in milliseconds (ms).

The performance improvement is significant for posterize and convert_dtype. For adjust_hue the change is within measurement tolerance. LMK if we still want this change there.
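
For reference, the optimization just hoists the division onto the Python scalar so that the per-element tensor op becomes a multiplication. A minimal sketch of the pattern (the tensor name and shape here are illustrative, not the actual kernel code):

import torch

image = torch.rand(3, 512, 512)  # illustrative input, shaped like the benchmark cases

# before: element-wise division by a Python scalar
out_div = image / 255.0

# after: invert the scalar once in Python, then do a single tensor multiplication
inv = 1.0 / 255.0
out_mul = image * inv

# the two results agree up to floating point rounding (roughly one float32 ULP)
torch.testing.assert_close(out_div, out_mul)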

Apart from the ops above, there are a few more places that divide by a Python scalar, but they are always accompanied by floor rounding, for example

step = num_non_max_pixels.div_(255, rounding_mode="floor")

Since Tensor.mul does not have that option, we would need an additional .floor_() afterwards, which eliminates the gains.
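
To illustrate, a rough sketch with made-up values (the real tensor is computed inside the kernel; only the pattern matters here):

import torch

num_non_max_pixels = torch.tensor([1000.0, 2500.0, 42.0])  # made-up values for illustration

# current code: a single fused floor division
step = num_non_max_pixels.clone().div_(255, rounding_mode="floor")

# multiplication-based rewrite: needs a separate floor pass, which cancels the speedup
step_alt = num_non_max_pixels.clone().mul_(1 / 255).floor_()

# for these values the results match; exact multiples of 255 could differ due to rounding
torch.testing.assert_close(step, step_alt)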

cc @vfdev-5 @datumbox @bjuncek

@NicolasHug

Are we sure it also provides the same results as a division?
I'm wondering because if this was a free win without trade-off, I assume it would have been implemented like that in core already?

pmeier commented Nov 4, 2022

We have pretty extensive tests that make sure that it does. The only thing I needed to adapt was to allow some tolerances on the ConvertImageDtype consistency test (v1 vs. v2). But this is expected, since there might be some minor implementation differences between division and multiplying by the inverse.

TBH, the difference is really insignificant:

FAILED test/test_prototype_transforms_consistency.py::test_call_consistency[ConvertImageDtype-2] - AssertionError: Tensor image consistency check failed with: 

Tensor-likes are not equal!

Mismatched elements: 1392 / 2772 (50.2%)
Greatest absolute difference: 5.960464477539063e-08 at index (0, 0, 0, 0)
Greatest relative difference: 7.9162417294075e-08 at index (0, 0, 0, 31)

This failure is for torch.float32. The differences seem significant for torch.bfloat16, but this is only due to its abysmal precision in favor of a large value range. All differences are well below the already strict default tolerances of torch.testing.assert_close.
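
For anyone curious, here is a minimal, self-contained way to reproduce the effect with random data (not the actual test inputs):

import torch

torch.manual_seed(0)
x = torch.rand(3, 512, 512, dtype=torch.float32) * 255  # values in [0, 255), like an image in uint8 range

a = x / 255.0          # plain division by the Python scalar
b = x * (1.0 / 255.0)  # multiplication by the precomputed inverse

# the worst-case mismatch is on the order of one float32 ULP of the result
print((a - b).abs().max())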

@datumbox datumbox left a comment

LGTM, thanks!

@pmeier pmeier merged commit 9b0da0c into pytorch:main Nov 4, 2022
@pmeier pmeier deleted the perf-div branch November 4, 2022 09:54
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2022
replace tensor division with scalar division and tensor multiplication (#6903)

Summary:
* replace tensor division with scalar division and tensor multiplication

* fix consistency test tolerances

Reviewed By: NicolasHug

Differential Revision: D41265200

fbshipit-source-id: bb438d768ebe1137f84df559589ad19eb2e0fc9e