Enforce contiguous outputs on the transforms v2 kernels? #6839
IIRC same for …
I'll benchmark them in a bit, but first maybe another question: if we know that these kernels are sensitive to contiguity, should they know how to handle the situation? Meaning, as opposed to what is proposed above, we would shift the burden to the kernels that "need" contiguous inputs instead of enforcing contiguous outputs everywhere.
Do you mean it should be up to the kernel to convert its input if that makes it run faster? I don't know if this is worth doing; same for the global enforcement of contiguity, BTW. While it may speed up that one particular kernel, it can still make all the other kernels in the rest of the pipeline slower.
Yes.
Agreed. I need to benchmark whether this has an impact or not. Otherwise there is no need to enforce one or the other.
Preserving `memory_format` (contiguous or channels_last) can be quite important. If there are perf implications of preserving the memory format by default (which could avoid some copies later on), some ops may accept an additional `memory_format` arg?
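For context on why this matters: a tensor in channels_last memory format is *not* contiguous in the default sense, so a blanket `.contiguous()` call silently converts it back to the default NCHW layout and pays for a copy. A minimal illustration:

```python
import torch

# A 4D NCHW tensor moved to channels_last keeps its shape, but the
# underlying strides are permuted into NHWC order.
x = torch.rand(1, 3, 32, 32).to(memory_format=torch.channels_last)

print(x.is_contiguous())                                   # False
print(x.is_contiguous(memory_format=torch.channels_last))  # True

# A plain .contiguous() call copies the data back into the default
# NCHW layout, i.e. it discards the channels_last format.
y = x.contiguous()
print(y.is_contiguous())                                   # True
print(y.is_contiguous(memory_format=torch.channels_last))  # False
```

This is why a global "always return contiguous outputs" policy would interact badly with pipelines that deliberately use channels_last.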
As stated by @NicolasHug in #6839 (comment), …

Script:

```python
import functools
import itertools

import torch
from torch.utils import benchmark

from torchvision.io import read_image
from torchvision.prototype.transforms import functional as F

image_other = read_image("test/assets/encode_jpeg/grace_hopper_517x606.jpg").float()
# 4D tensor in channels_last memory format
image_channels_last = image_other.unsqueeze(0).to(memory_format=torch.channels_last)
image_contiguous_format = image_other.contiguous()

images = [
    (image_contiguous_format, "contiguous_format"),
    (image_channels_last, "channels_last"),
    (image_other, f"other, stride={image_other.stride()}"),
]

fns = [
    (
        functools.partial(
            F.normalize_image_tensor, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
        ),
        "normalize",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.BILINEAR
        ),
        "resize BILINEAR",
    ),
    (
        functools.partial(
            F.resize_image_tensor, size=256, interpolation=F.InterpolationMode.NEAREST
        ),
        "resize NEAREST",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.BILINEAR,
        ),
        "affine BILINEAR",
    ),
    (
        functools.partial(
            F.affine_image_tensor,
            angle=0.0,
            translate=[0.0, 0.0],
            scale=1.0,
            shear=[0.0, 0.0],
            interpolation=F.InterpolationMode.NEAREST,
        ),
        "affine NEAREST",
    ),
]

measurements = [
    benchmark.Timer(
        stmt="fn(image)",
        globals=dict(fn=fn, image=image),
        label="impact of input contiguity",
        sub_label=name,
        description=memory_format,
    ).blocked_autorange(min_run_time=5)
    for (fn, name), (image, memory_format) in itertools.product(fns, images)
]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
```
My conclusion here is that we should not enforce contiguity globally, but rather identify the kernels that benefit from it and enforce it there. One option to do that is to run our benchmark again (#6818), but this time for inputs in the … As for …
I agree. This can be added as an implementation detail / performance hack on the specific identified kernels. We can then easily remove the workarounds if Core optimizes the kernels.
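Such a per-kernel workaround could look like the following sketch. Note that `blur_kernel` is a hypothetical stand-in, not actual torchvision code:

```python
import torch


def blur_kernel(image: torch.Tensor) -> torch.Tensor:
    # Hypothetical kernel that is known to be slow on noncontiguous input.
    # The copy below is an internal performance hack local to this kernel;
    # it can be removed again if/when the underlying op is optimized for
    # strided inputs, without touching the rest of the API.
    if not image.is_contiguous():
        image = image.contiguous()
    return image * 0.5  # stand-in for the actual computation
```

Callers never see the copy: the contiguity handling stays an implementation detail of the one kernel that needs it.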
I ran the functional API benchmark from #6818 again with a few tweaks:
full log
Here are the kernels that are significantly slowed down by noncontiguous inputs: …

The only counterexample that we have is …

Next, I'll check our pipelines to see if and where noncontiguity comes into play.
All the performance benchmarks that we ran so far for transforms v1 vs. v2 were on contiguous inputs. However, we have a few kernels that leave the output in a noncontiguous state:
- `affine_image_tensor` in case `image.numel() > 0 and image.ndim == 4 and fill is not None`
- `convert_color_space` in case we only strip the alpha channel, i.e. `RGB_ALPHA -> RGB` and `GRAY_ALPHA -> GRAY`
- `rotate_image_tensor` in case `image.numel() > 0 and image.ndim == 4 and fill is not None`
- `crop_image_tensor`
- `center_crop_image_tensor`
- `five_crop_image_tensor`
- `ten_crop_image_tensor`

If applicable, the same is also valid for the `*_mask` and `*_video` kernels, since they are thin wrappers around the `*_image_tensor` ones.

We should benchmark, at least for a few kernels, whether noncontiguous inputs cause a performance degradation that is larger than the cost of enforcing contiguous outputs on the kernels above. If so, we should probably enforce contiguous outputs on our kernels.
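To illustrate why the crop family of kernels ends up with noncontiguous outputs: cropping boils down to plain tensor indexing, which returns a strided view of the parent buffer rather than a fresh copy. A minimal sketch (not the actual `crop_image_tensor` implementation):

```python
import torch

image = torch.rand(3, 64, 64)

# Cropping via indexing creates a view that reuses the parent's strides,
# so the rows of the crop are no longer adjacent in memory.
crop = image[:, 8:40, 8:40]
print(crop.is_contiguous())  # False

# Enforcing contiguity on the output means paying for a copy here.
crop = crop.contiguous()
print(crop.is_contiguous())  # True
```

Whether that copy is worth it is exactly the trade-off the benchmark above is meant to settle: the copy is pure overhead unless a downstream kernel is slowed down by the strided input by more than the cost of the copy.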
cc @vfdev-5 @datumbox @bjuncek