[NOMERGE] Review new Transforms API #5500
Conversation
```diff
@@ -39,6 +39,7 @@ def __new__(
     def to_format(self, format: Union[str, BoundingBoxFormat]) -> BoundingBox:
         # TODO: this is useful for developing and debugging but we should remove or at least revisit this before we
         # promote this out of the prototype state
+        # Do we still need this?
```
Need, no. Want, yes. I actually want to keep them after this is promoted out of prototype. It is a convenient way to work with the features outside of a training pipeline. Compare, for example,

```python
from torchvision.prototype import features, transforms

transform = transforms.ConvertBoundingBoxFormat(format="xyxy")
bounding_box = features.BoundingBox(...)
xyxy_bounding_box = transform(bounding_box)
```

or

```python
from torchvision.prototype import features
from torchvision.prototype.transforms import functional as F

bounding_box = features.BoundingBox(...)
xyxy_bounding_box = features.BoundingBox.new_like(
    bounding_box,
    F.convert_bounding_box_format(bounding_box, old_format=bounding_box.format, new_format="xyxy"),
    format="xyxy",
)
```

to

```python
from torchvision.prototype import features

bounding_box = features.BoundingBox(...)
xyxy_bounding_box = bounding_box.to_format("xyxy")
```

Still, I agree, the lazy import needs to go. My proposal is to move the actual conversion code from `transforms.functional` to a private function like `_convert_bounding_box_format` in this module. We could use it here in `to_format` and simply expose it in `transforms.functional`.

This comment applies to all other comments that ask the same.
I agree that the verbosity of option 2 is an indication that we need a utility. It doesn't necessarily mean that the method should be part of `BoundingBox`, so let's explore options. Note that we should avoid the use of strings in favour of their enums in our examples and internal code to get better static analysis.

This comment applies to all other comments that ask the same.
The two visualization methods `show()` and `draw_bounding_box()` on `Image` are not the same case, though. We should remove them, as we don't necessarily want to push PIL on the new API.
> Note that we should avoid the use of strings in favour of their enums in our examples and internal code to get better static analysis.

Internally I agree 100%, but do we actually want to discourage users from using them in application code? `"xyxy"` is more concise than `BoundingBoxFormat.XYXY` and potentially needs one less import.
We have previously made the decision to move with enums (you can see this on multiple APIs including io, multi-weights etc.). For consistency we should continue this here.

Where we agree is that people should be allowed to pass strings if they want to. I want to discourage them from doing so by offering all the examples and internal code using enums, but if they choose to pass strings, so be it. I just don't want to promote this usage. Note that this is true for virtually all places where we have enums.
As discussed offline, quietly supporting strings is still a good idea, but for everything else like documentation or examples we should use the enums.
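For illustration, the two spellings side by side, in a sketch reusing `ConvertBoundingBoxFormat` from the example above (and assuming `BoundingBoxFormat` is exposed under `features`, as the diff suggests):

```python
from torchvision.prototype import features, transforms
from torchvision.prototype.features import BoundingBoxFormat

# Documented / internal style: enums, for better static analysis.
transform = transforms.ConvertBoundingBoxFormat(format=BoundingBoxFormat.XYXY)

# Quietly supported for application code: plain strings.
transform = transforms.ConvertBoundingBoxFormat(format="xyxy")
```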
```python
# On io, we currently support Alpha transparency for Grayscale and RGB. We also support palette for PNG.
# Our enum must know about these. We should also allow for users to strip out the alpha transparency which is
# currently not supported by any colour transform.
```
> We also support palette for PNG.

Could you add a link for that? Otherwise, we are talking about this enum, correct?

`vision/torchvision/io/image.py`, lines 16 to 31 in `95d4189`:

```python
class ImageReadMode(Enum):
    """
    Support for various modes while reading images.

    Use ``ImageReadMode.UNCHANGED`` for loading the image as-is,
    ``ImageReadMode.GRAY`` for converting to grayscale,
    ``ImageReadMode.GRAY_ALPHA`` for grayscale with transparency,
    ``ImageReadMode.RGB`` for RGB and ``ImageReadMode.RGB_ALPHA`` for
    RGB with transparency.
    """

    UNCHANGED = 0
    GRAY = 1
    GRAY_ALPHA = 2
    RGB = 3
    RGB_ALPHA = 4
```
I'll send a PR to include `RGBA` and `GRAYSCALE_ALPHA` in the enum and also provide conversion functions for them.
Looking into this, conversion from RGBA to RGB is more complicated. For a proper conversion we need to know the background color. We could assume white which will probably work for most cases, but we probably should discuss this first.
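For reference, a minimal sketch of the alpha compositing this would involve, assuming float images with values in `[0, 1]`, channels-first layout, and the white-background assumption above; the helper name is hypothetical:

```python
import torch

def rgba_to_rgb(image: torch.Tensor, background: float = 1.0) -> torch.Tensor:
    # Hypothetical helper: composite an RGBA image over a solid background
    # color. background=1.0 corresponds to assuming white.
    rgb, alpha = image[..., :3, :, :], image[..., 3:, :, :]
    return rgb * alpha + background * (1.0 - alpha)
```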
Palettes are supported if the user reads with `IMAGE_READ_MODE_UNCHANGED`. The outcome will be an image with a single channel that has integer IDs for every colour. This is ideal for reading masks that have integer IDs for the classes.

Effectively, palettes look like grayscale images, but with their values being IDs on the 0-255 scale. Given they can be useful only for niche applications, I don't feel strongly about supporting them on this API.
Paletted images can also be used to store segmentation masks, e.g. VOC.
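To illustrate the workflow described above (the path is a placeholder):

```python
import torch
from torchvision.io import read_image, ImageReadMode

# Reading a paletted PNG (e.g. a VOC segmentation mask) unchanged keeps the
# palette indices: a single-channel uint8 tensor of per-pixel class IDs.
mask = read_image("path/to/voc_mask.png", mode=ImageReadMode.UNCHANGED)
print(torch.unique(mask))  # the class IDs present in the mask
```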
```python
# 1. Deprecated methods which have equivalents on the new API (_legacy.py?):
# - Grayscale, RandomGrayscale: use ConvertImageColorSpace
```
👍
```python
# - Pad, RandomCrop, RandomHorizontalFlip, RandomVerticalFlip, RandomPerspective, FiveCrop, TenCrop, ColorJitter,
#   RandomRotation, RandomAffine, GaussianBlur, RandomInvert, RandomPosterize, RandomSolarize, RandomAdjustSharpness,
#   RandomAutocontrast, RandomEqualize, LinearTransformation (must be added)
```
👍 For everything but `FiveCrop` and `TenCrop`. They are outliers in the sense that they take one input and produce multiple outputs. This kind of structure would be suited a lot better to a datapipe than to a transform.

This is also reinforced by the fact that these transforms can only be put at the end of a pipeline with the old API, given that the output type is not compatible with all the other transforms. Even then, using such a transform would most likely need a custom collate function.

I couldn't find any usage in our reference scripts. Does someone know how exactly they are used, so I can propose something more concrete here?
The problem is that if we want to roll out the API with BC support, we must include implementations for all previous methods and offer alternatives for those that would be placed in a deprecated state. Though I understand the challenge, excluding these methods is not OK, and we should flag if others like these exist.

The specific transforms are used to produce multiple crops for inference in classification. It is a common trick (especially in the early days of the ImageNet competition) to boost the accuracy of the model by averaging the responses over multiple crops. You can see an example of how this is used here.
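For reference, roughly how the multi-crop trick is typically wired up with the old API (a sketch following the pattern from the `FiveCrop` docstring; the model is just a stand-in):

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as F

# FiveCrop returns a tuple of 5 crops; stack them into a single tensor.
transform = transforms.Compose([
    transforms.FiveCrop(224),
    transforms.Lambda(lambda crops: torch.stack([F.to_tensor(c) for c in crops])),
])

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))  # stand-in classifier
batch = transform(Image.new("RGB", (256, 256))).unsqueeze(0)  # (1, 5, 3, 224, 224)

bs, ncrops, c, h, w = batch.shape
logits = model(batch.view(-1, c, h, w))               # (bs * ncrops, num_classes)
avg_logits = logits.view(bs, ncrops, -1).mean(dim=1)  # average over the crops
```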
```python
# - PILToTensor, ToPILImage (_legacy.py?)
```
Shouldn't we also have a more general `ConvertImageType` transform here that takes one of `"pil"`, `"vanilla_tensor"`, or `"feature"` as input? That would follow the same approach that we have for the other conversions.
Here is what I mean. Let's say we are close to releasing the prototype and we have achieved a significant level of BC. On day 0, we will need a PILToTensor to exist. These are likely to be in a deprecated state, with warnings informing the user that this functionality will be removed and they should use X. Your proposal seems like a valid X alternative.

So here I think we need to figure out 2 things:

- What we do with these legacy classes, and provide their implementations on the new API (or aliases). Make sure that the classes continue to work along with the other non-deprecated classes in existing pipelines.
- Provide an alternative class that offers the new functionality in the new API.
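For concreteness, here is a rough sketch of what such a transform could look like. Everything here is hypothetical: the class name and the output-type spellings come from the comment above, and it assumes `features.Image` can wrap a plain tensor, as in the earlier examples:

```python
from typing import Union

import PIL.Image
import torch
from torchvision.prototype import features
from torchvision.transforms import functional as F

class ConvertImageType(torch.nn.Module):
    def __init__(self, output_type: str) -> None:
        super().__init__()
        self.output_type = output_type  # "pil", "vanilla_tensor", or "feature"

    def forward(self, image: Union[PIL.Image.Image, torch.Tensor]):
        if self.output_type == "pil":
            return image if isinstance(image, PIL.Image.Image) else F.to_pil_image(image)
        tensor = F.pil_to_tensor(image) if isinstance(image, PIL.Image.Image) else image
        if self.output_type == "vanilla_tensor":
            return tensor.as_subclass(torch.Tensor)  # strip any feature subclass
        return features.Image(tensor)  # self.output_type == "feature"
```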
```python
# Question: Why do we have both this and a ToDtype Transform? Is this due to BC? Ideally we could move people off
# from methods that did an implicit normalization of the values (like this one, or to_tensor). cc @vfdev-5
# If that's the case, we should move to _legacy and add deprecation warnings from day one to push people to use
# the new methods.
```
One idea that @datumbox and I discussed a while back is to abolish the close correspondence between the dtype and the value range for images. Right now we assume the range `[0, 1]` for floating point images and `[0, torch.iinfo(dtype).max]` for integral images. A possible solution would be to have a `value_range` metadata field on the `features.Image` class. It can be set with the defaults from above, so for a regular use case the user won't feel any difference.

For example, let's say we have a uint8 image:

```python
import torch
from torchvision.prototype import features

image = features.Image(torch.randint(0, 256, (3, 128, 128), dtype=torch.uint8))
```

By default, `print(image.value_range)` would print `(0, 255)`. If we now convert the dtype with `image = image.to(torch.float32)`, printing the value range again would give `(0.0, 255.0)` rather than the `(0.0, 1.0)` that happens through `ConvertImageDtype`.

This means all transformations that implicitly rely on the value range, like `Normalize`, need to be adapted to use the explicit value instead, but we could get rid of `ConvertImageDtype` altogether.
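A rough sketch of what carrying that metadata could look like (entirely hypothetical; a real `features.Image` would also need to propagate `value_range` through operations like `.to()`, which this toy version does not do):

```python
from typing import Optional, Tuple

import torch

class Image(torch.Tensor):
    value_range: Tuple[float, float]

    def __new__(cls, data, *, value_range: Optional[Tuple[float, float]] = None):
        image = torch.as_tensor(data).as_subclass(cls)
        if value_range is None:
            # Default to today's dtype-based convention.
            if image.is_floating_point():
                value_range = (0.0, 1.0)
            else:
                value_range = (0, torch.iinfo(image.dtype).max)
        image.value_range = value_range
        return image

image = Image(torch.randint(0, 256, (3, 128, 128), dtype=torch.uint8))
print(image.value_range)  # (0, 255)
```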
I absolutely love this and 100% agree we should do it. I've previously asked @vfdev-5 to create an issue and see which methods of the low-level kernels should be expanded to receive a `max_value`. We can write this in a BC manner on the low-level kernels: if the `max_value` is not passed, we do the previous behaviour. Tensors will continue making this assumption, but Images shouldn't.

I feel this is a critical part of the proposal that needs to be investigated prior to implementing all the other transforms. This is probably one of the few remaining uninvestigated bits of the proposal.
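A sketch of what that BC-friendly kernel signature could look like, using invert as an example (the extra `max_value` parameter is hypothetical):

```python
from typing import Optional

import torch

def invert_image_tensor(image: torch.Tensor, max_value: Optional[float] = None) -> torch.Tensor:
    # BC-friendly sketch: if max_value is not passed, fall back to the current
    # dtype-based assumption; an Image caller would pass its explicit range.
    if max_value is None:
        max_value = 1.0 if image.is_floating_point() else float(torch.iinfo(image.dtype).max)
    return (max_value - image).to(image.dtype)
```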
FYI #5502
```diff
@@ -63,3 +69,13 @@
 )
 from ._misc import normalize_image_tensor, gaussian_blur_image_tensor
 from ._type_conversion import decode_image_with_pil, decode_video_with_av, label_to_one_hot
+
+# What are the migration plans for public methods without new API equivalents? There are two categories:
```
@pmeier Reminder for the migration plan on these as well. It might already be on your radar as part of the Transform Classes.
Adding comments for the proposed new Transforms API as a whole. Some of the comments need to be addressed now, others are FYI, and others can be tackled by opening issues to track the work.

There are a few low-hanging-fruit improvements that can be done immediately in separate PRs without much discussion. Others definitely require more thought and discussion, so for these I propose having a chat offline and then summarizing here.

cc @pmeier @vfdev-5