Adding support of Video to remaining Transforms and Kernels #6724
Conversation
From #6667 (comment): "Plus, why are we only supporting images and videos in …?"

@pmeier Yes, we should add MixUp/CutMix. The PR is still in progress; I'll update it with more commits soon. Concerning FiveCrop/TenCrop: these are usually inference transforms used in classification. As discussed previously, supporting detection/segmentation would require updating the labels/bboxes, not just the images, so we would need a BC break to support them. Extending support to them would require additional discussion. This PR aims to ensure that V2 supports all Video-compatible kernels that exist in V1.

@pmeier As discussed offline, to unblock other pending PRs and reduce conflicts, we'll add MixUp/CutMix in a follow-up.
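For context, a MixUp-style augmentation blends each sample in a batch with a randomly chosen partner and mixes the one-hot labels with the same coefficient. A minimal, torchvision-independent sketch of the idea (the function names and list-based "tensors" below are illustrative, not the V2 API):

```python
import random

def mixup_pair(x1, x2, lam):
    # Convex combination of two samples; real code would blend image or
    # video tensors, here we use flat lists of floats for illustration.
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]

def mixup_batch(batch, one_hot_labels, alpha=0.2):
    # Pair each sample with a shuffled partner, then mix inputs and labels
    # with the SAME lambda -- that coupling is the essence of MixUp.
    lam = random.betavariate(alpha, alpha)
    partners = list(range(len(batch)))
    random.shuffle(partners)
    mixed_x = [mixup_pair(batch[i], batch[j], lam) for i, j in enumerate(partners)]
    mixed_y = [mixup_pair(one_hot_labels[i], one_hot_labels[j], lam) for i, j in enumerate(partners)]
    return mixed_x, mixed_y
```

Because each mixed label is a convex combination of one-hot rows, every mixed label still sums to 1.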
Some minor comments inline. Otherwise LGTM if CI is green.
```python
... def forward(self, sample: Tuple[Tuple[Union[features.Image, features.Video], ...], features.Label]):
...     images_or_videos, labels = sample
...     batch_size = len(images_or_videos)
...     images_or_videos = features.Image.wrap_like(images_or_videos[0], torch.stack(images_or_videos))
```
Suggested change:

```diff
- images_or_videos = features.Image.wrap_like(images_or_videos[0], torch.stack(images_or_videos))
+ image_or_video = images_or_videos[0]
+ images_or_videos = type(image_or_video).wrap_like(image_or_video, torch.stack(images_or_videos))
```
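The point of the suggestion is that hard-coding `features.Image.wrap_like` would mislabel a batch of videos as an image, while dispatching on `type(...)` preserves the subclass. A minimal, torchvision-independent sketch of the pattern (the `Feature`/`Image`/`Video` classes below are hypothetical stand-ins for the prototype feature classes):

```python
class Feature:
    def __init__(self, data):
        self.data = data

    @classmethod
    def wrap_like(cls, other, data):
        # Wrap raw data as feature type `cls` (simplified stand-in for the
        # prototype wrap_like classmethod).
        return cls(data)

class Image(Feature): pass
class Video(Feature): pass

videos = [Video([1, 2]), Video([3, 4])]
first = videos[0]
stacked = [v.data for v in videos]  # stand-in for torch.stack

wrong = Image.wrap_like(first, stacked)        # always an Image: mislabels videos
right = type(first).wrap_like(first, stacked)  # dispatches on the input's type

print(type(wrong).__name__, type(right).__name__)  # Image Video
```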
Sorry, I force-pushed and lost this change. Let me reapply it.
```python
) -> Tuple[
    features.ImageOrVideoType,
    features.ImageOrVideoType,
    features.ImageOrVideoType,
    features.ImageOrVideoType,
    features.ImageOrVideoType,
]:
```
Feel free to ignore if mypy is happy. This is not accurate: we don't have `Tuple[features.ImageOrVideoType, ...]` here, but rather `Union[Tuple[features.ImageType, ...], Tuple[features.VideoType, ...]]`. Meaning, the type will not vary inside the returned tuple; we either get a tuple of images or a tuple of videos.
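The distinction matters because `Tuple[Union[Image, Video], ...]` admits mixed tuples, whereas `Union[Tuple[Image, ...], Tuple[Video, ...]]` forces homogeneity. A small runtime illustration (the `Image`/`Video` classes and the checker are stand-ins, not torchvision code):

```python
from typing import Tuple, Union

class Image: pass
class Video: pass

# Admits mixed tuples such as (Image(), Video()):
Mixed = Tuple[Union[Image, Video], ...]

# Homogeneous: all images or all videos, never a mix -- which matches what
# the crop transforms actually return.
Homogeneous = Union[Tuple[Image, ...], Tuple[Video, ...]]

def is_homogeneous(t):
    # Runtime check corresponding to the Homogeneous annotation.
    return len({type(x) for x in t}) <= 1

print(is_homogeneous((Image(), Image())))  # True
print(is_homogeneous((Image(), Video())))  # False
```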
This is why, IMO, we should avoid `features.ImageOrVideoType` and instead define it as `Union[Image, Video]`. Since this will be fixed in a follow-up, there is no point messing with mypy (which is happy) here. I'll implement this in a follow-up.
```python
@@ -55,6 +55,10 @@ def get_spatial_size_image_pil(image: PIL.Image.Image) -> List[int]:
    return [height, width]


# TODO: Should we have get_spatial_size_video here? How about masks/bbox etc? What is the criterion for deciding when
```
`get_spatial_size` should apply to everything, right? That was the whole reason we extracted it out: bounding boxes and masks can provide this information, while `num_channels` is reserved for images and videos.
I've already updated `get_spatial_size` to handle all inputs.

I think you are trying to answer a different question from the one I'm asking here. What I think we should discuss is whether there should be specific kernels for each type, unrelated to whether the dispatcher can handle everything. We already have kernels (like `erase_video`) that aren't necessarily used in the dispatcher. So here I'm asking: what should the convention be for providing kernels for individual types?
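As a sketch of why `get_spatial_size` can cover every type while `num_channels` cannot: all of these features carry an (H, W), but only images and videos have a channel dimension. The classes and dispatcher below are hypothetical stand-ins, not the torchvision implementation:

```python
class Image:
    def __init__(self, h, w, channels=3):
        self.spatial_size, self.num_channels = (h, w), channels

class Video(Image): pass

class BoundingBox:
    def __init__(self, coords, h, w):
        # Boxes know the size of the canvas they live on, but have no channels.
        self.coords, self.spatial_size = coords, (h, w)

class Mask:
    def __init__(self, h, w):
        self.spatial_size = (h, w)

def get_spatial_size(inpt):
    # Every feature type can report its spatial size...
    return list(inpt.spatial_size)

def get_num_channels(inpt):
    # ...but channels only make sense for images and videos.
    if not isinstance(inpt, Image):
        raise TypeError(f"no channel dimension for {type(inpt).__name__}")
    return inpt.num_channels

print(get_spatial_size(BoundingBox([0, 0, 5, 5], 32, 64)))  # [32, 64]
```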
Ah, sorry, yes I was confused. That is a good question and I don't have an answer for it yet. My gut says that we should stay consistent and provide the kernels just as we do for the other transforms.
Same feeling here. I'll leave the TODO for the follow-up. I think we can answer this on the PR where we switch `image_size` to `spatial_size`.
force-pushed from ef29a77 to 9cbec19
force-pushed from c9e876b to 11bc8c2
```python
tmp = (inpt.wrap_like(inpt, item) for item in output)  # type: ignore[arg-type]
output = tmp  # type: ignore[assignment]
```
Sorry for the delay on merging. JIT was driving me nuts:

```
RuntimeError: expected type comment but found 'ident' here:
output = (inpt.wrap_like(inpt, item) for item in output)  # type: ignore[assignment,arg-type]
                                                          ~~~~~~ <--- HERE
```

Something breaks if the ignore comment contains both `assignment` and `arg-type` directives. I suspect JIT might be looking for an `assignment` directive and completely breaks when we apply the mypy workaround. This idiom passes both JIT and mypy.
…6724)

Summary:
* Adding support of Video to missed Transforms and Kernels
* Fixing Grayscale Transform.
* Fixing FiveCrop and TenCrop Transforms.
* Fix Linter
* Fix more kernels.
* Add `five_crop_video` and `ten_crop_video` kernels
* Added a TODO.
* Missed Video isinstance
* nits
* Fix bug on AugMix
* Nits and TODOs.
* Reapply Philip's recommendation
* Fix mypy and JIT
* Fixing test

Reviewed By: NicolasHug
Differential Revision: D40427468
fbshipit-source-id: e7f699aee86b80ea3f614dc4e09ae1aaf22fc37d
Builds upon #6667

This PR ensures that the following have support for Video:

* Transforms: `Grayscale`, `RandomGrayscale`, `FiveCrop`, `TenCrop`, `ConvertImageDtype`
* Dispatchers: `rgb_to_grayscale`, `get_image_size`, `five_crop`, `ten_crop`
* New kernels: `five_crop_video`, `ten_crop_video`

Pending: `MixUp` and `CutMix`, which might require additional thought on how to implement them.
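The video variants of the crop kernels fall out naturally because the crops act only on the spatial dims, so a video kernel can reuse the image kernel per frame (a tensor implementation would just slice the trailing H, W dims of a (T, C, H, W) tensor). A hedged, list-based sketch; the function names mirror but are not the actual torchvision kernels:

```python
def crop(img, top, left, h, w):
    # img is a 2D grid (list of rows); slice out an h x w window.
    return [row[left:left + w] for row in img[top:top + h]]

def five_crop_image(img, size):
    # Four corner crops plus a center crop, as in the classic FiveCrop.
    H, W = len(img), len(img[0])
    h = w = size
    tl = crop(img, 0, 0, h, w)
    tr = crop(img, 0, W - w, h, w)
    bl = crop(img, H - h, 0, h, w)
    br = crop(img, H - h, W - w, h, w)
    center = crop(img, (H - h) // 2, (W - w) // 2, h, w)
    return tl, tr, bl, br, center

def five_crop_video(video, size):
    # Apply the image kernel frame by frame, then regroup so the result is
    # one entry per crop position, each holding all frames.
    crops = [five_crop_image(frame, size) for frame in video]
    return tuple(list(pos) for pos in zip(*crops))

# A "video" of 2 frames, each a 4x4 grid.
video = [[[r * 10 + c for c in range(4)] for r in range(4)] for _ in range(2)]
tl, tr, bl, br, center = five_crop_video(video, 2)
print(len(tl), len(tl[0]), len(tl[0][0]))  # 2 2 2 (2 frames of 2x2 crops)
```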