Keypoint transform #1131
Thanks for opening this issue and the proposal! I have a question about your proposal: what if we want to have both bounding boxes and keypoints as the target for our model, for example in Keypoint R-CNN, as in vision/torchvision/models/detection/keypoint_rcnn.py, lines 20 to 100 at bbd363c?
I believe we would need to extend the function call to take another argument. And if we also want to have masks at the same time, we would be in the business of adding yet another argument. The problem with those generic approaches is that we sometimes do not want to apply all transforms to all data types: for example, we only want to apply color augmentation to the image, not to the segmentation map. This has been extensively discussed in the past, see #9 #230 #533 (and the issues linked therein), and the conclusion for now has been that all the alternatives that have been proposed are either too complicated or not general enough. For that reason, the current recommended way of handling composite transforms is to use the functional interface, which is very generic and gives you full control, at the expense of a bit more code. We have also recently improved the documentation in #602 to make this a bit clearer. Thoughts?
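For illustration (this is not code from the thread), the functional interface allows joint transforms of roughly this shape: the geometric part is applied to both the image and the segmentation map with a single random draw, while the color augmentation touches only the image. Here image and mask are assumed to be PIL images.

import random
import torchvision.transforms.functional as F

def joint_transform(image, mask):
    # One random draw decides the flip for both inputs, keeping them in sync.
    if random.random() < 0.5:
        image = F.hflip(image)
        mask = F.hflip(mask)
    # Color augmentation is applied to the image only, never to the mask.
    image = F.adjust_brightness(image, random.uniform(0.8, 1.2))
    return image, mask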
This shouldn’t be a problem: bounding boxes are usually encoded as keypoints (top-left & bottom-right). You can pass those keypoints to the transforms in order to get the corresponding corners for the boxes. This is easily done using a wrapper like the one in the first message. As I see it, the role of a transform is really just to map a point to another point (single responsibility principle); it’s not supposed to be a full bridge between the dataset object and the model. At the end of the day, everything is a point: bounding boxes are defined by their corners, which are points; images are really just a grid of WxH points; and masks are either an image or a set of keypoints. The only things that are not points are labels, but they don’t get transformed.
The problem right now is that if you use a dataset like VOCDetection and try to augment your data with RandomRotation or RandomPerspective, you get a rotated/squished image, but there is currently no way to transform the bounding boxes, because the transformation parameters are lost once you exit the transform function.
I think the design I presented needs to be updated to take a list of images and a list of keypoints rather than just an image and a list of keypoints, as sometimes what you want is to apply the same transform to multiple images (image & segmentation mask pairs).
ColorJitter is a bit of an edge case, as it does not remap pixel positions but rather modifies their values. I'm not sure how to solve this one. I was thinking about adding a type check: since images are encoded as float tensors and masks should usually be int tensors, we could use that information to decide whether to apply the transform, but that would break in the case where we get a pair of (scene, depth/heat map). Another solution would be to add a third parameter so that each transform would receive …
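On the point that boxes can travel through the transforms as corner keypoints, here is a minimal sketch of that idea (the helper names are illustrative, not from the PR). All four corners are carried through so that an enclosing axis-aligned box can be recovered even after a rotation or perspective warp.

import torch

def box_to_corners(box):
    # (xmin, ymin, xmax, ymax) -> the four corner points of the box
    xmin, ymin, xmax, ymax = box
    return torch.tensor([[xmin, ymin], [xmax, ymin],
                         [xmax, ymax], [xmin, ymax]], dtype=torch.float)

def corners_to_box(points):
    # After a rotation or perspective warp the corners are no longer
    # axis-aligned, so take the enclosing axis-aligned box again.
    xs, ys = points[:, 0], points[:, 1]
    return (xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item())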
This means that you need to wrap your …
The problem with the current set of transformations is that you always flip the image as well, even if all you want is to flip the keypoints (which should be much cheaper than flipping the image). Ideally, those methods should be decoupled, so that one can perform transformations on those data structures alone.
One of the solutions that I have proposed in the past was to have some boxing abstractions, like …
And then, we can have a composite class …
Thoughts?
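The concrete class names in this comment did not survive extraction, so the following is only a rough sketch of what such boxing abstractions and a composite transform could look like (invented names, not the actual proposal):

import random
import torchvision.transforms.functional as F

class BoundingBoxes:
    # Hypothetical boxing abstraction: the data structure knows how to flip
    # itself, so flipping the boxes does not require touching any pixels.
    def __init__(self, boxes, image_width):
        self.boxes = boxes              # list of (xmin, ymin, xmax, ymax)
        self.image_width = image_width

    def hflip(self):
        flipped = [(self.image_width - xmax, ymin, self.image_width - xmin, ymax)
                   for xmin, ymin, xmax, ymax in self.boxes]
        return BoundingBoxes(flipped, self.image_width)

class JointRandomHorizontalFlip:
    # Hypothetical composite transform: draws the coin once and applies the
    # flip to the image and to the box structure passed along with it.
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, image, boxes):
        if random.random() < self.p:
            return F.hflip(image), boxes.hflip()
        return image, boxes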
That's right, the idea was to ship the wrapper along with the dataloader, so that for the end user it just results in a couple of lines. I'm currently trying to re-implement the YOLO paper for learning purposes, and this is what my data loading / data augmentation setup looks like:
transform = transformWrapper.TransformWrapper(torchvision.transforms.Compose([
torchvision.transforms.Resize(512),
torchvision.transforms.RandomHorizontalFlip(),
torchvision.transforms.RandomRotation(10),
torchvision.transforms.RandomPerspective(distortion_scale=.1, p=1),
torchvision.transforms.RandomCrop(448),
torchvision.transforms.ColorJitter(.2, .2, .2, .2),
torchvision.transforms.ToTensor()]))
vocloader = torchvision.datasets.voc.VOCDetection("/home/wilmot_p/DATA/", transforms=transform)
I strongly agree that the keypoint and image transforms should be decoupled. But that means we need a system to share the random parameters. I've seen discussion about re-seeding the RNG before every transform, and even though that would technically work, it feels like bad software design. If we decide to go the decoupled way, the biggest problem we have to solve is RNG synchronisation.
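A common way to share the random parameters without re-seeding is to draw them once via the transform's get_params and then call the functional API on each input; a small sketch is below (for keypoints the same angle would still have to be applied to the coordinates by hand, since the functional API only covers images):

import torchvision.transforms as T
import torchvision.transforms.functional as F

def joint_random_rotation(image, mask, degrees=(-10, 10)):
    # Draw the random angle once, then apply it to both inputs, so the
    # geometry stays in sync without re-seeding the global RNG.
    angle = T.RandomRotation.get_params(degrees)
    return F.rotate(image, angle), F.rotate(mask, angle)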
This looks like it should work, and I like the idea of introducing proper types; it just means a lot more code to write. I'll try to come up with a proof of concept over the weekend, to see how it compares to my earlier proposal in terms of ease of use for the end user 😄
Hi, any updates here?
Hi, is it possible to provide the points transformation as an individual repo, so I can import it as a single library?
Hi PyTorch community 😄
I started working on keypoint transformation (as was requested in #523).
I worked on it in the context of data augmentation for object detection tasks.
I submitted a proposal in PR #1118, but as @fmassa pointed out, that's not something we can merge without reviewing the design choices.
I've implemented the functionality by changing the signature of the transform __call__() method from:
def __call__(self, img):
to
def run(self, img, keypoints):
so that every transform can work on a list of keypoints in addition to the image itself. I went with keypoints because a point is the most basic element: bounding boxes are defined by points, segmentation masks can be defined as points, facial landmarks are keypoints ...
If we have the ability to transform a point, we have the ability to transform anything.
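As a rough sketch of what a transform with the proposed signature could look like (a hypothetical example, not the code from PR #1118), here is a resize that rescales the keypoint coordinates together with the image:

import torchvision.transforms.functional as F

class ResizeWithKeypoints:
    # Hypothetical transform following the proposed signature: it receives
    # the image and the keypoint list and rescales both consistently.
    def __init__(self, size):
        self.size = size                     # (height, width)

    def __call__(self, img, keypoints):
        old_w, old_h = img.size              # PIL images report (width, height)
        img = F.resize(img, self.size)
        new_w, new_h = img.size
        sx, sy = new_w / old_w, new_h / old_h
        keypoints = [(x * sx, y * sy) for x, y in keypoints]
        return img, keypoints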
My goal with that design was to make the data augmentation as straightforward as possible.
I added a wrapper class to transform the XML annotation from VOCDetection to a keypoint list and feed them to the transform pipeline.
This allows for usage as simple as …
And the annotations come out with values corresponding to the resized image.
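For reference, the conversion step such a wrapper performs could look roughly like this (a guess, not the actual wrapper from PR #1118; the dict layout follows what VOCDetection returns after parsing the XML):

def voc_target_to_keypoints(target):
    # Bounding box coordinates live under annotation -> object -> bndbox,
    # stored as strings in the parsed XML dict.
    objects = target["annotation"]["object"]
    if not isinstance(objects, list):   # a lone object may not be wrapped in a list
        objects = [objects]
    keypoints = []
    for obj in objects:
        box = obj["bndbox"]
        # top-left and bottom-right corners become two keypoints per box
        keypoints.append((float(box["xmin"]), float(box["ymin"])))
        keypoints.append((float(box["xmax"]), float(box["ymax"])))
    return keypoints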
The aim of this thread is to bring up other use cases of keypoint transformation that I may not have thought of and that may be incompatible with this design, so that we can make a sensible design decision that works for everyone. So if you have an opinion on this matter, please share 😄
Currently, one of the drawbacks of my design is that I broke the interface for Lambda: it used to take only the image as an input parameter, but it now takes the image and the keypoint list, and that breaks backward compatibility.
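To make the compatibility break concrete (a sketch, not code from the PR): a Lambda written for the current single-argument interface stops working once the callable is also handed the keypoint list.

from torchvision import transforms

# Current interface: the callable receives only the image.
old_style = transforms.Lambda(lambda img: img)

# Proposed interface (hypothetical): the callable would also receive the
# keypoint list and return both, so old single-argument lambdas break.
new_style = transforms.Lambda(lambda img, keypoints: (img, keypoints))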