A request: generalizing the design of affine transforms #7240

@xvdp

🚀 The feature

Torchvision transforms contain legacy code tied to PIL, which makes the code somewhat cumbersome and limited, and forces special cases.

  1. PIL should be fully split off in some form, maybe with overloads? If legacy support weren't a concern, it ought to be removed completely.
  2. In future versions torchvision should be unified with higher-dimensional vision algorithms.

PIL
On the image I/O side, PIL only handles a limited number of basic formats and not the more interesting ones that support HDR and floating-point data, for instance .exr. While it is true that most data in the wild is .png or .jpg, this is constricting.

An example, within the affine() function in torchvision/transforms/functional.py:

  • The interpolation arg is typed as a legacy PIL enum that contains interpolation modes not supported by the torch interpolate function in torch/nn/functional.py.
    This special case then requires another special case in affine() inside torchvision/transforms/functional_tensor.py, which is not up to date with the torch interpolate function.
    Specifically, the assert on line 611 (as of 2023.02.13, 7074570):
    _assert_grid_transform_inputs(img, matrix, interpolation, fill, ["nearest", "bilinear"])
    Should this not support nearest, area, bilinear and bicubic, without this blocking assert, which looks derived from having to support both PIL.Image and Tensor in the same function?
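To make the request concrete, here is a minimal sketch of a tensor-only affine warp built directly on `torch.nn.functional.affine_grid` and `torch.nn.functional.grid_sample`. The helper name `warp_affine` is hypothetical and is not torchvision's implementation. As far as I can tell, `grid_sample` documents the modes `nearest`, `bilinear` and `bicubic` for 4-D input, while `area` is a mode of `F.interpolate` rather than of `grid_sample`, so the sketch only exercises the first three.

```python
import math

import torch
import torch.nn.functional as F


def warp_affine(img: torch.Tensor, theta: torch.Tensor, mode: str = "bilinear") -> torch.Tensor:
    """Hypothetical tensor-only warp, not torchvision's code.

    img:   (N, C, H, W) float tensor
    theta: (N, 2, 3) affine matrix in normalized [-1, 1] coordinates
    mode:  any mode accepted by F.grid_sample for 4-D input
    """
    grid = F.affine_grid(theta, list(img.shape), align_corners=False)
    return F.grid_sample(img, grid, mode=mode, padding_mode="zeros", align_corners=False)


img = torch.rand(1, 3, 32, 32)
angle = math.pi / 6  # radians, no degree round trip
theta = torch.tensor(
    [[[math.cos(angle), -math.sin(angle), 0.0],
      [math.sin(angle), math.cos(angle), 0.0]]]
)

# The same call path serves all three modes; no PIL enum is involved.
outputs = {m: warp_affine(img, theta, m) for m in ("nearest", "bilinear", "bicubic")}
```

With this shape of API the mode string is passed straight through, so the list of supported modes is whatever `grid_sample` accepts rather than a hard-coded subset.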

Other design choices in the same function ought to be cleaned up as well. center, translate and shear are required to be Lists, which excludes Tensors (what if I get the center from data.mean(axis=...)?). Angles are required in degrees instead of the native radians: if one already has the angle in radians, converting it to degrees and back adds a needless round trip and loss of precision.
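As an illustration of the interface being asked for, the sketch below builds a 2x3 rotation matrix directly from an angle in radians and accepts any two-element iterable as the center (list, tuple, NumPy array or 1-D tensor). The names `rotation_about_center` and `apply_affine` are hypothetical, and the counter-clockwise sign convention in x/y coordinates is an assumption that does not necessarily match torchvision's.

```python
import math
from typing import Iterable, List, Tuple


def rotation_about_center(angle_rad: float, center: Iterable) -> List[List[float]]:
    """Hypothetical helper: 2x3 matrix rotating by angle_rad about center.

    center may be any iterable of two numbers; each element is passed
    through float(), so 0-d tensors and NumPy scalars work too.
    """
    cx, cy = (float(v) for v in center)
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    # p' = R (p - center) + center, folded into a single affine matrix.
    return [
        [c, -s, cx - c * cx + s * cy],
        [s, c, cy - s * cx - c * cy],
    ]


def apply_affine(matrix: List[List[float]], point: Tuple[float, float]) -> Tuple[float, float]:
    """Apply a 2x3 affine matrix to a single (x, y) point."""
    x, y = point
    return (
        matrix[0][0] * x + matrix[0][1] * y + matrix[0][2],
        matrix[1][0] * x + matrix[1][1] * y + matrix[1][2],
    )


m = rotation_about_center(math.pi / 2, (1.0, 1.0))
moved = apply_affine(m, (2.0, 1.0))  # approximately (1.0, 2.0)
```

Because the angle never leaves radians, there is no degrees-to-radians conversion anywhere in the path.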

3d vision
Why 3D as well as 2D? Although 2D has long been the research focus of DL vision, the full ML vision pipeline has always included 3D, along with homogeneous coordinate systems, the widely used full camera matrix with radial and tangential distortion, and the annoying OpenGL projective space.

There is no reason why torchvision transformation code should only support images and not higher dimensional matrix operations.
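The underlying torch primitives are already close to dimension-agnostic: `F.affine_grid` accepts an (N, 2, 3) matrix for 4-D image batches and an (N, 3, 4) matrix for 5-D volume batches, and `F.grid_sample` handles both. The sketch below, with hypothetical helper names `identity_theta` and `warp`, runs one code path over an image and a volume; it is an illustration of the idea, not a proposed torchvision API.

```python
import torch
import torch.nn.functional as F


def identity_theta(ndim: int, batch: int = 1) -> torch.Tensor:
    """Hypothetical helper: identity affine of shape (batch, ndim, ndim + 1)."""
    return torch.eye(ndim, ndim + 1).unsqueeze(0).repeat(batch, 1, 1)


def warp(x: torch.Tensor, theta: torch.Tensor, mode: str = "bilinear") -> torch.Tensor:
    """Warp a 4-D (N, C, H, W) or 5-D (N, C, D, H, W) tensor with one code path."""
    grid = F.affine_grid(theta, list(x.shape), align_corners=False)
    return F.grid_sample(x, grid, mode=mode, align_corners=False)


image = torch.rand(1, 3, 16, 16)       # 2-D case
volume = torch.rand(1, 1, 8, 16, 16)   # 3-D case

warped_image = warp(image, identity_theta(2))
warped_volume = warp(volume, identity_theta(3))
```

The only thing that changes between the two calls is the shape of the matrix, which is the kind of unification the request is about.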

I do understand that removing legacy code is not simple, and yet computer vision is more than uint8 images.

Motivation, pitch

Static images, images in motion, images extracted from a 3D world and images projected into the 3D world have no substantive difference: they are all classical computer vision. The trend to re-unify these spaces is here, from 3D GANs to neural rendering.

Why should one not use pytorch3d instead? It is also cumbersome, in that it again treats 3D as separate from 2D rather than as a continuum within the field of computer vision.

Alternatives

Use drjit?

Additional context

No response
