A request: generalizing the design of affine transforms #7240
Hey @xvdp I tried to understand your points, but I'm not sure I got the spirit. To be frank, the comment reads more like a general rant than a feature request. I'll focus on a few points below that touch on affine transformations, per the title of the issue.
PIL is a core part of the API and not legacy in any way. We also don't support any of the PIL-specific interpolation modes in the tensor backend; the three interpolation modes in vision/torchvision/transforms/functional.py, lines 31 to 34 in af04819, are just added on top to enable selecting these interpolation modes when using PIL.
You are not restricted to using PIL. Plus, the tensor backend of our transformations fully supports floating-point images. Be aware that the implicit assumption is that values are in the [0, 1] range.
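For instance, a minimal sketch of the tensor path with a floating-point image (the image and parameter values here are made up for illustration):

```python
import torch
import torchvision.transforms.functional as F
from torchvision.transforms import InterpolationMode

# A float CHW image; the tensor backend assumes float values lie in [0.0, 1.0].
img = torch.rand(3, 256, 256)

# This runs entirely through the tensor backend, no PIL involved.
out = F.affine(
    img,
    angle=30.0,
    translate=[0, 0],
    scale=1.0,
    shear=[0.0, 0.0],
    interpolation=InterpolationMode.BILINEAR,
)
```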
You can just call `.tolist()` on the tensor (see the sketch after the summary below).
That works both ways. What if you have degrees and we require radians? To summarize some of the suggestions:
- support HDR and floating-point formats such as .exr on the io side
- lift the nearest/bilinear restriction in the tensor affine path
- accept Tensors (and radians) for angle, center, translate, and shear
- extend the transforms beyond 2d images toward 3d and homogeneous coordinates
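A sketch of both caller-side conversions mentioned above (`.tolist()` for tensor arguments, radians-to-degrees for the angle), with made-up values:

```python
import math

import torch
import torchvision.transforms.functional as F
from torchvision.transforms import InterpolationMode

img = torch.rand(3, 256, 256)
angle_rad = 0.5                          # angle held in radians
center_t = torch.tensor([128.0, 128.0])  # a center that came out of tensor math

out = F.affine(
    img,
    angle=math.degrees(angle_rad),  # radians to degrees on the caller side
    translate=[0, 0],
    scale=1.0,
    shear=[0.0, 0.0],
    center=center_t.tolist(),       # Tensor to List via .tolist()
    interpolation=InterpolationMode.BILINEAR,
)
```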
🚀 The feature
Torchvision transformations contain legacy code related to PIL, making the code somewhat cumbersome, limited, and full of special cases.
PIL
On the image io side, PIL only handles a limited number of basic formats, and not the more interesting ones that support HDR and floating-point data, for instance .exr. While it is true that most data in the wild is .png or .jpg, this is constricting.
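For context, a sketch of how floating-point HDR data such as .exr can be read today with a third-party reader and moved into a tensor; this assumes imageio with an EXR-capable plugin installed, and the file name is a placeholder:

```python
import imageio.v3 as iio
import torch

# "render.exr" is a placeholder path; EXR reading in imageio requires an
# OpenEXR/FreeImage plugin to be available in the environment.
arr = iio.imread("render.exr")                 # float HWC array, values unbounded (HDR)
img = torch.from_numpy(arr).permute(2, 0, 1)   # to a CHW float tensor
```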
An example is the affine() function in torchvision/transforms/functional.py (compare torch/nn/functional.py): the PIL special case there then requires another special case in the affine() inside torchvision/transforms/functional_tensor.py, which is not up to date with the torch interpolate function.
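Paraphrased, not verbatim, the dispatch pattern in question looks roughly like the following; `_pil_affine` and `_tensor_affine` are stand-in names, not the real internal functions:

```python
import torch

def _pil_affine(img, matrix, interpolation, fill):
    """Stand-in for the PIL backend (its own resampling enums, uint8-centric)."""
    ...

def _tensor_affine(img, matrix, interpolation, fill):
    """Stand-in for the tensor backend in functional_tensor.py."""
    ...

def affine(img, matrix, interpolation, fill):
    # The special case: PIL inputs take one code path, tensors another,
    # and each path carries its own restrictions.
    if not isinstance(img, torch.Tensor):
        return _pil_affine(img, matrix, interpolation, fill)
    return _tensor_affine(img, matrix, interpolation, fill)
```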
Specifically, the assert on line 611 (as of 2023.02.13, commit 7074570):
_assert_grid_transform_inputs(img, matrix, interpolation, fill, ["nearest", "bilinear"])
Should this not support nearest, area, bilinear, and bicubic, without this blocking assert, which looks like it derives from having to support both PIL.Image and Tensor in the same function?
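For reference, torch.nn.functional.interpolate itself already accepts all four of those modes on 4-D inputs; a quick check:

```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 64, 64)
for mode in ("nearest", "area", "bilinear", "bicubic"):
    # align_corners may only be passed for the linear/cubic modes.
    kwargs = {"align_corners": False} if mode in ("bilinear", "bicubic") else {}
    y = F.interpolate(x, scale_factor=2.0, mode=mode, **kwargs)
    print(mode, tuple(y.shape))
```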
There are other design choices that ought to be cleaned up, such as the same function requiring Lists for center, translate, and shear (excluding Tensors? what if I get the center from data.mean(axis=...)? and so on), or angles being required in degrees instead of the native radians: if one has the angle in radians, one incurs a useless precision loss in the conversion and reconversion.
3d vision
Why separate 3d and 2d? Even though 2d has been a longtime research topic for DL vision, the full ML vision pipeline has always included 3d, as well as homogeneous coordinate systems, the most-used basic full camera matrix with radial and tangential distortion, and the annoying OGL projective space.
There is no reason why torchvision's transformation code should only support images and not higher-dimensional matrix operations.
I do understand that removing legacy code is not simple, and yet computer vision is more than uint8 images.
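As a concrete point of reference, torch core already handles the volumetric case end to end; a minimal sketch applying an identity affine to an (N, C, D, H, W) volume:

```python
import torch
import torch.nn.functional as F

vol = torch.rand(1, 1, 32, 32, 32)    # a 5-D volume: (N, C, D, H, W)
theta = torch.eye(3, 4).unsqueeze(0)  # (N, 3, 4) identity affine in homogeneous form

# Both affine_grid and grid_sample support the volumetric case directly.
grid = F.affine_grid(theta, size=list(vol.shape), align_corners=False)
out = F.grid_sample(vol, grid, mode="bilinear", align_corners=False)
```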
Motivation, pitch
Static images, images in motion, images extracted from a 3d world, or images projected into the 3d world have no substantive difference: they are all classical computer vision. The trend to re-unify these spaces is here, from 3d GANs to neural rendering.
Why should one not use pytorch3d instead? It is also cumbersome in that, again, it treats 3d as separate from 2d rather than as a continuum within the field of computer vision.
Alternatives
To use drjit?
Additional context
No response