Description
This discussion started in #5500 (comment) and @vfdev-5 and I continued offline.
PIL as well as our image reading functions support RGBA images (see `torchvision/io/image.py`, lines 16 to 31 at 95d4189), but our color transformations currently only support RGB images, ignoring an extra alpha channel. This leads to wrong results. One thing we agreed upon is that these transforms should fail if anything but 3 channels is detected.
Still, some datasets include non-RGB images, so we need to deal with this for a smooth UX. Previously, we implicitly converted every image to RGB before returning it from a dataset (see `torchvision/datasets/folder.py`, lines 245 to 249 at f9fbc10).
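For context, the old behavior can be sketched roughly as follows. This is a minimal illustration of an implicit-conversion loader, not the exact code from `folder.py`; the function name `pil_loader` is the conventional one, but the details here are assumptions:

```python
from PIL import Image


def pil_loader(path):
    # Open the file and force-convert to RGB, collapsing grayscale,
    # palette, and RGBA inputs into three channels. Note that
    # PIL's convert("RGB") simply drops the alpha channel for RGBA
    # inputs; it does not composite onto a background color.
    with open(path, "rb") as f:
        img = Image.open(f)
        return img.convert("RGB")
```

Because this conversion happened inside the dataset, users never saw a non-RGB image, regardless of what was on disk.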
Since we no longer decode images in the datasets, we need to provide a solution for the users here. I currently see two possible options:
- We could deal with this on a per-image basis within the dataset. For example, the train split of ImageNet contains a single RGBA image. We could simply perform an appropriate conversion for irregular image modes in the dataset, so the issue is abstracted away from the user. `tensorflow-datasets` uses this approach: https://github.com/tensorflow/datasets/blob/a1caff379ed3164849fdefd147473f72a22d3fa7/tensorflow_datasets/image_classification/imagenet.py#L105-L131
- The most common non-RGB images in datasets are grayscale images. For example, the train split of ImageNet contains 19970 grayscale images. Thus, users will need a
  `transforms.ConvertImageColorSpace("rgb")`
  in most cases anyway. If that transform also supported RGBA to RGB conversion, the problem would be solved as well. The conversion happens with this formula: `pixel_new = (1 - alpha) * background + alpha * pixel_old`, where `pixel_{old|new}` is a single value from a color channel. Since we don't know `background`, we need to either make assumptions or require the user to provide a value for it. I'd wager a guess that in 99% of cases the background is white, i.e. `background == 1`, but we can't be sure about that. Another issue is that the user has no way to set the background on a per-image basis in the transforms pipeline if that is needed.

  In the special case where `alpha == 1` everywhere, the equation above simplifies to `pixel_new = pixel_old`, which is equivalent to stripping the alpha channel. We could check for that and only perform the RGBA to RGB transform if the condition holds or the user supplies a background color.