[RFC] How should datasets handle decoding of files? #5075
Controversial thought: remove the `decoder` argument altogether. We can have a custom type that represents an encoded file. The user can then select whatever type of decoding they want afterwards.
I think this is a good approach as long as we can make it as seamless as what I proposed above. I currently see one issue: if we separate the decoder from the dataset, how do we use dataset-specific handlers (keeping the same terminology as in my comment above) in a non-verbose way? With my proposal, this can be handled internally, so the dataset construction looks like this:

```python
from torchvision.prototype import datasets

dataset = datasets.load("foo")
```

With your proposal, we first need to load the static info of the dataset and later construct a `Decoder` manually:

```python
from torchvision.prototype import datasets

name = "foo"
info = datasets.info(name)
dataset = datasets.load(name)
dataset = dataset.map(
    datasets.decoder.Decoder(
        *info.handlers,
        *datasets.decoder.default_handlers(),
    )
)
```
One possible solution that came to mind is to attach the default decoder to the dataset, so that it can be applied separately:

```python
from torchvision.prototype import datasets

dataset = datasets.load("foo")
dataset = datasets.utils.Decoder(dataset)
```

Optionally, we could also expose this through a method:

```python
from torchvision.prototype import datasets

dataset = datasets.load("foo").decode()
```
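As a rough illustration of the wrapper idea, here is a minimal sketch; the class name, the `default_decoder` attribute, and the sample layout are assumptions for illustration, not an actual torchvision API:

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

# Hypothetical datapipe-style wrapper that applies a decoding function
# to every sample while iterating.
class Decoder:
    def __init__(
        self,
        dataset: Iterable[Dict[str, Any]],
        decode_fn: Optional[Callable[[Dict[str, Any]], Dict[str, Any]]] = None,
    ) -> None:
        self.dataset = dataset
        # Fall back to a default decoder attached to the dataset, if any
        # ("default_decoder" is an assumed attribute name).
        self.decode_fn = decode_fn or getattr(
            dataset, "default_decoder", lambda sample: sample
        )

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        for sample in self.dataset:
            yield self.decode_fn(sample)
```

A fluent `datasets.load("foo").decode()` could then simply return such a wrapper around the dataset itself.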
I have a question regarding video files. Do you want to support custom decoders for the dataset? IIUC, the dataset is a DataPipe instance, so you can use `RoutedDecoder`.
cc @bjuncek
We need to. Some datasets use either non-standard files or formats with sub-par support from the default decoders. See 1. and 2. in my top post.
Let's discuss it in the team meeting then. I think we can release it to you. Letting the domains handle the corresponding decoders makes more sense to me.
After some more discussion, we realized there is another requirement: even without decoding, the sample dictionary should be serializable. This eliminates the possibility of using custom file wrappers as originally thought up in #5075 (comment). Our current idea is to always read each file and store the encoded bytes in a uint8 tensor. This has two advantages.

The only downside we are currently seeing is that we lose the ability to not load the data at all. Given that the time to read the bytes is usually dwarfed by the decoding time, we feel this is a good compromise. You can find a proof-of-concept implementation in #5105.
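A minimal sketch of the read-without-decoding idea, assuming PyTorch is available; the function name is made up for illustration:

```python
import torch

def read_encoded(path: str) -> torch.Tensor:
    # Read the raw, still-encoded bytes of the file ...
    with open(path, "rb") as file:
        data = file.read()
    # ... and store them in a uint8 tensor. The sample stays serializable,
    # and decoding can happen later, or not at all.
    return torch.frombuffer(bytearray(data), dtype=torch.uint8)
```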
It is a common feature request (for example #4991) to be able to disable the decoding when loading a dataset. To solve this, we added a `decoder` keyword argument to the load mechanism (`torchvision.prototype.datasets.load(..., decoder=...)`). It takes an `Optional[Callable]`: if a callable is passed, it receives a buffer from the dataset and its result is integrated into the sample dictionary; if `None` is passed, the buffer itself is integrated into the sample dictionary instead, leaving the decoding to the user (see vision/torchvision/prototype/datasets/_builtin/imagenet.py, lines 132 to 134 at 4cacf5a).
This works well for images, but already breaks down for videos, as discovered in #4838. The issue is that decoding a video results in more information than a single tensor. The tentative plan in #4838 was to change the signature so that a decoder can return arbitrary information, which is then integrated into the top level of the sample dictionary.
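To illustrate the changed contract, here is a hedged sketch of a decoder that returns a dictionary instead of a single tensor; the function names and dictionary keys are placeholders, and the "decoding" is only stubbed:

```python
from typing import Any, BinaryIO, Callable, Dict, Optional

# Placeholder video "decoder": a real one would return frame and audio
# tensors plus metadata; here only the dict-shaped return value matters.
def video_decoder(buffer: BinaryIO) -> Dict[str, Any]:
    return {"video": buffer.read(), "video_fps": 30.0}

def build_sample(
    path: str,
    buffer: BinaryIO,
    decoder: Optional[Callable[[BinaryIO], Dict[str, Any]]],
) -> Dict[str, Any]:
    sample: Dict[str, Any] = {"path": path}
    if decoder is None:
        # No decoding: hand the raw buffer to the user.
        sample["buffer"] = buffer
    else:
        # Arbitrary decoder output is merged into the top level of the sample.
        sample.update(decoder(buffer))
    return sample
```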
Unfortunately, looking ahead, I don't think even this architecture will be sufficient. Two issues come to mind:

1. Files other than images, e.g. `.mat`, `.xml`, or `.flo`, will always be decoded. Thus, the user can't completely deactivate the decoding after all. Furthermore, they also cannot use any custom decoding for these types if need be.
2. Some datasets use `.png` images as annotations, which have sub-par support by `Pillow`.

To overcome this, I propose a new architecture that is similar to the `RoutedDecoder` datapipe. We should have a `Decoder` class that holds a sequence of `Handler`s (name up for discussion). If called with a path-buffer pair, the decoder iterates through the registered handlers and returns the first valid output. Thus, each handler can determine, based on the path, whether it is responsible for decoding the current buffer. By default, the decoder will bail if no handler decoded the input. This can be relaxed by the `must_decode=False` flag (name up for discussion), which is a convenient way to have a non-decoder.

We would need to change the `datasets.load` function accordingly. By default, the user would get the dataset-specific handlers as well as the default ones. By supplying custom ones, they would be processed with a higher priority, thus overwriting the default behavior if need be. If `None` is passed, we get a true non-decoder. Finally, by passing a `Decoder` instance, the user has full control over the behavior.

Within the dataset definition, the call to the `decoder` would stay simple, also when multiple buffers need to be decoded.
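The proposed architecture could be sketched roughly like this; the names (`Decoder`, `Handler`, `must_decode`) follow the discussion above, but the implementation details are assumptions:

```python
import pathlib
from typing import Any, Callable, Optional, Sequence

# A handler inspects the path and either decodes the buffer or returns
# None to signal that it is not responsible for this file type.
Handler = Callable[[str, bytes], Optional[Any]]

class Decoder:
    def __init__(self, *handlers: Handler, must_decode: bool = True) -> None:
        self.handlers: Sequence[Handler] = handlers
        self.must_decode = must_decode

    def __call__(self, path: str, buffer: bytes) -> Any:
        # Iterate through the registered handlers and return the first
        # valid (non-None) output.
        for handler in self.handlers:
            output = handler(path, buffer)
            if output is not None:
                return output
        if self.must_decode:
            raise RuntimeError(f"no handler could decode {path}")
        # must_decode=False turns this into a convenient non-decoder.
        return buffer

# Example handler: only feels responsible for .txt files.
def txt_handler(path: str, buffer: bytes) -> Optional[str]:
    if pathlib.Path(path).suffix != ".txt":
        return None
    return buffer.decode("utf-8")
```

Since the first matching handler wins, user-supplied handlers placed before the defaults would override them, matching the priority behavior described above.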
cc @pmeier @bjuncek