
Port RASampler functionality to iterable datasets #6018


Open
pmeier opened this issue May 16, 2022 · 7 comments

@pmeier
Collaborator

pmeier commented May 16, 2022

Our current classification references use the "Repeated Augment Sampler" (RASampler) from #5051:

class RASampler(torch.utils.data.Sampler):

Since, after the revamp, we will have iterable- rather than map-style datasets, samplers are no longer supported.

Given that the RASampler increases accuracy, we need to support the same functionality going forward. It can probably be achieved by appending a custom RepeatedAugmentIterDataPipe to the dataset graph, but we need to make sure it works correctly with shuffling and sharding.
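To make the idea concrete, here is a minimal, hypothetical sketch of what such a `RepeatedAugmentIterDataPipe` could look like (the name comes from this issue; the actual torchdata `IterDataPipe` base class and registration mechanics are omitted). The key ordering constraint is that it would sit after shuffling but before sharding, so that the repeated copies can later be split across ranks:

```python
# Hypothetical sketch only: a pipe that yields each sample of the wrapped
# iterable `num_repeats` times in a row. A real implementation would
# subclass torchdata's IterDataPipe; this stand-in just shows the logic.
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

class RepeatedAugmentIterDataPipe:
    def __init__(self, source: Iterable[T], num_repeats: int = 3) -> None:
        self.source = source
        self.num_repeats = num_repeats

    def __iter__(self) -> Iterator[T]:
        for sample in self.source:
            # Emit consecutive copies; downstream sharding/transforms decide
            # how the copies end up augmented differently.
            for _ in range(self.num_repeats):
                yield sample

pipe = RepeatedAugmentIterDataPipe(["a", "b"], num_repeats=3)
print(list(pipe))  # ['a', 'a', 'a', 'b', 'b', 'b']
```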

cc @pmeier @YosuaMichael @datumbox @vfdev-5 @bjuncek

@NicolasHug
Member

NicolasHug commented May 16, 2022

The RASampler is currently implemented by dispatching all copies of the same sample to different GPUs.
Different GPUs use different RNGs, so the different copies are all transformed differently, which is the end goal.

It's fairly hacky: it relies on the existence of DDP, and on the assumption that different GPUs use different RNGs. This is usually true, but there is no general guarantee for arbitrary user code.
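The dispatching trick can be illustrated with plain index arithmetic. This is a simplified sketch, not the actual torchvision reference code: every shuffled index is repeated, and each rank then takes a strided slice, so the copies of one sample land on different ranks — which, as noted above, means every rank ends up with the same index list and relies on its own RNG to transform the copies differently.

```python
# Simplified sketch of the index logic described above (not the actual
# reference implementation): repeat each shuffled index, then give each
# rank a strided slice so copies of the same sample go to different ranks.
def ra_indices(indices, num_replicas, rank, num_repeats=3):
    repeated = [i for i in indices for _ in range(num_repeats)]
    # Strided sharding: consecutive copies land on consecutive ranks.
    return repeated[rank::num_replicas]

shuffled = [7, 2, 5]
for rank in range(3):
    print(rank, ra_indices(shuffled, num_replicas=3, rank=rank))
# Every rank receives [7, 2, 5]: the same indices, hence the same decoded
# samples, transformed differently only because each rank's RNG differs.
```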

I think we should be able to implement a datapipe-compatible RASampler by simply not sharding across GPUs (or only partially sharding). Still hacky, but not much more than what we currently have.
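One way to read the "not sharding" suggestion — a purely hypothetical sketch, not code proposed in this issue — is that each rank keeps whole samples, repeats them locally, and draws fresh randomness per copy, so no cross-GPU RNG assumption is needed at all:

```python
import random

# Hypothetical sketch of the "don't shard the copies" approach: repeat each
# sample locally and apply a random transform with a fresh RNG draw per
# copy, so the copies differ without relying on DDP or per-GPU RNG state.
def repeated_augment(samples, transform, num_repeats=3):
    for sample in samples:
        for _ in range(num_repeats):
            yield transform(sample)  # fresh randomness on every call

rng = random.Random(0)
jitter = lambda x: x + rng.random()  # stand-in for a random augmentation
out = list(repeated_augment([10.0, 20.0], jitter, num_repeats=2))
# Four augmented values; the two copies of each sample differ.
```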

@datumbox
Contributor

@NicolasHug Agreed that the previous implementation is hacky and can be simplified/improved. I've asked Philip to create a separate ticket so that I can remove the feature from the Batteries Included 2 project. Feel free to edit the description to outline potential changes to the implementation. As long as the new class provides the same results as the previous one, we should be OK to make changes to its implementation.

@pmeier
Collaborator Author

pmeier commented May 16, 2022

One thing that came to mind while thinking about this is that we will be wasting quite a lot of I/O when using this functionality. The prototype datasets read the raw bytes and store them in a tensor for later decoding. Since we will throw away the majority of the samples with this technique, all of those reads are wasted. I don't know if that is significant, since we only read and don't decode, but we should watch out for it.

@NicolasHug
Member

Another sub-optimal thing is that we would be reading and decoding the same sample multiple times instead of once. That's already a problem with our current implementation though.

@datumbox
Contributor

@NicolasHug Are you sure it's a problem in the current implementation? Doesn't the decoding happen after the indexing op? My understanding is that it isn't a problem, but I could be wrong.

@NicolasHug
Member

The decoding happens after the indexing, but the indices are the same across GPUs — at least according to the docstring. So multiple GPUs will receive the same index and hence decode the same sample (and then transform it differently).

@datumbox
Contributor

I see what you mean: we decode the same sample over and over. That's expected, and I agree it was already happening in the previous implementation.

For a moment I thought you meant that the new API would require us to decode the part of the dataset that we throw away, which would obviously be a problem. Sorry for the confusion.
