
Port RASampler functionality to iterable datasets #6018


Open
pmeier opened this issue May 16, 2022 · 7 comments

@pmeier
Collaborator

pmeier commented May 16, 2022

Our current classification references use the "Repeated Augment Sampler" (RASampler) from #5051:

class RASampler(torch.utils.data.Sampler):

Since, after the revamp, we will have iterable- rather than map-style datasets, samplers are no longer supported.

Given that the RASampler increases accuracy, we need to support the same functionality going forward. It can probably be achieved by appending a custom RepeatedAugmentIterDataPipe to the dataset graph, but we need to make sure it works correctly with shuffling and sharding.
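To make the idea concrete, here is a minimal, hypothetical sketch of what such a `RepeatedAugmentIterDataPipe` could look like (the name comes from this issue; the actual torchdata `IterDataPipe` base class and registration mechanics are omitted). The key ordering constraint is that it would sit after shuffling but before sharding, so that the repeated copies can later be split across ranks:

```python
# Hypothetical sketch only: a pipe that yields each sample of the wrapped
# iterable `num_repeats` times in a row. A real implementation would
# subclass torchdata's IterDataPipe; this stand-in just shows the logic.
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

class RepeatedAugmentIterDataPipe:
    def __init__(self, source: Iterable[T], num_repeats: int = 3) -> None:
        self.source = source
        self.num_repeats = num_repeats

    def __iter__(self) -> Iterator[T]:
        for sample in self.source:
            # Emit consecutive copies; downstream sharding/transforms decide
            # how the copies end up augmented differently.
            for _ in range(self.num_repeats):
                yield sample

pipe = RepeatedAugmentIterDataPipe(["a", "b"], num_repeats=3)
print(list(pipe))  # ['a', 'a', 'a', 'b', 'b', 'b']
```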

cc @pmeier @YosuaMichael @datumbox @vfdev-5 @bjuncek

@NicolasHug
Member

NicolasHug commented May 16, 2022

The RASampler is currently implemented by dispatching all copies of the same sample to different GPUs.
Different GPUs use different RNGs, so the different copies are all transformed differently, which is the end goal.

It's fairly hacky: it relies on the existence of DDP, and on the assumption that different GPUs use different RNGs. This is usually true, but there is no general guarantee for arbitrary user code.
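The dispatching trick can be illustrated with plain index arithmetic. This is a simplified sketch, not the actual torchvision reference code: every shuffled index is repeated, and each rank then takes a strided slice, so the copies of one sample land on different ranks — which, as noted above, means every rank ends up with the same index list and relies on its own RNG to transform the copies differently.

```python
# Simplified sketch of the index logic described above (not the actual
# reference implementation): repeat each shuffled index, then give each
# rank a strided slice so copies of the same sample go to different ranks.
def ra_indices(indices, num_replicas, rank, num_repeats=3):
    repeated = [i for i in indices for _ in range(num_repeats)]
    # Strided sharding: consecutive copies land on consecutive ranks.
    return repeated[rank::num_replicas]

shuffled = [7, 2, 5]
for rank in range(3):
    print(rank, ra_indices(shuffled, num_replicas=3, rank=rank))
# Every rank receives [7, 2, 5]: the same indices, hence the same decoded
# samples, transformed differently only because each rank's RNG differs.
```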

I think we should be able to implement a datapipe-compatible RASampler by simply not sharding across GPUs (or only partially sharding). Still hacky, but not much more than what we currently have.
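One way to read the "not sharding" suggestion — a purely hypothetical sketch, not code proposed in this issue — is that each rank keeps whole samples, repeats them locally, and draws fresh randomness per copy, so no cross-GPU RNG assumption is needed at all:

```python
import random

# Hypothetical sketch of the "don't shard the copies" approach: repeat each
# sample locally and apply a random transform with a fresh RNG draw per
# copy, so the copies differ without relying on DDP or per-GPU RNG state.
def repeated_augment(samples, transform, num_repeats=3):
    for sample in samples:
        for _ in range(num_repeats):
            yield transform(sample)  # fresh randomness on every call

rng = random.Random(0)
jitter = lambda x: x + rng.random()  # stand-in for a random augmentation
out = list(repeated_augment([10.0, 20.0], jitter, num_repeats=2))
# Four augmented values; the two copies of each sample differ.
```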

@datumbox
Contributor

@NicolasHug Agreed that the previous implementation is hacky and can be simplified/improved. I've asked Philip to create a separate ticket so that I can remove the feature from the Batteries Included 2 project. Feel free to edit the description to outline potential changes to the implementation. As long as the new class provides the same results as the previous one, we should be OK to make changes to its implementation.

@pmeier
Collaborator Author

pmeier commented May 16, 2022

One thing that came to mind while thinking about this is that we will be wasting quite a lot of I/O when using this functionality. The prototype datasets read the raw bytes and store them in a tensor for later decoding. Since we will throw away the majority of the samples with this technique, all of those reads are wasted. I don't know if that is significant, since we only read and don't decode, but we should watch out for it.

@NicolasHug
Member

Another sub-optimal thing is that we would be reading and decoding the same sample multiple times instead of once. That's already a problem with our current implementation though.

@datumbox
Contributor

@NicolasHug Are you sure it's a problem in the current implementation? Doesn't the decoding happen after the indexing op? My understanding is that it isn't a problem, but I could be wrong.

@NicolasHug
Member

The decoding happens after the indexing, but the indices are the same across GPUs — at least according to the docstring. So multiple GPUs will receive the same index and hence decode the same sample (and then transform it differently).

@datumbox
Contributor

I see what you mean: we decode the same sample over and over. That's expected, and I agree it was already happening in the previous implementation.

For a moment I thought you meant that the new API would require us to decode the part of the dataset that we throw away, which would obviously be a problem. Sorry for the confusion.
