Storing very large image datasets on S3 #623
Comments
Hi @AndreiBarsan! Quick question, speaking just conceptually: can you think of your dataset as a single `N x H x W` array, i.e., a set of images stacked along the first dimension? If so, do you expect your users to always read one whole image at a time, i.e., every read operation retrieves a single image by its index?
Hi @alimanfoo, thanks for the quick response! Yes, the dataset conceptually is an `N x H x W` array. The main use case would be people training ML models on this dataset using PyTorch, which uses a pool of workers to read samples from the disk/S3/whatever in order to build mini-batches of data for training. Each worker loads one sample at a time, selected uniformly at random from the full dataset.
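For concreteness, here is a minimal sketch of that access pattern, assuming the data lives in an `N x H x W` Zarr array on S3; the bucket path, class name, and worker settings below are placeholders, not anything from this thread:

```python
import s3fs
import torch
import zarr
from torch.utils.data import DataLoader, Dataset


class ZarrImageDataset(Dataset):
    """Reads one image per __getitem__ from an N x H x W Zarr array stored on S3."""

    def __init__(self, s3_root: str):
        self.s3_root = s3_root  # e.g. "my-bucket/images.zarr" (placeholder)
        self._array = None      # opened lazily so each worker process gets its own handle

    def _array_handle(self):
        if self._array is None:
            fs = s3fs.S3FileSystem()
            store = s3fs.S3Map(root=self.s3_root, s3=fs, check=False)
            self._array = zarr.open(store, mode="r")
        return self._array

    def __len__(self):
        return self._array_handle().shape[0]

    def __getitem__(self, idx):
        # One read == one image: index a single slice along the first axis.
        image = self._array_handle()[idx]
        return torch.from_numpy(image)


# Random sampling across a pool of workers, as described above.
loader = DataLoader(ZarrImageDataset("my-bucket/images.zarr"),
                    batch_size=32, shuffle=True, num_workers=8)
```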
IIUC, your raw images are all the same size (`H x W`); it's only the WebP-encoded representations that end up with different byte counts.

The "proper" way to do this with Zarr would be to use one of Zarr's compressors (via numcodecs). That way the Zarr array retains its regular `N x H x W` shape, and the compression happens per chunk under the hood.

If you don't go this route, it's not clear to me you actually need Zarr. You might be best off just using s3fs directly to read / write your compressed bytes on S3, and then have your own logic for decompression.
Yes.
I think this would definitely be possible! I encode the images to bytes with just 2-3 lines of Python + Pillow, so this sounds doable.
I see what you mean. But if I let Zarr handle the compression, and a user is reading random samples from the dataset, then for each sample Zarr would fetch its whole chunk (say, 10+ images, to get decently sized chunks), right? Or is there a way for Zarr to do partial S3 reads? Otherwise the user would end up having to discard most of the bytes they read, right?
You're correct that Zarr doesn't support partial chunk reads (yet). Your chunk size choice should be informed by your anticipated usage pattern. What is N?
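To illustrate the trade-off just described (a sketch with made-up shapes, bucket path, and compressor choice, not code from this thread): chunking along the first axis only means every random read fetches exactly one chunk, i.e., exactly one S3 object per image.

```python
import numcodecs
import s3fs
import zarr

N, H, W = 10_000_000, 512, 512  # hypothetical dataset dimensions

fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/images.zarr", s3=fs, check=False)

# One image per chunk: reading z[i] downloads exactly one compressed chunk,
# so no bytes are discarded. Larger chunks (e.g. (16, H, W)) reduce the
# object/request count but force each random read to fetch ~16 images.
z = zarr.open(
    store,
    mode="w",
    shape=(N, H, W),
    chunks=(1, H, W),
    dtype="uint8",
    compressor=numcodecs.Blosc(cname="zstd", clevel=3),
)
```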
Numcodecs encodes / decodes numpy arrays. If you go this route, you could either write your own custom codec (e.g. a WebP codec wrapping Pillow) or contribute such a codec upstream to numcodecs.
We'd be happy to help you over at https://github.com/zarr-developers/numcodecs/.
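To make the custom-codec route concrete, here is a rough sketch of what a WebP codec built on Pillow might look like. This is not an existing numcodecs codec; the class name, codec id, and parameters are made up, and a real implementation would need to handle chunk layout and shape/dtype metadata properly.

```python
import io

import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec
from PIL import Image


class WebPCodec(Codec):
    """Hypothetical codec: one H x W x 3 uint8 RGB image per chunk.

    Simplified sketch: it only squeezes away a length-1 leading chunk axis
    (as with chunks=(1, H, W, 3)) and does not track shape/dtype metadata.
    """

    codec_id = "webp_sketch"  # made-up id, not registered by numcodecs itself

    def __init__(self, quality=80):
        self.quality = quality

    def encode(self, buf):
        # The chunk arrives as an ndarray-like buffer; encode it to WebP bytes.
        arr = np.ascontiguousarray(buf).squeeze()
        out = io.BytesIO()
        Image.fromarray(arr).save(out, format="WEBP", quality=self.quality)
        return out.getvalue()

    def decode(self, buf, out=None):
        # Decode the WebP bytes back into a numpy array.
        arr = np.asarray(Image.open(io.BytesIO(bytes(buf))))
        if out is not None:
            out[...] = arr.reshape(out.shape)
            return out
        return arr

    def get_config(self):
        return {"id": self.codec_id, "quality": self.quality}


register_codec(WebPCodec)
```

If something along these lines worked, it could in principle be passed as the array's `compressor` when creating it, though lossy round-trips mean the decoded chunk will not be byte-identical to what was written.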
Thank you for the follow-up. N here is the number of samples in the dataset. It can be up to 10M in our case. Thank you for the info about numcodecs, Ryan, we will look into that! For now, it seems that the S3 bottleneck is actually the number of requests made during training, and not the number of objects, so we will shift to investigating whether that can be addressed.
This issue was moved to a discussion. You can continue the conversation there.
Problem description
Hi Zarr Team!
We are interested in storing a large ML dataset of images on S3. The size will be over 30 TB, likely at least 50 TB. The dataset is mostly images, which have to be stored compressed (WebP) to save storage; we can't use a regular `N x H x W` dataset, since that would make its size an order of magnitude bigger. The workloads will mostly be ML training, so images will need to be read randomly most of the time.

We are particularly interested in leveraging Zarr's ability to read parts of datasets from S3, which as I currently understand is non-trivial with other formats such as `hdf5`. As such, we end up with a ragged array, since different images end up encoded as different byte counts.
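One way Zarr can express such a ragged array is a 1-D object array with the `VLenBytes` object codec, where each element is one pre-encoded image. The sketch below is illustrative only; the bucket path, sizes, chunking, and quality setting are placeholders.

```python
import io

import numcodecs
import numpy as np
import s3fs
import zarr
from PIL import Image

fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/images-ragged.zarr", s3=fs, check=False)

N = 1_000  # placeholder; the real dataset would have millions of samples

# A 1-D array of variable-length byte strings: each element holds one
# WebP-encoded image, so elements naturally differ in size.
z = zarr.create(
    shape=(N,),
    chunks=(64,),                          # ~64 encoded images per S3 object
    dtype=object,
    object_codec=numcodecs.VLenBytes(),
    fill_value=None,                       # as in Zarr's object-array examples
    store=store,
    overwrite=True,
)


def encode_webp(image: np.ndarray) -> bytes:
    out = io.BytesIO()
    Image.fromarray(image).save(out, format="WEBP", quality=80)
    return out.getvalue()


# Write one sample and read it back.
z[0] = encode_webp(np.zeros((512, 512, 3), dtype=np.uint8))
decoded = np.asarray(Image.open(io.BytesIO(z[0])))
```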
I have a couple of questions about this set-up:
I am using the latest version of Zarr, v2.4.0.
Thank you, and please let me know if I can help by providing additional information.