Full dataset takes up more storage than Google drive can store #2


Open
nabilLearns opened this issue Jan 2, 2022 · 1 comment

@nabilLearns
Owner

The full set of images provided by the NIH Clinical Center is 45.6 GB, while Google Drive's free tier offers only 15 GB of storage. Perhaps I can make do with a smaller subset of the data, but this will likely result in poor classification accuracy.

TODO:

  • find a way to work with full dataset

Perhaps I can compress or resize the images before uploading them to Drive. This trades image quality for the number of images I can store, but I'm not sure how much quality I can afford to lose by resizing.
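A quick back-of-envelope sketch of the resize option. This assumes the images are 1024×1024 (the resolution the NIH chest X-rays are commonly distributed at; an assumption here) and that file size scales roughly linearly with pixel count, which only holds approximately for compressed formats like PNG:

```python
# How much must each image shrink to fit the full 45.6 GB dataset
# into Google Drive's 15 GB free tier?
dataset_gb = 45.6
budget_gb = 15.0

shrink_factor = dataset_gb / budget_gb   # total size reduction needed, ~3.04x
side_scale = shrink_factor ** 0.5        # per-side scale, since area ~ side^2

# Assuming 1024x1024 source images, the per-side target would be roughly:
target_side = int(1024 / side_scale)

print(f"need ~{shrink_factor:.2f}x size reduction")
print(f"resize 1024px -> ~{target_side}px per side")
```

So even under these rough assumptions, the images would need to drop to somewhere around 590 px per side, which may or may not be acceptable for classification.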

@nabilLearns
Owner Author

One option is to use Amazon S3 for image storage. However, S3 is free only for 5 GB of storage (for 12 months). One approach I could take to minimize cost is to:

  1. encode the images into bytes,
  2. compress the byte representations with zarr, and then
  3. upload the compressed representations to S3 instead of the original images.

I am not sure how much storage I can expect to save with this approach. Perhaps this is something to investigate in the future.
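The steps above can be sketched with the standard library alone. The issue proposes zarr; here zlib (one of the general-purpose codecs zarr can be configured with) stands in so the sketch needs no third-party packages, and a synthetic byte string stands in for a real image file:

```python
import zlib

# Fake "image" bytes: a run of zeros plus a repeating pattern,
# loosely mimicking the dark borders and texture of an X-ray.
# (Synthetic data for illustration only.)
raw = bytes([0] * 2048 + list(range(256)) * 8)  # 4096 bytes total

# Step 2: compress the byte representation before upload.
compressed = zlib.compress(raw, level=9)

ratio = len(raw) / len(compressed)
print(f"{len(raw)} B -> {len(compressed)} B ({ratio:.1f}x smaller)")

# Round-trip check: decompression must reproduce the original bytes.
assert zlib.decompress(compressed) == raw
```

How well this works in practice depends heavily on the input: PNGs are already compressed, so recompressing them gains little, whereas raw or resized pixel arrays (the case zarr is designed for) can compress substantially. Step 3 would then upload `compressed` to S3 with a client such as boto3.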
