Include a checksum for Zarr archives in the Dataset checksum #102
Conversation
Mostly uses the code from the zarr-checksum library
This looks pretty good to me. Specifically the …
Co-authored-by: Andrew Quirke <[email protected]>
Thanks, @Andrewq11. To answer your questions:
Because we want to know whether the entire dataset was uploaded / downloaded successfully. If not, a Dataset should, for example, not be made public on the Hub. This means we not only need to verify the data integrity of every individual chunk, but also need to check that those chunks are organized as expected within the Zarr archive.
One issue with changing hashing functions is that we need to make sure the Hub supports the same hashing function. One argument in favor of md5 is that I believe the md5 hash is already returned "for free" in the headers when we fetch a file from the R2 bucket.
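If that holds, reading the MD5 on download would be essentially free. A minimal sketch with boto3 against R2's S3-compatible API (the endpoint, bucket, and key below are placeholders, and note the ETag only equals the MD5 digest for single-part uploads):

```python
import boto3

# R2 exposes an S3-compatible API; endpoint and credentials are placeholders.
s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")

# HEAD returns the object's headers without downloading the body.
head = s3.head_object(Bucket="polaris-data", Key="my-dataset.zarr/data/0.0")

# For single-part uploads, the ETag is the hex MD5 digest wrapped in quotes.
md5_hex = head["ETag"].strip('"')
```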
So we would need to download the Zarr archive to the Hub directly? Is this the correct flow: the client computes the checksum for the Zarr archive -> uploads it to R2 -> the Hub downloads the dataset -> the Hub computes the checksum and confirms it matches? If so, I could see this being an issue on the Hub when datasets get sufficiently large. We would likely need a dedicated service with enough memory and disk space to complete the download and perform the check.
Makes sense, thank you.
That would be very handy if so. This seems like something worth benchmarking in the future, to see if we could get any significant gains from different hashing functions.
Aside from a few suggestions here, Cas and I did some brainstorming about this feature.
We have two use cases for this feature:
- Checking Zarr chunk integrity on download in the client
- Checking upload integrity and completion in the Hub
To support both, we came to the conclusion that we need two things:
1. Persist the whole Zarr checksum tree in the Hub
We can do this as a JSON structure in the Dataset table. This will allow us to check, on the Hub side, that every part of the Zarr archive has been uploaded, and that every part is correct.
There might be some concern about how big that JSON structure would become for the largest dataset (RXRX3), but we'll have to see how that evolves.
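As a rough illustration of what we would persist (the field names here are invented for this sketch, not the Hub's actual schema):

```python
import json

# Hypothetical shape of the persisted checksum tree; the real schema may differ.
checksum_tree = {
    "name": "my-dataset.zarr",
    "digest": "d41d8cd98f00b204e9800998ecf8427e",  # aggregate checksum for the archive
    "children": [
        {"name": ".zarray", "digest": "a1b2c3...", "size": 312},
        {"name": "data/0.0", "digest": "4e5f6a...", "size": 1048576},
        {"name": "data/0.1", "digest": "7b8c9d...", "size": 1048576},
    ],
}

# Serialized into a JSON column on the Dataset row in the Hub DB.
payload = json.dumps(checksum_tree)
```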
2. Use checksums on individual files, stored as metadata
Unfortunately, R2 does not support the automated checksum generation for downloaded files that S3 does. We'll need to work around that by computing each file's checksum prior to upload, storing it as object metadata, and checking it again on download.
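A minimal sketch of that round trip with boto3 (the bucket, key, and metadata field name are assumptions for illustration; `ContentMD5` additionally lets the server reject a corrupted upload outright):

```python
import base64
import hashlib
import boto3

s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")

def upload_with_checksum(data: bytes, bucket: str, key: str) -> None:
    md5 = hashlib.md5(data)
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        # The server verifies the uploaded body against this digest.
        ContentMD5=base64.b64encode(md5.digest()).decode(),
        # Persist the hex digest as user metadata for later verification.
        Metadata={"md5": md5.hexdigest()},
    )

def download_with_checksum(bucket: str, key: str) -> bytes:
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = obj["Body"].read()
    if hashlib.md5(body).hexdigest() != obj["Metadata"]["md5"]:
        raise ValueError(f"Checksum mismatch for {key}")
    return body
```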
Just came across this: instead of manually saving the checksum to R2, we could also use a codec like this one: https://numcodecs.readthedocs.io/en/latest/checksum32.html. Although this lets us verify the integrity of individual chunks, I don't think it covers our use case: we would also need to be able to look up the checksums on the Hub to verify that the entire dataset was uploaded correctly.
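For reference, here is roughly how such a codec would be wired in, as a sketch assuming zarr v2's filters API: the codec appends a CRC32 to each chunk on write and verifies it on read, raising on corruption.

```python
import numpy as np
import zarr
from numcodecs import CRC32

# Each chunk is stored with a trailing CRC32 that is checked when decoded.
z = zarr.open(
    "example.zarr",  # hypothetical local store
    mode="w",
    shape=(1_000_000,),
    chunks=(100_000,),
    dtype="i8",
    filters=[CRC32()],
)
z[:] = np.arange(1_000_000)

# Reading back decodes each chunk and verifies its checksum.
assert (z[:10] == np.arange(10)).all()
```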
Added some questions and comments, but looks good overall to me. Nice work. 😄
This looks great to me. Nice work.
Very nice! 😄
Co-authored-by: Julien St-Laurent <[email protected]>
Changelogs
- … caching from the PolarisFileSystem
Checklist:
- PR is labeled feature, fix or test (or ask a maintainer to do it for you).

How it works
The Polaris implementation is heavily based on the zarr-checksum package. This algorithm computes a checksum per file in the Zarr archive and then works its way up the file system hierarchy in a deterministic order to compute a checksum for the entire archive.

Note
This implies the checksum can change even if the content doesn't, for example when you rechunk the dataset. This fits the data integrity use case of Polaris.
The checksum per file is saved in a manifest, which is stored in the Hub DB and used by the Hub to verify the integrity of the archive on upload.
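To make that concrete, here is a simplified sketch of the bottom-up computation (it mirrors the idea, not zarr-checksum's exact serialization format):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path) -> str:
    """Checksum of a single file, read in blocks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(2**20), b""):
            h.update(block)
    return h.hexdigest()

def tree_md5(root: Path) -> str:
    """Aggregate a directory checksum from its children, in sorted order."""
    parts = []
    for child in sorted(root.iterdir()):
        digest = tree_md5(child) if child.is_dir() else file_md5(child)
        parts.append(f"{child.name}:{digest}")
    # Hashing the deterministic listing makes the result reproducible, but
    # also sensitive to renames and rechunking, as noted above.
    return hashlib.md5("\n".join(parts).encode()).hexdigest()
```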
Data integrity is verified in two use cases:
- On download, in the client (per-chunk integrity).
- On upload, in the Hub (completeness and correctness of the whole archive).