Include a checksum for Zarr archives in the Dataset checksum #102


Merged: 35 commits merged into main from feat/zarr-checksum on Jul 12, 2024

Conversation

@cwognum (Collaborator) commented May 8, 2024

Changelogs

  • Integrated a checksum for Zarr archives into the Polaris Dataset checksum.
  • Made it possible to lazily compute the checksum for an entire dataset or benchmark.
  • Added verification of the checksum of each Zarr file on download.
  • Removed the broken ls caching from the PolarisFileSystem.
    • This slows down uploads and downloads, but we are working on a new JWT-based system anyway.
  • Added debug logs and set the default logging level to INFO.
  • In accordance with the Apache 2 license, added separate NOTICE and LICENSE files.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, or test (or ask a maintainer to do it for you).

How it works

The Polaris implementation is heavily based on the zarr-checksum package. This algorithm computes a checksum per file in the Zarr archive and then works its way up the file system hierarchy in a deterministic order to compute a checksum for the entire archive.

Note

This implies the checksum can change even if the content doesn't, for example when you rechunk the dataset. This fits the data integrity use case of Polaris.
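
To make the idea concrete, here is a simplified sketch of a tree-style checksum. It is not the exact zarr-checksum algorithm (which aggregates digests per directory), but it shows why the result depends on file layout: hashing the sorted (path, digest) pairs means rechunking changes the checksum even when the array values stay the same.

```python
import hashlib
from pathlib import Path


def checksum_zarr_archive(root: str) -> str:
    """Simplified sketch: hash every file in the archive, then combine
    the (relative path, file digest) pairs in a deterministic order."""
    root_path = Path(root)
    combined = hashlib.md5()
    # Sorting makes the traversal order deterministic across machines.
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        file_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        combined.update(str(path.relative_to(root_path)).encode())
        combined.update(file_md5.encode())
    return combined.hexdigest()
```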

The checksum per file is saved in a manifest, which is stored in the Hub DB and used by the Hub to verify the integrity of the archive on upload.

Data integrity is verified in two use cases:

  • When downloading a single chunk from the Zarr archive, the md5sum of that single chunk is retrieved from the headers and verified (see the sketch after this list).
  • When downloading the entire Zarr archive, the checksum for the entire dataset is computed and verified.
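
As a rough illustration of the first case, a download could compare the md5 of the received bytes against a digest advertised by the server. This is a hypothetical sketch, not the Polaris client code; the header name is an assumption (for non-multipart S3-style uploads, the ETag happens to equal the object's md5):

```python
import hashlib

import requests


def download_chunk_verified(url: str) -> bytes:
    """Hypothetical sketch: fetch one Zarr chunk and verify its md5
    against a digest taken from the response headers."""
    response = requests.get(url)
    response.raise_for_status()
    # Assumed header; for non-multipart uploads the S3-style ETag is the md5.
    expected_md5 = response.headers.get("ETag", "").strip('"')
    actual_md5 = hashlib.md5(response.content).hexdigest()
    if expected_md5 and actual_md5 != expected_md5:
        raise ValueError(f"Checksum mismatch: {actual_md5} != {expected_md5}")
    return response.content
```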


cwognum added 2 commits May 8, 2024 15:48
@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label May 8, 2024
@cwognum cwognum self-assigned this May 8, 2024
@Andrewq11 (Contributor) commented:

This looks pretty good to me. Specifically, the compute_zarr_checksum method you mentioned is where the bulk of our custom code lives. I left some general comments and clarifying questions above. As for your questions:

  1. Might be missing something here, but why would we need to compute the checksum on the Hub? My current understanding is that we compute checksums locally and compare them against a stored checksum in the database.

  2. I'm not the biggest cryptography expert, but if we do decide to hash individual chunks, a faster hashing function might be needed if the number of chunks to download becomes exceedingly large.

  3. I agree that we should distinguish them in certain scenarios. But if we do decide to distinguish, why not then shift to individual chunk checksums only (since we essentially work on a per-chunk basis)? I have some comments about this above as well.

cwognum and others added 2 commits May 9, 2024 10:38
@cwognum (Collaborator, Author) commented May 9, 2024

Thanks, @Andrewq11.

To answer your questions:

Might be missing something here but why would we need to compute the checksum on the hub?

why not just then shift to individual chunk checksums only

Because we want to know whether the entire dataset was uploaded or downloaded successfully. If not, a Dataset should, for example, not be made public on the Hub. This means we not only need to verify the data integrity of every individual chunk, but also need to check that those chunks are organized as expected within the Zarr archive.

a faster hashing function might be needed depending on if the number of chunks to download becomes exceedingly large.

One issue with changing hashing functions is that we need to make sure the Hub supports the same hashing function. One argument in favor of md5 is that I believe the md5 hash is already returned "for free" in the headers when we fetch a file from the R2 bucket.

@Andrewq11 (Contributor) commented:

Because we want to know if the entire dataset was uploaded / downloaded successfully. If not, a Dataset should for example not be made public on the Hub.

So we would need to download the Zarr archive to the Hub directly? Is this the correct flow?

The client computes the checksum for the Zarr archive -> uploads it to R2 -> the Hub downloads the dataset -> the Hub calculates the checksum and confirms it matches.

If so, I could see this being an issue on the Hub when datasets get sufficiently large. I think we would likely need a dedicated service with enough memory and disk space available to complete the download and perform the check.

This means we don't only need to verify the data integrity of every individual chunk, but we also need to check if those chunks are organized as expected within the Zarr archive.

Makes sense, thank you.

One issue with changing hashing functions is that we need to make sure the Hub supports the same hashing function. One argument in favor of md5 is that I believe the md5 hash is already returned "for free" in the headers when we fetch a file from the R2 bucket.

That would be very handy if so. This seems like something that could be useful to benchmark in the future, to see whether we could get any significant gains from different hashing functions.

@jstlaurent (Contributor) left a comment:

Aside from a few suggestions here, Cas and I did some brainstorming about this feature.

We have two use cases for this feature:

  1. Checking Zarr chunk integrity on download in the client
  2. Checking upload integrity and completion in the Hub

In order to support both, we concluded that we need two things:

Persist the whole Zarr checksum tree in the Hub

We can do this as a JSON structure in the Dataset table. This will allow us to check, on the Hub side, that every part of the Zarr archive has been uploaded, and that every part is correct.

There might be some concern about how big that JSON structure would become for the largest dataset (RXRX3), but we'll have to see how that evolves.
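
For illustration, the persisted tree might look something like the structure below; the field names are invented here and are not the actual Hub schema:

```python
# Hypothetical checksum-tree shape; field names are invented for illustration.
checksum_tree = {
    "digest": "md5:<archive digest>",
    "children": {
        "data": {  # a Zarr group
            "digest": "md5:<group digest>",
            "children": {
                "0.0": {"digest": "md5:<chunk digest>", "size": 1048576},
            },
        },
    },
}
```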

Use checksums on individual files, stored as metadata

Unfortunately, R2 does not support the automated checksum generation for downloaded files that S3 does. So we'll need to work around that by computing each file's checksum prior to upload, setting it as object metadata, and checking it again on download.
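
As a sketch of that workaround, using boto3 against an S3-compatible endpoint such as R2 (the bucket, key, metadata key, and endpoint below are placeholders):

```python
import hashlib

import boto3

# R2 is S3-compatible, so boto3 works with a custom endpoint (placeholder URL).
s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")


def upload_with_checksum(bucket: str, key: str, data: bytes) -> None:
    # Compute the file's md5 prior to upload and store it as object metadata.
    md5 = hashlib.md5(data).hexdigest()
    s3.put_object(Bucket=bucket, Key=key, Body=data, Metadata={"md5": md5})


def download_with_check(bucket: str, key: str) -> bytes:
    obj = s3.get_object(Bucket=bucket, Key=key)
    data = obj["Body"].read()
    expected = obj["Metadata"].get("md5")  # user metadata keys come back lowercased
    if expected and hashlib.md5(data).hexdigest() != expected:
        raise ValueError(f"Checksum mismatch for {key}")
    return data
```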

@cwognum cwognum linked an issue May 14, 2024 that may be closed by this pull request
@cwognum (Collaborator, Author) commented Jun 5, 2024

Just came across this: instead of manually saving the checksum to R2, we could also use a codec like this one: https://numcodecs.readthedocs.io/en/latest/checksum32.html

Although this lets us verify the integrity of individual chunks, I don't think it covers our use case. We would also need to be able to look up the checksums on the Hub to verify that the entire dataset was uploaded correctly.
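
For reference, this is roughly what using such a codec would look like (a minimal sketch with the zarr v2 creation API; the CRC32 filter stores a checksum alongside each chunk and verifies it on decode):

```python
import zarr
from numcodecs import CRC32

# Attach a 32-bit checksum codec as a filter: each chunk is written together
# with a CRC32 that is verified transparently when the chunk is read back.
arr = zarr.open(
    "example.zarr",
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    filters=[CRC32()],
)
arr[:] = 1.0  # checksums are computed per chunk on write and checked on read
```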

@cwognum cwognum changed the title Include a checksum for Zarr archives in the Dataset checksum Draft: Include a checksum for Zarr archives in the Dataset checksum Jun 26, 2024
@cwognum cwognum marked this pull request as draft June 26, 2024 23:01
@cwognum cwognum requested review from jstlaurent and Andrewq11 July 8, 2024 13:51
@jstlaurent (Contributor) left a comment:

Added some questions and comments, but looks good overall to me. Nice work. 😄

@cwognum cwognum mentioned this pull request Jul 10, 2024
@cwognum cwognum requested review from kirahowe and jstlaurent July 12, 2024 00:12
@kirahowe (Contributor) left a comment:

This looks great to me. Nice work.

@jstlaurent (Contributor) left a comment:

Very nice! 😄

cwognum and others added 2 commits July 12, 2024 10:43
@cwognum cwognum merged commit eaa6335 into main Jul 12, 2024
4 checks passed
@cwognum cwognum deleted the feat/zarr-checksum branch July 12, 2024 19:35

Successfully merging this pull request may close these issues.

Enhancement: Improve the checksum for the Dataset (or the equality test)