Include a checksum for Zarr archives in the Dataset checksum #102


Merged: 35 commits merged into main from feat/zarr-checksum on Jul 12, 2024

Conversation

@cwognum (Collaborator) commented May 8, 2024

Changelogs

  • Integrated a checksum for Zarr archives into the Polaris Dataset checksum.
  • Made it possible to lazily compute the checksum for an entire dataset or benchmark.
  • Added verification of the checksum of each Zarr file on download.
  • Removed the broken ls caching from the PolarisFileSystem.
    • This slows down uploads and downloads, but we are working on a new JWT-based system anyway.
  • Added debug logs and set the default logging level to INFO.
  • In accordance with the Apache 2 license, added separate NOTICE and LICENSE files.

Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix, or test (or ask a maintainer to do it for you).

How it works

The Polaris implementation is heavily based on the zarr-checksum package. This algorithm computes a checksum per file in the Zarr archive and then works its way up the file system hierarchy in a deterministic order to compute a checksum for the entire archive.

Note

This implies the checksum can change even if the content doesn't, for example when you rechunk the dataset. This fits the data integrity use case of Polaris.
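
To make the idea concrete, here is a simplified sketch of a tree-style checksum. It is not the exact zarr-checksum algorithm (which aggregates digests per directory), but it shows why the result depends on file layout: hashing the sorted (path, digest) pairs means rechunking changes the checksum even when the array values stay the same.

```python
import hashlib
from pathlib import Path


def checksum_zarr_archive(root: str) -> str:
    """Simplified sketch: hash every file in the archive, then combine
    the (relative path, file digest) pairs in a deterministic order."""
    root_path = Path(root)
    combined = hashlib.md5()
    # Sorting makes the traversal order deterministic across machines.
    for path in sorted(p for p in root_path.rglob("*") if p.is_file()):
        file_md5 = hashlib.md5(path.read_bytes()).hexdigest()
        combined.update(str(path.relative_to(root_path)).encode())
        combined.update(file_md5.encode())
    return combined.hexdigest()
```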

The checksum per file is saved in a manifest, which is stored in the Hub DB and used by the Hub to verify the integrity of the archive on upload.

Data integrity is verified in two use cases:

  • When downloading a single chunk from the Zarr archive, the md5sum of that single chunk is retrieved from the headers and verified (see the sketch after this list).
  • When downloading the entire Zarr archive, the checksum for the entire dataset is computed and verified.
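
As a rough illustration of the first case, a download could compare the md5 of the received bytes against a digest advertised by the server. This is a hypothetical sketch, not the Polaris client code; the header name is an assumption (for non-multipart S3-style uploads, the ETag happens to equal the object's md5):

```python
import hashlib

import requests


def download_chunk_verified(url: str) -> bytes:
    """Hypothetical sketch: fetch one Zarr chunk and verify its md5
    against a digest taken from the response headers."""
    response = requests.get(url)
    response.raise_for_status()
    # Assumed header; for non-multipart uploads the S3-style ETag is the md5.
    expected_md5 = response.headers.get("ETag", "").strip('"')
    actual_md5 = hashlib.md5(response.content).hexdigest()
    if expected_md5 and actual_md5 != expected_md5:
        raise ValueError(f"Checksum mismatch: {actual_md5} != {expected_md5}")
    return response.content
```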


cwognum added 2 commits May 8, 2024 15:48
@cwognum cwognum added the feature Annotates any PR that adds new features; Used in the release process label May 8, 2024
@cwognum cwognum self-assigned this May 8, 2024
@Andrewq11 (Contributor) commented:

This looks pretty good to me. Specifically, the compute_zarr_checksum method you mentioned is where the bulk of our custom code lives. I left some general comments and clarifying questions above. As for your questions:

  1. Might be missing something here, but why would we need to compute the checksum on the Hub? My current understanding is that we compute checksums locally and compare them against a stored checksum in the database.

  2. I'm not the biggest cryptography expert, but if we do decide to hash individual chunks, a faster hashing function might be needed if the number of chunks to download becomes exceedingly large.

  3. I agree that we should distinguish them in certain scenarios. But if we do decide to distinguish, why not then shift to individual chunk checksums only (since we essentially work on a per-chunk basis)? I have some comments about this above as well.

cwognum and others added 2 commits May 9, 2024 10:38
@cwognum (Collaborator, Author) commented May 9, 2024

Thanks, @Andrewq11.

To answer your questions:

Might be missing something here but why would we need to compute the checksum on the hub?

why not just then shift to individual chunk checksums only

Because we want to know whether the entire dataset was uploaded or downloaded successfully. If not, a Dataset should, for example, not be made public on the Hub. This means we not only need to verify the data integrity of every individual chunk, but also need to check that those chunks are organized as expected within the Zarr archive.

a faster hashing function might be needed depending on if the number of chunks to download becomes exceedingly large.

One issue with changing hashing functions is that we need to make sure the Hub supports the same hashing function. One argument in favor of md5 is that I believe the md5 hash is already returned "for free" in the headers when we fetch a file from the R2 bucket.

@Andrewq11 (Contributor) commented:

Because we want to know if the entire dataset was uploaded / downloaded successfully. If not, a Dataset should for example not be made public on the Hub.

So we would need to download the Zarr archive to the Hub directly? Is this the correct flow?

The client computes the checksum for the Zarr archive -> uploads it to R2 -> the Hub downloads the dataset -> the Hub calculates the checksum and confirms it matches.

If so, I could see this being an issue on the Hub when datasets get sufficiently large. I think we would likely need a dedicated service with enough memory and disk space available to complete the download and perform the check.

This means we don't only need to verify the data integrity of every individual chunk, but we also need to check if those chunks are organized as expected within the Zarr archive.

Makes sense, thank you.

One issue with changing hashing functions is that we need to make sure the Hub supports the same hashing function. One argument in favor of md5 is that I believe the md5 hash is already returned "for free" in the headers when we fetch a file from the R2 bucket.

That would be very handy if so. This seems like something that could be useful to benchmark in the future, to see whether we could get any significant gains from different hashing functions.

@jstlaurent (Contributor) left a comment:

Aside from a few suggestions here, Cas and I did some brainstorming about this feature.

We have two use cases for this feature:

  1. Checking Zarr chunk integrity on download in the client
  2. Checking upload integrity and completion in the Hub

In order to support both, we concluded that we need two things:

Persist the whole Zarr checksum tree in the Hub

We can do this as a JSON structure in the Dataset table. This will allow us to check, on the Hub side, that every part of the Zarr archive has been uploaded, and that every part is correct.

There might be some concern about how big that JSON structure would become for the largest dataset (RXRX3), but we'll have to see how that evolves.
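
For illustration, the persisted tree might look something like the structure below; the field names are invented here and are not the actual Hub schema:

```python
# Hypothetical checksum-tree shape; field names are invented for illustration.
checksum_tree = {
    "digest": "md5:<archive digest>",
    "children": {
        "data": {  # a Zarr group
            "digest": "md5:<group digest>",
            "children": {
                "0.0": {"digest": "md5:<chunk digest>", "size": 1048576},
            },
        },
    },
}
```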

Use checksums on individual files, stored as metadata

Unfortunately, R2 does not support the automated checksum generation for downloaded files that S3 does. So we'll need to work around that by computing each file's checksum prior to upload, setting it as object metadata, and checking it again on download.
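
As a sketch of that workaround, using boto3 against an S3-compatible endpoint such as R2 (the bucket, key, metadata key, and endpoint below are placeholders):

```python
import hashlib

import boto3

# R2 is S3-compatible, so boto3 works with a custom endpoint (placeholder URL).
s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")


def upload_with_checksum(bucket: str, key: str, data: bytes) -> None:
    # Compute the file's md5 prior to upload and store it as object metadata.
    md5 = hashlib.md5(data).hexdigest()
    s3.put_object(Bucket=bucket, Key=key, Body=data, Metadata={"md5": md5})


def download_with_check(bucket: str, key: str) -> bytes:
    obj = s3.get_object(Bucket=bucket, Key=key)
    data = obj["Body"].read()
    expected = obj["Metadata"].get("md5")  # user metadata keys come back lowercased
    if expected and hashlib.md5(data).hexdigest() != expected:
        raise ValueError(f"Checksum mismatch for {key}")
    return data
```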

@cwognum cwognum linked an issue May 14, 2024 that may be closed by this pull request
@cwognum (Collaborator, Author) commented Jun 5, 2024

Just came across this: instead of manually saving the checksum to R2, we could also use a codec like this one: https://numcodecs.readthedocs.io/en/latest/checksum32.html

Although this lets us verify the integrity of individual chunks, I don't think it covers our use case. We would also need to be able to look up the checksums on the Hub to verify that the entire dataset was uploaded correctly.
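
For reference, this is roughly what using such a codec would look like (a minimal sketch with the zarr v2 creation API; the CRC32 filter stores a checksum alongside each chunk and verifies it on decode):

```python
import zarr
from numcodecs import CRC32

# Attach a 32-bit checksum codec as a filter: each chunk is written together
# with a CRC32 that is verified transparently when the chunk is read back.
arr = zarr.open(
    "example.zarr",
    mode="w",
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype="f4",
    filters=[CRC32()],
)
arr[:] = 1.0  # checksums are computed per chunk on write and checked on read
```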

@cwognum cwognum changed the title Include a checksum for Zarr archives in the Dataset checksum Draft: Include a checksum for Zarr archives in the Dataset checksum Jun 26, 2024
@cwognum cwognum marked this pull request as draft June 26, 2024 23:01
@cwognum cwognum requested review from jstlaurent and Andrewq11 July 8, 2024 13:51
@jstlaurent (Contributor) left a comment:

Added some questions and comments, but looks good overall to me. Nice work. 😄

@cwognum cwognum mentioned this pull request Jul 10, 2024
@cwognum cwognum requested review from kirahowe and jstlaurent July 12, 2024 00:12
@kirahowe (Contributor) left a comment:

This looks great to me. Nice work.

@jstlaurent (Contributor) left a comment:

Very nice! 😄

cwognum and others added 2 commits July 12, 2024 10:43
@cwognum cwognum merged commit eaa6335 into main Jul 12, 2024
4 checks passed
@cwognum cwognum deleted the feat/zarr-checksum branch July 12, 2024 19:35

Successfully merging this pull request may close these issues.

Enhancement: Improve the checksum for the Dataset (or the equality test)