Extension for data integrity #75

Closed

rabernat opened this issue Jun 3, 2020 · 10 comments

@rabernat
Contributor

rabernat commented Jun 3, 2020

In a recent conversation with @balaji-gfdl, we discussed the importance of data integrity. Verifying data integrity is crucial for many data providers. For standard single-file-based formats (e.g. netcdf, hdf5, csv), the md5 checksum is the gold standard for verifying that a file is binary-identical after network transmission. What best practice for data integrity should we recommend for Zarr data?

Fletcher checksums have been proposed as a possible filter / compressor option in #38. AFAIU, these would tell us whether a single chunk has been corrupted in transit. They do not address some broader questions, such as:

  • How do we verify that two arrays are the same (down to the bit) in a chunked Zarr store and a legacy data format?
  • How do we verify that two arrays with different chunk structure are the same (down to the bit)?

There is obviously an expensive way to do this: open each array in Python and call np.testing.assert_array_equal. This is a lot more expensive and inconvenient than just running md5 on the command line.
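
Spelled out on toy stand-in data, that expensive comparison looks roughly like this (with real data the inputs would come from the legacy file and the Zarr store):

import numpy as np
import zarr

legacy = np.ones((1000, 1000))                        # stand-in for data read from netCDF/HDF5
z = zarr.ones(shape=(1000, 1000), chunks=(100, 100))  # stand-in for the Zarr copy

# Loads both arrays fully into memory and compares element-wise; raises on any mismatch.
np.testing.assert_array_equal(legacy, z[:])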

It seems like this is a question that a certain type of computer scientist may already know the answer to: some sort of hierarchical checksum, an extension of the Fletcher algorithm, etc.

Whatever the technical solution, it seems essential that we have some sort of answer to this common question.

@rabernat changed the title from "Extensions for data integrity" to "Extension for data integrity" on Jun 3, 2020
@joshmoore
Member

see also: zarr-developers/zarr-python#392

@rabernat
Contributor Author

rabernat commented Jun 3, 2020

Thanks for the xref! I should have checked the other issue tracker before posting...

@balaji-gfdl

Thanks for raising this, @rabernat. It is a rather critical issue for those of us who have vast legacy data stores that we might want to bring forward.

The issue raised above by @joshmoore covers chunk-level checksums, but as Ryan says, we want checksums that remain identical across rechunking and that can be compared against legacy data.

For legacy data, we may want to verify array integrity and also metadata integrity. md5sum on a file guarantees both, but not separately. nccmp (https://gitlab.com/remikz/nccmp) allows you to compare data or metadata separately.

For performance, a good reference number is that a simple md5sum on a file runs at 0.5 GB/s on my laptop. An archive like CMIP6 contains maybe 10 PB of data, so a serial pass at that rate would take on the order of 230 days.

@jakirkham
Member

jakirkham commented Jun 9, 2020

I'm guessing you are already familiar with .digest(...) and .hexdigest(...).

@Carreau
Contributor

Carreau commented Jun 9, 2020

digest/hexdigest/sha1/md5 are all (based on) cryptographic hashes, which usually trade speed for robustness. Unless you are worried about an attacker actually trying to compromise your data, you really do want to go for non-crypto hashes.

In particular, you do want to know how the hash changes when only a subset of the data is modified. This is usually achieved by computing hashes over blocks in a tree-like structure, with each block hashed independently of computation order (but dependent on its position).

For example, xxHash is about 10x faster than md5. MurmurHash has also been mentioned in the other thread. One of the issues with storing the hash in the attribute file is locking, as you likely want it per chunk. And if you want to store per-group hashes, then you also need an atomic update of the group-level metadata.
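
As a rough sketch of the per-chunk approach (assuming the third-party xxhash package; nothing here is an existing zarr feature), one could hash each decoded chunk independently:

import itertools
import xxhash
import zarr

def per_chunk_hashes(z):
    # Map each chunk's grid origin to an xxh64 digest of its decoded data.
    ranges = [range(0, s, c) for s, c in zip(z.shape, z.chunks)]
    hashes = {}
    for origin in itertools.product(*ranges):
        sel = tuple(slice(o, o + c) for o, c in zip(origin, z.chunks))
        hashes[origin] = xxhash.xxh64(z[sel].tobytes()).hexdigest()
    return hashes

z = zarr.ones(shape=(4000, 4000), chunks=(1000, 1000))
print(len(per_chunk_hashes(z)))  # 16 chunks, 16 digests

Each chunk can be hashed independently (and hence in parallel), but the result is tied to one particular chunking.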

@jakirkham
Member

All good points. It's certainly possible to extend these to other hashes as needed. I guess the first question is whether these functions are close to our intended use case here. If not, what would we want to see instead?

@rabernat
Contributor Author

I just learned about homomorphic hashing:

A homomorphic hash is a construction that's simple in principle: a hash function such that you can compute the hash of a composite block from the hashes of the individual blocks.

There is also a paper about it
https://ieeexplore.ieee.org/document/1301326
https://www.cs.princeton.edu/~mfreed/docs/authcodes-sp04.pdf
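
To make the combine-from-parts property concrete, here is a toy, non-cryptographic sketch (my own illustration, not taken from the papers above): each element contributes a term keyed by its global position, and the terms combine by wrapping addition, so the hash of the whole equals the combination of the hashes of any partition into blocks.

import numpy as np

def block_term(values, flat_offset):
    # Contribution of one contiguous run of the flattened (C-order) array. Each
    # element is mixed with its *global* position, so the term does not depend on
    # where the block boundaries fall. NOT a vetted hash -- illustration only.
    idx = np.arange(flat_offset, flat_offset + values.size, dtype=np.uint64)
    bits = np.ascontiguousarray(values).view(np.uint64)  # raw float64 bits
    mixed = (idx * np.uint64(0x9E3779B97F4A7C15)) ^ (bits * np.uint64(0xC2B2AE3D27D4EB4F))
    return int(np.sum(mixed, dtype=np.uint64))  # wraps mod 2**64

data = np.random.RandomState(0).rand(1_000_000)

whole = block_term(data, 0) % 2**64
parts = sum(block_term(data[s:s + 300_000], s)
            for s in range(0, data.size, 300_000)) % 2**64
assert whole == parts  # hash of the composite == combined hashes of its parts

Computed per Zarr chunk from each element's global offsets, the per-chunk terms would likewise sum to the same array-level value regardless of the chunk grid.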

@rabernat
Contributor Author

rabernat commented Jan 27, 2021

I'm guessing you are already familiar with .digest(...) and .hexdigest(...).

These don't do quite what I want:

import zarr
z1 = zarr.ones(shape=(10000, 10000), chunks=(1000, 1000))
z2 = zarr.ones(shape=(10000, 10000), chunks=(2000, 2000))
assert z1.digest() == z2.digest()  # fails

I'm looking for a hash that is independent of chunks, such that I can rechunk the array and then verify its integrity.
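
One workaround sketch (not an existing zarr API) is to stream the array through a standard digest in a fixed, canonical order, ignoring the store's chunking:

import hashlib
import numpy as np
import zarr

def canonical_digest(z, rows_per_block=1000):
    # Digest of the array's bytes read in C order, a block of rows at a time,
    # so the result does not depend on the underlying chunk layout.
    h = hashlib.sha256()
    for start in range(0, z.shape[0], rows_per_block):
        h.update(np.ascontiguousarray(z[start:start + rows_per_block]).tobytes())
    return h.hexdigest()

z1 = zarr.ones(shape=(10000, 10000), chunks=(1000, 1000))
z2 = zarr.ones(shape=(10000, 10000), chunks=(2000, 2000))
assert canonical_digest(z1) == canonical_digest(z2)  # passes: same data, different chunks

The cost is a full, essentially serial read of the data, which is part of what makes a combinable per-block hash attractive.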

@joshmoore
Member

I'm just reaching this point as well, @rabernat. Did you come up with a solution?

While trying to work this into the codec/filtering code, my current best idea is to create a secondary array which stores hash sums for some chunking of the primary array. Checking those checksums may be suboptimal if a new chunking has been chosen for the primary array, but at least it will be possible to calculate the checksums in parallel.
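
A rough sketch of that idea (hypothetical helper, not an existing zarr feature): a companion array holding one CRC32 per chunk of the primary array, with one entry per chunk of the companion so the checksums can be written independently and in parallel.

import itertools
import zlib
import numpy as np
import zarr

def build_checksum_array(z):
    # One uint32 CRC per chunk of the primary array; each checksum lives in its own
    # chunk of the companion array so parallel writers do not contend.
    grid = tuple(-(-s // c) for s, c in zip(z.shape, z.chunks))  # ceil(shape / chunks)
    sums = zarr.zeros(shape=grid, chunks=(1,) * len(grid), dtype="uint32")
    for gidx in itertools.product(*(range(g) for g in grid)):
        sel = tuple(slice(i * c, (i + 1) * c) for i, c in zip(gidx, z.chunks))
        sums[gidx] = zlib.crc32(np.ascontiguousarray(z[sel]).tobytes())
    return sums

z = zarr.ones(shape=(4000, 4000), chunks=(1000, 1000))
print(build_checksum_array(z)[:])  # 4x4 grid of per-chunk CRC32 values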

@rabernat
Contributor Author

rabernat commented Jul 30, 2021

I will close this as a duplicate of zarr-developers/zarr-python#392. We can continue the discussion there.
