Extension for data integrity #75

Closed

rabernat opened this issue Jun 3, 2020 · 10 comments

@rabernat
Contributor

rabernat commented Jun 3, 2020

In a recent conversation with @balaji-gfdl, we discussed the importance of data integrity. Verifying data integrity is crucial for many data providers. For standard single-file-based formats (e.g. netcdf, hdf5, csv), the md5 checksum is the gold standard for verifying that a file is binary-identical after network transmission. What best practice for data integrity should we recommend for Zarr data?

Fletcher checksums have been proposed as a possible filter / compressor option in #38. AFAIU, these would tell us whether a single chunk has been corrupted in transit. They do not address some broader questions, such as:

  • How do we verify that two arrays are the same (down to the bit) in a chunked Zarr store and a legacy data format?
  • How do we verify that two arrays with different chunk structure are the same (down to the bit)?

There is obviously an expensive way to do this: open each array in Python and call np.testing.assert_array_equal. This is a lot more expensive and inconvenient than just running md5 on the command line.
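
Spelled out on toy stand-in data, that expensive comparison looks roughly like this (with real data the inputs would come from the legacy file and the Zarr store):

import numpy as np
import zarr

legacy = np.ones((1000, 1000))                        # stand-in for data read from netCDF/HDF5
z = zarr.ones(shape=(1000, 1000), chunks=(100, 100))  # stand-in for the Zarr copy

# Loads both arrays fully into memory and compares element-wise; raises on any mismatch.
np.testing.assert_array_equal(legacy, z[:])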

It seems like this is a question that a certain type of computer scientist may already know the answer to: some sort of hierarchical checksum, an extension of the Fletcher algorithm, etc.

Whatever the technical solution, it seems essential that we have some sort of answer to this common question.

@rabernat changed the title from "Extensions for data integrity" to "Extension for data integrity" on Jun 3, 2020
@joshmoore
Member

see also: zarr-developers/zarr-python#392

@rabernat
Contributor Author

rabernat commented Jun 3, 2020

Thanks for the xref! I should have checked the other issue tracker before posting...

@balaji-gfdl

Thanks for raising this, @rabernat. It is a rather critical issue for those of us who have vast legacy data stores that we might want to bring forward.

The issue raised above by @joshmoore covers chunk-level checksums, but as Ryan says, we want checksums that remain identical across rechunking and that can be compared against legacy data.

For legacy data, we may want to verify array integrity and also metadata integrity. md5sum on a file guarantees both, but not separately. nccmp (https://gitlab.com/remikz/nccmp) allows you to compare data or metadata separately.

For performance, a good reference number is that a simple md5sum on a file runs at 0.5 GB/s on my laptop. An archive like CMIP6 contains maybe 10 PB of data, so a serial pass at that rate would take on the order of 230 days.

@jakirkham
Member

jakirkham commented Jun 9, 2020

I'm guessing you are already familiar with .digest(...) and .hexdigest(...).

@Carreau
Contributor

Carreau commented Jun 9, 2020

digest/hexdigest/sha1/md5 are all (based on) cryptographic hashes, which usually trade speed for robustness. Unless you are worried about an attacker actually trying to compromise your data, you really do want to go for non-crypto hashes.

In particular, you do want to know how the hash changes when only a subset of the data is modified. This is usually achieved by computing hashes over blocks in a tree-like structure, with each block hashed independently of computation order (but dependent on its position).

For example, xxHash is about 10x faster than md5. MurmurHash has also been mentioned in the other thread. One of the issues with storing the hash in the attribute file is locking, as you likely want it per chunk. And if you want to store per-group hashes, then you also need an atomic update of the group-level metadata.
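
As a rough sketch of the per-chunk approach (assuming the third-party xxhash package; nothing here is an existing zarr feature), one could hash each decoded chunk independently:

import itertools
import xxhash
import zarr

def per_chunk_hashes(z):
    # Map each chunk's grid origin to an xxh64 digest of its decoded data.
    ranges = [range(0, s, c) for s, c in zip(z.shape, z.chunks)]
    hashes = {}
    for origin in itertools.product(*ranges):
        sel = tuple(slice(o, o + c) for o, c in zip(origin, z.chunks))
        hashes[origin] = xxhash.xxh64(z[sel].tobytes()).hexdigest()
    return hashes

z = zarr.ones(shape=(4000, 4000), chunks=(1000, 1000))
print(len(per_chunk_hashes(z)))  # 16 chunks, 16 digests

Each chunk can be hashed independently (and hence in parallel), but the result is tied to one particular chunking.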

@jakirkham
Member

All good points. It's certainly possible to extend these to other hashes as needed. I guess the first question is whether these functions are close to our intended use case here. If not, what would we want to see instead?

@rabernat
Contributor Author

I just learned about homomorphic hashing:

A homomorphic hash is a construction that's simple in principle: a hash function such that you can compute the hash of a composite block from the hashes of the individual blocks.

There is also a paper about it
https://ieeexplore.ieee.org/document/1301326
https://www.cs.princeton.edu/~mfreed/docs/authcodes-sp04.pdf
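
To make the combine-from-parts property concrete, here is a toy, non-cryptographic sketch (my own illustration, not taken from the papers above): each element contributes a term keyed by its global position, and the terms combine by wrapping addition, so the hash of the whole equals the combination of the hashes of any partition into blocks.

import numpy as np

def block_term(values, flat_offset):
    # Contribution of one contiguous run of the flattened (C-order) array. Each
    # element is mixed with its *global* position, so the term does not depend on
    # where the block boundaries fall. NOT a vetted hash -- illustration only.
    idx = np.arange(flat_offset, flat_offset + values.size, dtype=np.uint64)
    bits = np.ascontiguousarray(values).view(np.uint64)  # raw float64 bits
    mixed = (idx * np.uint64(0x9E3779B97F4A7C15)) ^ (bits * np.uint64(0xC2B2AE3D27D4EB4F))
    return int(np.sum(mixed, dtype=np.uint64))  # wraps mod 2**64

data = np.random.RandomState(0).rand(1_000_000)

whole = block_term(data, 0) % 2**64
parts = sum(block_term(data[s:s + 300_000], s)
            for s in range(0, data.size, 300_000)) % 2**64
assert whole == parts  # hash of the composite == combined hashes of its parts

Computed per Zarr chunk from each element's global offsets, the per-chunk terms would likewise sum to the same array-level value regardless of the chunk grid.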

@rabernat
Contributor Author

rabernat commented Jan 27, 2021

I'm guessing you are already familiar with .digest(...) and .hexdigest(...).

These don't do quite what I want:

import zarr
z1 = zarr.ones(shape=(10000, 10000), chunks=(1000, 1000))
z2 = zarr.ones(shape=(10000, 10000), chunks=(2000, 2000))
assert z1.digest() == z2.digest()  # fails

I'm looking for a hash that is independent of chunks, such that I can rechunk the array and then verify its integrity.
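
One workaround sketch (not an existing zarr API) is to stream the array through a standard digest in a fixed, canonical order, ignoring the store's chunking:

import hashlib
import numpy as np
import zarr

def canonical_digest(z, rows_per_block=1000):
    # Digest of the array's bytes read in C order, a block of rows at a time,
    # so the result does not depend on the underlying chunk layout.
    h = hashlib.sha256()
    for start in range(0, z.shape[0], rows_per_block):
        h.update(np.ascontiguousarray(z[start:start + rows_per_block]).tobytes())
    return h.hexdigest()

z1 = zarr.ones(shape=(10000, 10000), chunks=(1000, 1000))
z2 = zarr.ones(shape=(10000, 10000), chunks=(2000, 2000))
assert canonical_digest(z1) == canonical_digest(z2)  # passes: same data, different chunks

The cost is a full, essentially serial read of the data, which is part of what makes a combinable per-block hash attractive.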

@joshmoore
Member

I'm just reaching this point as well, @rabernat. Did you come up with a solution?

While trying to work this into the codec/filtering code, my current best idea is to create a secondary array which stores hash sums for some chunking of the primary array. Checking those checksums may be suboptimal if a new chunking has been chosen for the primary array, but at least it will be possible to calculate the checksums in parallel.
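
A rough sketch of that idea (hypothetical helper, not an existing zarr feature): a companion array holding one CRC32 per chunk of the primary array, with one entry per chunk of the companion so the checksums can be written independently and in parallel.

import itertools
import zlib
import numpy as np
import zarr

def build_checksum_array(z):
    # One uint32 CRC per chunk of the primary array; each checksum lives in its own
    # chunk of the companion array so parallel writers do not contend.
    grid = tuple(-(-s // c) for s, c in zip(z.shape, z.chunks))  # ceil(shape / chunks)
    sums = zarr.zeros(shape=grid, chunks=(1,) * len(grid), dtype="uint32")
    for gidx in itertools.product(*(range(g) for g in grid)):
        sel = tuple(slice(i * c, (i + 1) * c) for i, c in zip(gidx, z.chunks))
        sums[gidx] = zlib.crc32(np.ascontiguousarray(z[sel]).tobytes())
    return sums

z = zarr.ones(shape=(4000, 4000), chunks=(1000, 1000))
print(build_checksum_array(z)[:])  # 4x4 grid of per-chunk CRC32 values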

@rabernat
Contributor Author

rabernat commented Jul 30, 2021

I will close this as a duplicate of zarr-developers/zarr-python#392. We can continue the discussion there.
