Skip to content

Fix bug where the checksum of zipfiles is wrong #930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

orenwatson
Copy link
Contributor

@orenwatson orenwatson commented Jan 5, 2022

There is a bug causing incorrect length being written to the zip file, because Zipfile thinks the len() of the passed object is the length in bytes, but it was passing a ndarray, whose len() is the number of rows. The fix is to convert to bytes before passing to zipfile.writestr()

More details on how to reproduce this bug are here:
#931

This bug is caused by incorrect length being written to the file, because Zipfile thinks the len() of the passed object is the length in bytes, but it was passing a ndarray, whose len() is the number of rows. The fix is to convert to bytes before passing to zipfile.writestr()
@joshmoore
Copy link
Member

joshmoore commented Jan 5, 2022

Hi @orenwatson, thanks for this! I've triggered the GH actions to get this tested.

Do you have an idea what a failing test before your fix would look like?

@codecov
Copy link

codecov bot commented Jan 5, 2022

Codecov Report

Merging #930 (a68f7c9) into master (2d4802c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #930   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files          32       32           
  Lines       11216    11222    +6     
=======================================
+ Hits        11210    11216    +6     
  Misses          6        6           
Impacted Files Coverage Δ
zarr/storage.py 100.00% <100.00%> (ø)
zarr/tests/test_storage.py 100.00% <100.00%> (ø)

zarr/storage.py Outdated
@@ -1565,7 +1565,7 @@ def __setitem__(self, key, value):
else:
keyinfo.external_attr = 0o644 << 16 # ?rw-r--r--

self.zf.writestr(keyinfo, value)
self.zf.writestr(keyinfo, value.tobytes())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would introduce an additional copy before writing to disk.

Guessing the issue is ZipFile assumes data here is uint8 even if that may not be the case. Also suspect this only happens when a compressor is not used (otherwise this would already be the case)

Would suggest changing line 1554 above to add .view("u1") to cast to uint8. This should have the same end result (data is represented as uint8), but not cause a copy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll write a test case as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is added now.

@jakirkham
Copy link
Member

Thank Oren! 😄

Had a suggestion above. Could you please also include a test case that expose the original bug (thus confirming this fix)?

@orenwatson
Copy link
Contributor Author

It would I suppose look like writing a numpy array to a ZipStore (with [] = ), closing it, and trying to open it again? I was getting a exception thrown from zipfile library because of a failed cksum

@jakirkham
Copy link
Member

Looking at the code in issue ( #931 ). Wondering what would happen if the file were closed, reopened, and entry "foo" read?

@orenwatson
Copy link
Contributor Author

image
On my machine, it didn't even require reopening the zipstore.

@jakirkham
Copy link
Member

Sounds like a good test case then 🙂

@pep8speaks
Copy link

pep8speaks commented Jan 5, 2022

Hello @orenwatson! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-01-06 09:06:26 UTC

@orenwatson orenwatson force-pushed the orenwatson-patch-zipfile-bug branch from e2e3b5a to 6dee888 Compare January 5, 2022 21:39
Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Oren! 😄

@@ -1549,6 +1549,13 @@ def test_permissions(self):
assert perm == '0o40775'
z.close()

def test_store_and_retrieve_ndarray(self):
store = ZipStore('data/store.zip')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure why some of these tests use data/store.zip explicitly here, but I think for your test, @orenwatson, it would be best to use the self.create_store() method so that you are testing a fresh zip, independently of other tests.

I can confirm though that this test fails without this PR and passes with it. 👍

NB: A second run with this PR raises the warning:

/usr/local/anaconda3/envs/z/lib/python3.9/zipfile.py:1505: UserWarning: Duplicate name: 'foo'
  return self._open_to_write(zinfo, force_zip64=force_zip64)

which is what makes me think that create_store would be safer.

Copy link
Member

@jakirkham jakirkham Jan 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point Josh 👍

I think with test_permissions this is needed because the file needs to be flushed/closed and then reopened for the test (so it doesn't follow the norm). Though agree we should be able to use create_store here

Edit: As an aside I think this all precedes moving to pytest. So we can probably do some maintenance on this kind of thing to improve the ergonomics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. If the test is passing, I say we release @orenwatson (and add a grand "Huzzah!" in thanks 🎉) and we can handle the cleanup separately.

@joshmoore joshmoore merged commit 22ded1d into zarr-developers:master Jan 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants