Fix bug where the checksum of zipfiles is wrong #930
This bug is caused by an incorrect length being written to the file: ZipFile treats the len() of the passed object as its length in bytes, but it was being passed an ndarray, whose len() is the number of rows. The fix is to convert to bytes before passing to zipfile.writestr().
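The mismatch described above can be seen without any zip file at all. The following is a minimal sketch, assuming NumPy is installed; the array shape and dtype are arbitrary choices made for illustration and do not come from the PR.

```python
import numpy as np

# Illustrative array: 3 rows of 4 float64 values (8 bytes each).
arr = np.arange(12, dtype="float64").reshape(3, 4)

n_rows = len(arr)         # len() of an ndarray counts rows, not bytes
n_bytes = arr.nbytes      # actual size of the underlying data in bytes
as_bytes = arr.tobytes()  # a bytes copy whose len() matches the byte count

print(n_rows, n_bytes, len(as_bytes))
```

Any code that uses len() as a byte count will under-report the size of such an array, which is the kind of discrepancy the description attributes to the checksum failure.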
Hi @orenwatson, thanks for this! I've triggered the GH actions to get this tested. Do you have an idea what a failing test before your fix would look like?
Codecov Report
@@ Coverage Diff @@
## master #930 +/- ##
=======================================
Coverage 99.94% 99.94%
=======================================
Files 32 32
Lines 11216 11222 +6
=======================================
+ Hits 11210 11216 +6
Misses 6 6
zarr/storage.py
Outdated
@@ -1565,7 +1565,7 @@ def __setitem__(self, key, value):
             else:
                 keyinfo.external_attr = 0o644 << 16  # ?rw-r--r--

-            self.zf.writestr(keyinfo, value)
+            self.zf.writestr(keyinfo, value.tobytes())
This would introduce an additional copy before writing to disk.
Guessing the issue is that ZipFile assumes data here is uint8 even if that may not be the case. Also suspect this only happens when a compressor is not used (otherwise this would already be the case).
Would suggest changing line 1554 above to add .view("u1") to cast to uint8. This should have the same end result (data is represented as uint8), but not cause a copy.
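The difference between the two approaches can be sketched as follows. This assumes NumPy is installed and that the value is already a one-dimensional, C-contiguous array at that point in the code (an assumption, not something shown in this thread); on a multi-dimensional array, .view("u1") only rescales the last axis, so len() would still count rows.

```python
import numpy as np

# Illustrative 1-D contiguous array: 6 int32 values, 4 bytes each.
arr = np.arange(6, dtype="int32")

as_u1 = arr.view("u1")   # reinterpret the same buffer as uint8, no copy
copied = arr.tobytes()   # allocates a new bytes object holding a copy

shares_buffer = np.shares_memory(arr, as_u1)  # True: the view aliases arr
view_len = len(as_u1)                         # now equals the byte count

print(shares_buffer, view_len, arr.nbytes, len(copied))
```

Both forms report a len() equal to the number of bytes, but only the view avoids duplicating the data before it is written.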
Good point. I'll write a test case as well.
Test is added now.
Thanks Oren! 😄 Had a suggestion above. Could you please also include a test case that exposes the original bug (thus confirming this fix)?
I suppose it would look like writing a numpy array to a ZipStore (with [] = ), closing it, and trying to open it again? I was getting an exception thrown from the zipfile library because of a failed checksum.
Looking at the code in issue ( #931 ). Wondering what would happen if the file were closed, reopened, and entry
Sounds like a good test case then 🙂
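The write, close, reopen round trip discussed above can be sketched with the standard library alone. This is not the test added in the PR: it uses zipfile directly rather than ZipStore, an in-memory buffer instead of a file on disk, and a hypothetical entry name and array chosen for illustration. NumPy is assumed to be installed.

```python
import io
import zipfile

import numpy as np

arr = np.arange(10, dtype="int64")  # illustrative payload

# Write the array's bytes as one entry, then close the archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, mode="w") as zf:
    zf.writestr("foo", arr.tobytes())

# Reopen and verify: testzip() returns None when every entry's CRC matches.
buf.seek(0)
with zipfile.ZipFile(buf, mode="r") as zf:
    first_bad_entry = zf.testzip()
    raw = zf.read("foo")

restored = np.frombuffer(raw, dtype="int64")
print(first_bad_entry, np.array_equal(restored, arr))
```

A regression test in the project would presumably go through the store's own item assignment and lookup instead, so that the code path changed by this PR is the one exercised.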
Hello @orenwatson! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2022-01-06 09:06:26 UTC
e2e3b5a to 6dee888
LGTM. Thanks Oren! 😄
@@ -1549,6 +1549,13 @@ def test_permissions(self):
         assert perm == '0o40775'
         z.close()

+    def test_store_and_retrieve_ndarray(self):
+        store = ZipStore('data/store.zip')
I'm not actually sure why some of these tests use data/store.zip explicitly here, but I think for your test, @orenwatson, it would be best to use the self.create_store() method so that you are testing a fresh zip, independently of other tests.
I can confirm though that this test fails without this PR and passes with it. 👍
NB: A second run with this PR raises the warning:
/usr/local/anaconda3/envs/z/lib/python3.9/zipfile.py:1505: UserWarning: Duplicate name: 'foo'
  return self._open_to_write(zinfo, force_zip64=force_zip64)
which is what makes me think that create_store would be safer.
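The warning quoted above is what zipfile emits when an entry name is written twice into the same archive. The sketch below reproduces that with the standard library only, assuming the archive is reopened in append mode on each run; write_foo and the file name are hypothetical helpers for illustration, not part of the test suite.

```python
import os
import tempfile
import warnings
import zipfile


def write_foo(path):
    """Hypothetical stand-in for one test run: append entry 'foo' to the zip."""
    with zipfile.ZipFile(path, mode="a") as zf:
        zf.writestr("foo", b"payload")


with tempfile.TemporaryDirectory() as tmp:
    shared_path = os.path.join(tmp, "store.zip")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        write_foo(shared_path)  # first run: archive is created, no warning
        write_foo(shared_path)  # second run: 'foo' already exists
    duplicate_warnings = [
        w for w in caught if "Duplicate name" in str(w.message)
    ]

print(len(duplicate_warnings))
```

Writing to a fresh archive per test, as suggested with create_store, avoids carrying entries over from an earlier run.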
Good point Josh 👍
I think with test_permissions this is needed because the file needs to be flushed/closed and then reopened for the test (so it doesn't follow the norm). Though I agree we should be able to use create_store here.
Edit: As an aside, I think this all predates the move to pytest. So we can probably do some maintenance on this kind of thing to improve the ergonomics.
Makes sense. If the test is passing, I say we release @orenwatson (and add a grand "Huzzah!" in thanks 🎉) and we can handle the cleanup separately.
There is a bug that causes an incorrect length to be written to the zip file: ZipFile treats the len() of the passed object as its length in bytes, but it was being passed an ndarray, whose len() is the number of rows. The fix is to convert to bytes before passing to zipfile.writestr().
More details on how to reproduce this bug are here:
#931