-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Opening from zarr.ZipStore fails to read (store???) unicode characters #3815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi @hmaarrfk , I don't know this area of code well but attempting to ensure that we resolve the question. Should this be an xarray or a zarr issue? Is there something you think xarray is doing incorrectly here? |
My guess is that that xarray might be trying to write to the store character by character??? Otherwise, not too sure. |
Do you have a reproducible example? Otherwise I'm not sure how to help resolve... |
I thought I provided it, but in either case, here is my traceback In [3]: import xarray as xr
...: import zarr
...: x = xr.Dataset()
...: x['hello'] = 'world'
...: x
...: with zarr.ZipStore('test_store.zip', mode='w') as store:
...: x.to_zarr(store)
...: with zarr.ZipStore('test_store.zip', mode='r') as store:
...: x_read = xr.open_zarr(store).compute()
...:
---------------------------------------------------------------------------
BadZipFile Traceback (most recent call last)
<ipython-input-3-5ad5a0456766> in <module>
7 x.to_zarr(store)
8 with zarr.ZipStore('test_store.zip', mode='r') as store:
----> 9 x_read = xr.open_zarr(store).compute()
10
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, **kwargs)
805 """
806 new = self.copy(deep=False)
--> 807 return new.load(**kwargs)
808
809 def _persist_inplace(self, **kwargs) -> "Dataset":
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, **kwargs)
657 for k, v in self.variables.items():
658 if k not in lazy_data:
--> 659 v.load()
660
661 return self
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, **kwargs)
373 self._data = as_compatible_data(self._data.compute(**kwargs))
374 elif not hasattr(self._data, "__array_function__"):
--> 375 self._data = np.asarray(self._data)
376 return self
377
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype)
555 def __array__(self, dtype=None):
556 array = as_indexable(self.array)
--> 557 return np.asarray(array[self.key], dtype=None)
558
559 def transpose(self, order):
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key)
47 array = self.get_array()
48 if isinstance(key, indexing.BasicIndexer):
---> 49 return array[key.tuple]
50 elif isinstance(key, indexing.VectorizedIndexer):
51 return array.vindex[
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
570
571 fields, selection = pop_fields(selection)
--> 572 return self.get_basic_selection(selection, fields=fields)
573
574 def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
693 if self._shape == ():
694 return self._get_basic_selection_zd(selection=selection, out=out,
--> 695 fields=fields)
696 else:
697 return self._get_basic_selection_nd(selection=selection, out=out,
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields)
709 # obtain encoded data for chunk
710 ckey = self._chunk_key((0,))
--> 711 cdata = self.chunk_store[ckey]
712
713 except KeyError:
~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key)
1249 with self.mutex:
1250 with self.zf.open(key) as f: # will raise KeyError
-> 1251 return f.read()
1252
1253 def __setitem__(self, key, value):
~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in read(self, n)
914 self._offset = 0
915 while not self._eof:
--> 916 buf += self._read1(self.MAX_N)
917 return buf
918
~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in _read1(self, n)
1018 if self._left <= 0:
1019 self._eof = True
-> 1020 self._update_crc(data)
1021 return data
1022
~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in _update_crc(self, newdata)
946 # Check the CRC if we're at the end of the file
947 if self._eof and self._running_crc != self._expected_crc:
--> 948 raise BadZipFile("Bad CRC-32 for file %r" % self.name)
949
950 def read1(self, n):
BadZipFile: Bad CRC-32 for file 'hello/0' |
See the zipstore example in my first comment |
You did supply the traceback! But it's not reproducible. And consistent with #3831 (comment), it's difficult to attribute the break on examples like these. You could try debugging into the |
hmm i didn't realize this. I"m running from conda-forge + linux. Let me try on your CIs. |
Not sure if the builds in #3888 help reproduce things or not? |
I will have to try the debugging things you mentionned some later time :/ |
@hmaarrfk my apologies, I misread your issue and it was always reproducible. I had thought it was referencing a file |
I had a quick debug but didn't completely resolve it. It works fine if you add it as an attribute; do you mean to be adding a single string as a data variable? In [8]: import xarray as xr
...: import zarr
...: x = xr.Dataset()
...: x.attrs['hello'] = 'world' # changed
...: x
...: with zarr.ZipStore('test_store.zip', mode='w') as store:
...: x.to_zarr(store)
...: with zarr.ZipStore('test_store.zip', mode='r') as store:
...: x_read = xr.open_zarr(store).compute()
...: Debugging is non-trivial because the ZipFile is closed by the time I could run |
Hmm, interesting! I've avoided attrs since they often get "lost" in computation, and don't get dragged along as rigorously as coordinates. I do have some real coordinates that are stored as strings. Thanks for the quickfeedback. Here is the reproducing code without using context managers (which auto clsoe things you know) import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
with zarr.ZipStore('test_store.zip', mode='w') as store:
x.to_zarr(store)
read_store = zarr.ZipStore('test_store.zip', mode='r')
x_read = xr.open_zarr(read_store).compute()
# The error will happen before this line is executed
# read_store.close() |
Yeah, we're gradually fixing this. You can set a config to keep attrs by default, or pass |
Thanks for the alternative repro. And I'm fairly confident this is a zarr issue; if the format can't handle a zero-dimensioned coord, it should raise a specific error rather than it being a bad ZipFile |
Honestly, i've found that Keeping things like:
Seems more like an attribute, but really, it is all data, so :/. Regarding the originally issue, it seems that you are right in the sense that a 0 dimension string might be buggy in zarr itself. I guess we (when we have time) will have to dig down to find a MVC example that reproduces the issue without xarray to submit to Zarr. |
@jakirkham not sure if you have any thoughts on why the code above is bugging out. |
Sorry I don't know. Maybe @rabernat can advise? 🙂 |
I think the reason is that for zero sized arrays, you technically aren't allowed to write data to them. xarray/xarray/backends/zarr.py Line 449 in 6378a71
This means that when you create the 0 sized array, you can't actually change the value. Here is a reproducer without xarray import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store , mode='w')
zarr_array = root.create_dataset(name, shape=data.shape, dtype=data.dtype)
root[name][...] = data
zarr_array[...] Though the code path follows what xarray does in the backend. |
And actually, zarr provides a import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store , mode='w')
zarr_array = root.create_dataset(name, data=data, shape=data.shape, dtype=data.dtype)
zarr_array[...] I guess i can open upstream in zarr, but I think for catching the 0 sized array case, it is probably best to use the |
Sure an upstream issue would be welcome. Thanks for unpacking that further Mark 😀 |
See upstream: zarr-developers/zarr-python#551
It seems that using a ZipStore creates 1 byte objects for Unicode string attributes.
For example, saving the same Dataset with a DirectoryStore and a Zip Store creates an attribute for a unicode array with 20 bytes in size in the first, and 1 byte in size in the second.
In fact, ubuntu file roller isn't even allowing me to extract the files.
I have a feeling it is due to the note in the zarr documentation
https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore
MCVE Code Sample
ZipStore
Issued error
Working Directory Store example
Expected Output
The string metadata should work.
Output of
xr.show_versions()
The text was updated successfully, but these errors were encountered: