Skip to content

Comparison of non-trivial, uncompressed, in-memory Zarr Arrays fails #348

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jakirkham opened this issue Dec 2, 2018 · 9 comments
Closed
Labels
bug Potential issues with the zarr-python library
Milestone

Comments

@jakirkham
Copy link
Member

Comparing two in-memory Zarr arrays that do not use compressors errors out for anything other than the trivial case.

The issue being that NumPy arrays are being placed in the store (dict in this case). Even though NumPy arrays can be compared, the comparison is vectorized, which does not return a single bool that can be used in conditional expressions.

This usually works fine for other compressors as they often return bytes, which are placed in the store and can be compared to return a single bool. However this is not the case when no compressors are used or when some combination of filters and/or compressors are used that do return a NumPy array.

Minimal, reproducible code sample, a copy-pastable example if possible

In [1]: import zarr                                                             

In [2]: z1 = zarr.zeros((100,), chunks=(10,), dtype='i4', compressor=None)      

In [3]: z2 = zarr.zeros((100,), chunks=(10,), dtype='i4', compressor=None)      

In [4]: z1 == z2                                                                
Out[4]: True

In [5]: z1[0] = 5                                                               

In [6]: z2[0] = 5                                                               

In [7]: z1 == z2                                                                
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-4176b4e1c607> in <module>
----> 1 z1 == z2

/zopt/conda3/envs/test_zarr/lib/python3.6/site-packages/zarr/core.py in __eq__(self, other)
    410         return (
    411             isinstance(other, Array) and
--> 412             self.store == other.store and
    413             self.read_only == other.read_only and
    414             self.path == other.path and

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Problem description

Ideally comparison here would return True or False instead of erroring out.

Version and installation information

Please provide the following:

  • Value of zarr.__version__: 2.2.0
  • Value of numcodecs.__version__: 0.5.5
  • Version of Python interpreter: 3.6.7
  • Operating system (Linux/Windows/Mac): macOS
  • How Zarr was installed (e.g., "using pip into virtual environment", or "using conda"): Using Conda

Also, if you think it might be relevant, please provide the output from pip freeze or
conda env export depending on which was used to install Zarr.

name: test_zarr
channels:
  - conda-forge
  - defaults
dependencies:
  - appnope=0.1.0=py36_1000
  - asciitree=0.3.3=py_2
  - backcall=0.1.0=py_0
  - blas=1.1=openblas
  - ca-certificates=2018.11.29=ha4d7672_0
  - certifi=2018.11.29=py36_1000
  - decorator=4.3.0=py_0
  - fasteners=0.14.1=py_3
  - ipython=7.2.0=py36h24bf2e0_1000
  - ipython_genutils=0.2.0=py_1
  - jedi=0.13.1=py36_1000
  - libffi=3.2.1=hfc679d8_5
  - libgfortran=3.0.0=1
  - monotonic=1.5=py_0
  - msgpack-python=0.6.0=py36h2d50403_0
  - ncurses=6.1=hfc679d8_1
  - numcodecs=0.5.5=py36hfc679d8_1
  - numpy=1.15.4=py36_blas_openblashb06ca3d_0
  - openblas=0.3.3=ha44fe06_1
  - openssl=1.0.2p=h470a237_1
  - parso=0.3.1=py_0
  - pexpect=4.6.0=py36_1000
  - pickleshare=0.7.5=py36_1000
  - prompt_toolkit=2.0.7=py_0
  - ptyprocess=0.6.0=py36_1000
  - pygments=2.3.0=py_0
  - python=3.6.7=h5001a0f_1
  - readline=7.0=haf1bffa_1
  - setuptools=40.6.2=py36_0
  - six=1.11.0=py36_1001
  - sqlite=3.26.0=hb1c47c0_0
  - tk=8.6.9=ha92aebf_0
  - traitlets=4.3.2=py36_1000
  - wcwidth=0.1.7=py_1
  - xz=5.2.4=h470a237_1
  - zarr=2.2.0=py_1
  - zlib=1.2.11=h470a237_3
@jakirkham
Copy link
Member Author

A couple of options we might consider here are as follows.

  1. Convert all in-memory stored data to bytes.
  2. Take memoryviews of all stored data.
  3. Update comparisons of stores to handle NumPy's ndarrays.
  4. ?

Option 1 can be nice as this is consistent with the behavior of compressors currently and even more so historically. It also maps very well to what we already do with other stores where we are also writing out blobs of bytes. Specifically for the in-memory case this is nice as the content in the store is immutable; so, someone can't accidentally change the stores content by messing with a NumPy array in the store. A major downside is creating a bytes object requires a copy. Considering this is the in-memory case, copies could be pretty important. That said, it appears we already take one copy if there are no filters. ( 3c00d52 )

Option 2 can be nice from the standpoint of avoiding copies. Also memoryviews are easy to compare on Python 2/3. Plus they can be turned back into a NumPy ndarray (possibly allowing writing) pretty easily. However we still run the risk of users unintentionally messing with the data in the store as noted in issue ( #79 ). This could potentially be avoided by copying.

Option 3 would avoid making any changes to how we store data currently. However this could be pretty complicated to implement. Plus stores would need to start learning about NumPy ndarrays and possibly other things we decide to store in them in the future. Not to mention there may be other methods of the store that would need to be modified similarly. This would also mean that other MutableMapping implementations beyond the ones we have would be difficult if not impossible to support without converting them or asking the user to write such functionality.

There may be other options that have not been listed, which would be interesting to evaluate and discuss.

@jakirkham jakirkham added the bug Potential issues with the zarr-python library label Dec 2, 2018
@jakirkham jakirkham added this to the v2.3 milestone Dec 3, 2018
@jakirkham
Copy link
Member Author

Have marked this as a release blocker as this is causing problems for the Numcodecs upgrade. ( #347 ) We effectively have a solution modeled after option 1 in that PR, but I don't think we should be beholden to that implementation or the choices made there when it comes to solving this issue.

@alimanfoo
Copy link
Member

Thanks @jakirkham for capturing this so clearly.

FWIW I think option 1 would be very reasonable, it seems like a good idea to ensure in general that the store has a reference to data that isn't going to change. Note that many of the compression codecs (Blosc, Zlib, LZMA, LZ4, Zstd) did previously return a bytes object from encode(), and so in cases where a compressor is configured, the store would not need to make a copy, it would already be given a bytes object by the compressor and could just keep that. However, we changed that in zarr-developers/numcodecs#136 and zarr-developers/numcodecs#144 so now an ndarray is returned, in which case obtaining bytes would require a copy. I wonder if we should back out those changes, and just let the codecs return bytes as they did previously.

Option 2 could also be good, I guess we could check to see if the memoryview was writeable and only take a copy if it was.

Agree option 3 seems not so good. Better to have a clean solution that allows store comparison and ensures data are immutable IMO.

@jakirkham
Copy link
Member Author

Completely agree.

Probably option 1 is the most logical.

Option 2 we'd probably have an ndarray first (or at least could get one easily). So checking whether it is read-only should be straightforward. We could make the copied ndarray in the no filter case read-only as well; since, we know it cannot be altered externally. Though that check could be dropped for the option 1 case.

The next question is where we should perform this coercion. Should it happen the Array or should it happen in the store? This is certainly a point where the discussion here and issue ( #349 ) converge a bit. Happy to have this discussion in whichever issue makes most sense.

However, we changed that in zarr-developers/numcodecs#136 and zarr-developers/numcodecs#144 so now an ndarray is returned, in which case obtaining bytes would require a copy. I wonder if we should back out those changes, and just let the codecs return bytes as they did previously.

It's a good point. Have also been thinking about whether that is still a good idea. Happy to back that out if that is best.

@jakirkham
Copy link
Member Author

As a step in this direction, have added PR ( #350 ), which ensures the contents of DictStore are only bytes.

@alimanfoo
Copy link
Member

alimanfoo commented Dec 3, 2018 via email

@jakirkham
Copy link
Member Author

Thanks for confirming. Agree that seems like the most reasonable option.

@jakirkham
Copy link
Member Author

It seems that comparisons fail for a few stores. Added a test to demonstrate this in PR ( #357 ).

@jakirkham
Copy link
Member Author

The original issue has been fixed by PR ( #359 ). A similar fix has been applied to DictStore with PR ( #350 ). Both the dict and DictStore cases test that all of their values are bytes, which ensures comparison is possible, but does not test comparison explicitly.

Testing of comparisons is being discussed in PR ( #357 ), but is not yet complete. However good discussion is already being had there.

Am going to go ahead and close this as the original issue is fixed. Associated testing can continue to be discussed in the aforementioned PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Potential issues with the zarr-python library
Projects
None yet
Development

No branches or pull requests

2 participants