
Bump Numcodecs requirement to 0.6.1 #347


Closed
wants to merge 21 commits

Conversation

jakirkham
Member

As there are some critical fixes in the latest Numcodecs, this bumps our lower bound to the latest version.
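For reference, a minimal sketch of the kind of change involved, assuming the requirement is declared in setup.py's install_requires (the actual files touched may differ):

    # setup.py (sketch) -- only the numcodecs lower bound changes
    install_requires = [
        'numcodecs>=0.6.1',
        # ... other requirements unchanged
    ]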

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI pass
  • Test coverage is 100% (Coveralls passes)

@alimanfoo alimanfoo added this to the v2.3 milestone Nov 30, 2018
@alimanfoo alimanfoo left a comment
Member


Thanks @jakirkham. Probably worth adding a brief release note.

@jakirkham
Member Author

FTR, saw some test failures locally that CI also appears to encounter. Figured we'd want to look at these and determine how we want to fix them.

@alimanfoo
Member

Ah, yes, sorry, forgot about that. I expect some test failures due to #324. There may be other consequences of the upgrade too. Suggest dealing with them in this PR.

@jakirkham
Member Author

jakirkham commented Nov 30, 2018

SGTM. How about this one (MsgPack)?

        # create an object array using pickle
        z = self.create_array(shape=10, chunks=3, dtype=object, object_codec=Pickle())
        z[0] = 'foo'
>       assert z[0] == 'foo'

ref: https://travis-ci.org/zarr-developers/zarr/jobs/461883537#L2717-L2720

...or this one (Categorize)?

            with pytest.raises(RuntimeError):
                # noinspection PyStatementEffect
>               v[:]
E               Failed: DID NOT RAISE <type 'exceptions.RuntimeError'>

ref: https://travis-ci.org/zarr-developers/zarr/jobs/461883537#L2621-L2624

Previously MsgPack was turning bytes objects to unicode objects when
round-tripping them. However this has been fixed in the latest version
of Numcodecs. So correct this test now that MsgPack is working
correctly.
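For illustration, a minimal sketch of the kind of round-trip the corrected test exercises, assuming zarr's public creation API and the MsgPack codec from numcodecs (not the test's actual code):

    import zarr
    from numcodecs import MsgPack

    # With numcodecs >= 0.6.1, bytes written through the MsgPack object codec
    # should come back as bytes rather than unicode.
    z = zarr.empty(10, chunks=3, dtype=object, object_codec=MsgPack())
    z[0] = b'foo'
    assert z[0] == b'foo'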
@jakirkham
Member Author

Also there is another failure. It appears Pickle wasn't able to handle general buffer-protocol-conforming types in decode, though every other codec worked fine with these (which seems to be a consequence of them using the new utility functions 😉); apparently Pickle was not using them. Have put together PR ( zarr-developers/numcodecs#143 ) with a fix for this. Would be good to do another patch release to get that out (along with anything else these failures might need).

@jakirkham
Member Author

Based on testing locally, it looks like we are down to the RuntimeError not being raised by an object array after removing the filters.

        for compressor in Zlib(1), Blosc():
            z = self.create_array(shape=len(data), chunks=30, dtype=object,
                                  object_codec=Categorize(greetings,
                                                          dtype=object),
                                  compressor=compressor)
            z[:] = data
            v = z.view(filters=[])
            with pytest.raises(RuntimeError):
                # noinspection PyStatementEffect
>               v[:]
E               Failed: DID NOT RAISE <type 'exceptions.RuntimeError'>

@alimanfoo
Member

Hi @jakirkham, took the liberty of pushing a commit that should resolve the test failure due to RuntimeError not being raised for object arrays.

Other issues should be resolved with the fixes you've merged into numcodecs, so I'll cut a numcodecs 0.6.2 release, then come back and update the version requirement here.

@jakirkham
Member Author

Thanks @alimanfoo!

As we already ensured the `chunk` is an `ndarray` viewing the original
data, there is no need to do that here as well. Plus the checks
performed by `ensure_contiguous_ndarray` are not needed for our use case
here, particularly as we have already handled the unusual type cases
above. We also don't need to constrain the buffer size. As such, the only
thing we really need is to flatten the array and make it contiguous,
which is what we handle here directly.
As both the expected `object` case and the non-`object` case perform a
`reshape` to flatten the data, go ahead and refactor that out of both
cases and handle it generally. Simplifies the code a bit.
As refactoring of the `reshape` step has effectively dropped the
expected `object` type case, the checks for different types are a little
more complicated than needed. To fix this, basically invert and swap the
case ordering. This way we can handle all generally expected types first
and simply cast them. Then we can raise if an `object` type shows up and
is unexpected.
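A minimal sketch of the reordered checks described above (names are illustrative, not zarr's actual internals):

    def _prepare_chunk(chunk, expected_dtype):
        # Flatten once, since both the object and non-object branches needed it.
        chunk = chunk.reshape(-1, order='A')

        if chunk.dtype != object:
            # Handle the generally expected dtypes first by simply casting.
            chunk = chunk.astype(expected_dtype)
        elif expected_dtype != object:
            # Only now raise if an object dtype shows up where it isn't expected.
            raise RuntimeError('cannot write object array without an object codec')

        return chunk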
@jakirkham
Member Author

Did a little bit of refactoring on your change. Hope that is ok. Please let me know your thoughts on that.

The `DictStore` is pretty reliant on the fact that values are immutable
and can be easily compared. For example, `__eq__` assumes that all
contents can be compared easily. This works fine if the data is `bytes`,
but it doesn't really work for `ndarray`s, for example. Previously we
would have stored whatever the user gave us here, which means comparisons
could fall down in those cases as well (much as the example in the
tutorial has highlighted on CI). Now we effectively require that the
data either be coercible to `bytes` (e.g., via the new/old buffer
protocol) or be `bytes` to begin with. Make sure not to force this
requirement when nesting one `MutableMapping` within another.
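A minimal sketch of the coercion described here, assuming numcodecs' ensure_bytes (DictStore's actual implementation has more to it):

    from collections.abc import MutableMapping
    from numcodecs.compat import ensure_bytes

    def _coerce_store_value(value):
        # Nested mappings (used for hierarchy) are stored as-is; anything else
        # must be bytes or coercible to bytes via the (new or old) buffer protocol.
        if isinstance(value, MutableMapping):
            return value
        return ensure_bytes(value)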
This test case seems to be ill-posed. Anytime we store `object`s to
`Array`s we require an `object_codec` to be specified. Otherwise we have
no clean way to serialize the data. However this `DictStore` test breaks
that assumption by explicitly storing an `object` type in it even though
this would never work for the other stores (particularly when working
with `Array`s). This includes in-memory Zarr `Array`s, which would be
backed by `DictStore`. Given this, we go ahead and drop this test case.
Instead of using a Python `dict` as the `default` store for a Zarr
`Array`, use the `DictStore`. This ensures that all blobs will be
represented as `bytes` regardless of what the user provided as data.
Thus things like comparisons of stores will work well in the default
case.
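A minimal sketch of that default, with a hypothetical helper name (not zarr's actual code path):

    from zarr.storage import DictStore

    def _default_store(store=None):
        # Fall back to DictStore rather than a plain dict so the in-memory case
        # gets the same bytes-only behavior as the other stores.
        return DictStore() if store is None else store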
@jakirkham
Member Author

jakirkham commented Dec 2, 2018

On a different point, it looks like __eq__ in DictStore is assuming all items in it are comparable. This assumption runs into problems, however, when anything that is not easily comparable is stored in the DictStore. For instance, ndarray is comparable, but does not reduce to a bool as expected, which causes the tutorial's comparison to fail on CI. There are likely other ill-posed situations we can imagine.
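For example (not from the PR, just to illustrate the failure mode):

    import numpy as np

    d1 = {'foo': np.arange(3)}
    d2 = {'foo': np.arange(3)}
    d1 == d2  # raises ValueError: the truth value of an array with more than
              # one element is ambiguous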

Looking at this example before the recent Numcodecs release, it seems the contents of DictStore were always being filled with bytes, likely as an indirect consequence of applying filters to the data first. I've added a commit that enforces this behavior intentionally using ensure_bytes from Numcodecs, making it something we can rely upon for things like __eq__ and such.

However there is a wrinkle with this approach. It appears we had a test that assumed objects could be placed in DictStores, though this behavior doesn't really work for any of the other stores and we usually require an object_codec to be specified to even serialize an object type (specifically with Array). Given this, I've dropped that test and am contending it doesn't conform to our expectations about how stores should typically work with object types.

Also it appears that the Array's default store was a dict, which we cannot apply these fixes to. So I've updated the default to DictStore. I'm not sure why this wasn't already the case, which means I'm probably missing something.

Happy to hear thoughts or receive pushback on this approach as I may very well be missing something. If you know of a better approach, would also be interested in hearing about that as well.

Also raised issue ( #348 ) to provide a nice simple example of this problem.

@jakirkham
Member Author

jakirkham commented Dec 2, 2018

Seems the tutorial test is now hung up on some whitespace issue. Not totally sure how that got introduced.

NVM this was more fallout from the switch to DictStore as the default store for Array.

Edit: This has since been fixed.

As we are now using `DictStore` to back the `Array`, we can correctly
measure how much memory it is using. So update the examples in `info`
and the tutorial to show how much memory is being used. Also update the
store type listed in info as well.
As Numcodecs now includes a very versatile and effective `ensure_bytes`
function, there is no need to define our own in `zarr.storage` as well.
So go ahead and drop it.
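For reference, a couple of hedged usage examples of the Numcodecs helper (behavior as of numcodecs 0.6.x):

    import numpy as np
    from numcodecs.compat import ensure_bytes

    ensure_bytes(b'abc')                    # already bytes -> returned unchanged
    ensure_bytes(np.arange(3, dtype='u1'))  # via the buffer protocol -> b'\x00\x01\x02'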
As `binary_type` is no longer used now that our own `ensure_bytes` has
been dropped, go ahead and drop it as well.
Make use of Numcodecs' `ensure_ndarray` to take `ndarray` views onto
buffers to be stored in a few cases so as to reshape them and avoid a
copy (thanks to the buffer protocol).
Rewrite `buffer_size` to just use Numcodecs' `ensure_ndarray` to get an
`ndarray` that views the data. Once we have the `ndarray`, all that is
needed is to access its `nbytes` attribute, which returns the number of
bytes it takes up.
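A minimal sketch of that rewrite (assuming numcodecs.compat.ensure_ndarray; the real function may differ in details):

    from numcodecs.compat import ensure_ndarray

    def buffer_size(v):
        # View the value through the buffer protocol (no copy) and report its size.
        return ensure_ndarray(v).nbytes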
If the data is already a `str` instance, turn `ensure_str` into a no-op.
For all other cases, make use of Numcodecs' `ensure_bytes` to aid
`ensure_str` in coercing data through the buffer protocol. If we are on
Python 3, then decode the `bytes` object to a `str`.
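A minimal sketch of that behavior (the encoding used for the decode step is an assumption here):

    import sys
    from numcodecs.compat import ensure_bytes

    def ensure_str(s):
        # No-op for str; otherwise coerce through the buffer protocol and, on
        # Python 3, decode the resulting bytes to str.
        if isinstance(s, str):
            return s
        s = ensure_bytes(s)
        if sys.version_info[0] >= 3:
            s = s.decode('ascii')  # assumption: ASCII-only metadata
        return s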
As `DictStore` now must only store `bytes` or types coercible to bytes
via the buffer protocol, there is no possibility for it to have unknown
sizes as `bytes` always have a known size. So drop these cases where the
size can be `-1`.
@jakirkham
Member Author

Simplifies some of the code by making use of Numcodecs utility functions. This avoids a few copies when working with some of the stores. Also generally makes the rest easier to understand.

Make sure that datetime/timedelta arrays are cast to a type that
supports the buffer protocol. Ensure this is a type that can handle all
of the datetime/timedelta values and has the same itemsize.
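A minimal sketch of that cast (illustrative only; zarr's actual code sits elsewhere):

    def _cast_for_buffer(arr):
        # datetime64/timedelta64 couldn't be exported via the buffer protocol at
        # the time, so view them as a same-itemsize integer type before encoding.
        if arr.dtype.kind in 'Mm':
            return arr.view('i8')
        return arr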
Instead of using `ensure_ndarray`, use `ensure_contiguous_ndarray` with
the stores. This ensures that datetime/timedeltas are handled by
default. Also catches things like object arrays. Finally this handles
flattening the array if needed.
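A hedged usage example of the helper (behavior as of numcodecs 0.6.x):

    import numpy as np
    from numcodecs.compat import ensure_contiguous_ndarray

    a = np.arange(6, dtype='i8').view('m8[s]').reshape(2, 3)  # 2-D timedelta64 array
    flat = ensure_contiguous_ndarray(a)  # 1-D int64 view: datetimes handled, flattened, no copy
    # object arrays are rejected, e.g. ensure_contiguous_ndarray(np.empty(2, dtype=object))
    # raises TypeError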
@jakirkham
Member Author

Replacing with PR ( #352 ), which goes to Numcodecs 0.6.2. It also consolidates the changes from here and drops a few that have been pulled into other PRs, namely ensuring DictStore contains only bytes for data ( #350 ) and changing the default backend of Array to DictStore ( #351 ).

@jakirkham jakirkham removed this from the v2.3 milestone Dec 4, 2018