
Bump Numcodecs requirement to 0.6.1 #347


Closed
wants to merge 21 commits

Conversation

jakirkham
Member

As there are some critical fixes in the latest Numcodecs, this bumps our lower bound to the latest version.
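For reference, a minimal sketch of the kind of change involved, assuming the requirement is declared in setup.py's install_requires (the actual files touched may differ):

    # setup.py (sketch) -- only the numcodecs lower bound changes
    install_requires = [
        'numcodecs>=0.6.1',
        # ... other requirements unchanged
    ]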

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • Docs build locally (e.g., run tox -e docs)
  • AppVeyor and Travis CI pass
  • Test coverage is 100% (Coveralls passes)

@alimanfoo alimanfoo added this to the v2.3 milestone Nov 30, 2018
@alimanfoo alimanfoo left a comment
Member


Thanks @jakirkham. Probably worth adding a brief release note.

@jakirkham
Member Author

FTR, saw some test failures locally that CI also appears to encounter. Figured we'd want to look at these and determine how we want to fix them.

@alimanfoo
Member

Ah, yes, sorry, forgot about that. I expect some test failures due to #324. There may be other consequences of the upgrade too. Suggest dealing with them in this PR.

@jakirkham
Member Author

jakirkham commented Nov 30, 2018

SGTM. How about this one (MsgPack)?

        # create an object array using pickle
        z = self.create_array(shape=10, chunks=3, dtype=object, object_codec=Pickle())
        z[0] = 'foo'
>       assert z[0] == 'foo'

ref: https://travis-ci.org/zarr-developers/zarr/jobs/461883537#L2717-L2720

...or this one (Categorize)?

            with pytest.raises(RuntimeError):
                # noinspection PyStatementEffect
>               v[:]
E               Failed: DID NOT RAISE <type 'exceptions.RuntimeError'>

ref: https://travis-ci.org/zarr-developers/zarr/jobs/461883537#L2621-L2624

Previously MsgPack was turning bytes objects to unicode objects when
round-tripping them. However this has been fixed in the latest version
of Numcodecs. So correct this test now that MsgPack is working
correctly.
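For illustration, a minimal sketch of the kind of round-trip the corrected test exercises, assuming zarr's public creation API and the MsgPack codec from numcodecs (not the test's actual code):

    import zarr
    from numcodecs import MsgPack

    # With numcodecs >= 0.6.1, bytes written through the MsgPack object codec
    # should come back as bytes rather than unicode.
    z = zarr.empty(10, chunks=3, dtype=object, object_codec=MsgPack())
    z[0] = b'foo'
    assert z[0] == b'foo'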
@jakirkham
Member Author

Also there is another failure. It appears Pickle wasn't able to handle general buffer-protocol-conforming types in decode, though every other codec worked fine with these (which seems to be a consequence of them using the new utility functions 😉); apparently Pickle was not using them. Have put together PR ( zarr-developers/numcodecs#143 ) with a fix for this. Would be good to do another patch release to get that out (along with anything else these failures might need).

@jakirkham
Member Author

Based on testing locally, it looks like we are down to the RuntimeError not being raised by an object array after removing the filters.

        for compressor in Zlib(1), Blosc():
            z = self.create_array(shape=len(data), chunks=30, dtype=object,
                                  object_codec=Categorize(greetings,
                                                          dtype=object),
                                  compressor=compressor)
            z[:] = data
            v = z.view(filters=[])
            with pytest.raises(RuntimeError):
                # noinspection PyStatementEffect
>               v[:]
E               Failed: DID NOT RAISE <type 'exceptions.RuntimeError'>

@alimanfoo
Member

Hi @jakirkham, took the liberty of pushing a commit that should resolve the test failure due to RuntimeError not being raised for object arrays.

Other issues should be resolved with the fixes you've merged into numcodecs, so I'll cut a numcodecs 0.6.2 release, then come back and update the version requirement here.

@jakirkham
Member Author

Thanks @alimanfoo!

As we already ensured the `chunk` is an `ndarray` viewing the original
data, there is no need to do that here as well. Plus the checks
performed by `ensure_contiguous_ndarray` are not needed for our use case
here, particularly as we have already handled the unusual type cases
above. We also don't need to constrain the buffer size. As such, the only
thing we really need is to flatten the array and make it contiguous,
which is what we handle here directly.
As both the expected `object` case and the non-`object` case perform a
`reshape` to flatten the data, go ahead and refactor that out of both
cases and handle it generally. Simplifies the code a bit.
As refactoring of the `reshape` step has effectively dropped the
expected `object` type case, the checks for different types are a little
more complicated than needed. To fix this, basically invert and swap the
case ordering. This way we can handle all generally expected types first
and simply cast them. Then we can raise if an `object` type shows up and
is unexpected.
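A minimal sketch of the reordered checks described above (names are illustrative, not zarr's actual internals):

    def _prepare_chunk(chunk, expected_dtype):
        # Flatten once, since both the object and non-object branches needed it.
        chunk = chunk.reshape(-1, order='A')

        if chunk.dtype != object:
            # Handle the generally expected dtypes first by simply casting.
            chunk = chunk.astype(expected_dtype)
        elif expected_dtype != object:
            # Only now raise if an object dtype shows up where it isn't expected.
            raise RuntimeError('cannot write object array without an object codec')

        return chunk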
@jakirkham
Member Author

Did a little bit of refactoring on your change. Hope that is ok. Please let me know your thoughts on that.

The `DictStore` is pretty reliant on the fact that values are immutable
and can be easily compared. For example, `__eq__` assumes that all
contents can be compared easily. This works fine if the data is `bytes`,
but it doesn't really work for `ndarray`s, for example. Previously we
would have stored whatever the user gave us here, which means comparisons
could fall down in those cases as well (much as the example in the
tutorial has highlighted on CI). Now we effectively require that the
data either be coercible to `bytes` (e.g., via the new/old buffer
protocol) or be `bytes` to begin with. Make sure not to force this
requirement when nesting one `MutableMapping` within another.
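A minimal sketch of the coercion described here, assuming numcodecs' ensure_bytes (DictStore's actual implementation has more to it):

    from collections.abc import MutableMapping
    from numcodecs.compat import ensure_bytes

    def _coerce_store_value(value):
        # Nested mappings (used for hierarchy) are stored as-is; anything else
        # must be bytes or coercible to bytes via the (new or old) buffer protocol.
        if isinstance(value, MutableMapping):
            return value
        return ensure_bytes(value)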
This test case seems to be ill-posed. Anytime we store `object`s to
`Array`s we require an `object_codec` to be specified. Otherwise we have
no clean way to serialize the data. However this `DictStore` test breaks
that assumption by explicitly storing an `object` type in it even though
this would never work for the other stores (particularly when working
with `Array`s). This includes in-memory Zarr `Array`s, which would be
backed by `DictStore`. Given this, we go ahead and drop this test case.
Instead of using a Python `dict` as the `default` store for a Zarr
`Array`, use the `DictStore`. This ensures that all blobs will be
represented as `bytes` regardless of what the user provided as data.
Thus things like comparisons of stores will work well in the default
case.
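A minimal sketch of that default, with a hypothetical helper name (not zarr's actual code path):

    from zarr.storage import DictStore

    def _default_store(store=None):
        # Fall back to DictStore rather than a plain dict so the in-memory case
        # gets the same bytes-only behavior as the other stores.
        return DictStore() if store is None else store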
@jakirkham
Member Author

jakirkham commented Dec 2, 2018

On a different point, it looks like __eq__ in DictStore is assuming all items in it are comparable. This assumption runs into problems, however, when anything that is not easily comparable is stored in the DictStore. For instance, ndarray is comparable, but does not reduce to a bool as expected, which causes the tutorial's comparison to fail on CI. There are likely other ill-posed situations we can imagine.
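For example (not from the PR, just to illustrate the failure mode):

    import numpy as np

    d1 = {'foo': np.arange(3)}
    d2 = {'foo': np.arange(3)}
    d1 == d2  # raises ValueError: the truth value of an array with more than
              # one element is ambiguous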

Looking at this example before the recent Numcodecs release, it seems the contents of DictStore were always being filled with bytes, likely as an indirect consequence of applying filters to the data first. I've added a commit that enforces this behavior intentionally using ensure_bytes from Numcodecs, making it something we can rely upon for things like __eq__ and such.

However there is a wrinkle with this approach. It appears we had a test that assumed objects could be placed in DictStores, though this behavior doesn't really work for any of the other stores and we usually require an object_codec to be specified to even serialize an object type (specifically with Array). Given this, I've dropped that test and am contending it doesn't conform to our expectations about how stores should typically work with object types.

Also it appears that the Array's default store was a dict, which we cannot apply these fixes to. So I've updated the default to DictStore. I'm not sure why this wasn't already the case, which means I'm probably missing something.

Happy to hear thoughts or receive pushback on this approach as I may very well be missing something. If you know of a better approach, would also be interested in hearing about that as well.

Also raised issue ( #348 ) to provide a nice simple example of this problem.

@jakirkham
Member Author

jakirkham commented Dec 2, 2018

Seems the tutorial test is now hung up on some whitespace issue. Not totally sure how that got introduced.

NVM this was more fallout from the switch to DictStore as the default store for Array.

Edit: This has since been fixed.

As we are now using `DictStore` to back the `Array`, we can correctly
measure how much memory it is using. So update the examples in `info`
and the tutorial to show how much memory is being used. Also update the
store type listed in info as well.
As Numcodecs now includes a very versatile and effective `ensure_bytes`
function, there is no need to define our own in `zarr.storage` as well.
So go ahead and drop it.
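For reference, a couple of hedged usage examples of the Numcodecs helper (behavior as of numcodecs 0.6.x):

    import numpy as np
    from numcodecs.compat import ensure_bytes

    ensure_bytes(b'abc')                    # already bytes -> returned unchanged
    ensure_bytes(np.arange(3, dtype='u1'))  # via the buffer protocol -> b'\x00\x01\x02'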
As `binary_type` is no longer used now that our own `ensure_bytes` has
been dropped, go ahead and drop it as well.
Make use of Numcodecs' `ensure_ndarray` to take `ndarray` views onto
buffers to be stored in a few cases so as to reshape them and avoid a
copy (thanks to the buffer protocol).
Rewrite `buffer_size` to just use Numcodecs' `ensure_ndarray` to get an
`ndarray` that views the data. Once we have the `ndarray`, all that is
needed is to access its `nbytes` attribute, which returns the number of
bytes it takes up.
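A minimal sketch of that rewrite (assuming numcodecs.compat.ensure_ndarray; the real function may differ in details):

    from numcodecs.compat import ensure_ndarray

    def buffer_size(v):
        # View the value through the buffer protocol (no copy) and report its size.
        return ensure_ndarray(v).nbytes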
If the data is already a `str` instance, turn `ensure_str` into a no-op.
For all other cases, make use of Numcodecs' `ensure_bytes` to aid
`ensure_str` in coercing data through the buffer protocol. If we are on
Python 3, then decode the `bytes` object to a `str`.
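A minimal sketch of that behavior (the encoding used for the decode step is an assumption here):

    import sys
    from numcodecs.compat import ensure_bytes

    def ensure_str(s):
        # No-op for str; otherwise coerce through the buffer protocol and, on
        # Python 3, decode the resulting bytes to str.
        if isinstance(s, str):
            return s
        s = ensure_bytes(s)
        if sys.version_info[0] >= 3:
            s = s.decode('ascii')  # assumption: ASCII-only metadata
        return s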
As `DictStore` now must only store `bytes` or types coercible to bytes
via the buffer protocol, there is no possibility for it to have unknown
sizes as `bytes` always have a known size. So drop these cases where the
size can be `-1`.
@jakirkham
Member Author

Simplifies some of the code by making use of Numcodecs utility functions. This avoids a few copies when working with some of the stores. Also generally makes the rest easier to understand.

Make sure that datetime/timedelta arrays are cast to a type that
supports the buffer protocol. Ensure this is a type that can handle all
of the datetime/timedelta values and has the same itemsize.
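A minimal sketch of that cast (illustrative only; zarr's actual code sits elsewhere):

    def _cast_for_buffer(arr):
        # datetime64/timedelta64 couldn't be exported via the buffer protocol at
        # the time, so view them as a same-itemsize integer type before encoding.
        if arr.dtype.kind in 'Mm':
            return arr.view('i8')
        return arr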
Instead of using `ensure_ndarray`, use `ensure_contiguous_ndarray` with
the stores. This ensures that datetime/timedeltas are handled by
default. Also catches things like object arrays. Finally this handles
flattening the array if needed.
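A hedged usage example of the helper (behavior as of numcodecs 0.6.x):

    import numpy as np
    from numcodecs.compat import ensure_contiguous_ndarray

    a = np.arange(6, dtype='i8').view('m8[s]').reshape(2, 3)  # 2-D timedelta64 array
    flat = ensure_contiguous_ndarray(a)  # 1-D int64 view: datetimes handled, flattened, no copy
    # object arrays are rejected, e.g. ensure_contiguous_ndarray(np.empty(2, dtype=object))
    # raises TypeError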
@jakirkham
Member Author

Replacing with PR ( #352 ), which goes to Numcodecs 0.6.2. It also consolidates the changes from here and drops a few that have been pulled into other PRs, namely ensuring DictStore contains only bytes for data ( #350 ) and changing the default backend of Array to DictStore ( #351 ).

@jakirkham jakirkham removed this from the v2.3 milestone Dec 4, 2018