Skip to content

Use more buffers (redux) #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

alimanfoo
Copy link
Member

@alimanfoo alimanfoo commented Nov 22, 2018

Based on work in #121, trying to make a clean split between PY2 and PY3 code paths, and make function naming more intuitive.

TODO:

  • Unit tests and/or doctests in docstrings
  • tox -e py37 passes locally
  • tox -e py27 passes locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • tox -e docs passes locally
  • AppVeyor and Travis CI passes
  • Test coverage to 100% (Coveralls passes)

@alimanfoo alimanfoo force-pushed the use-buffers-redux-alimanfoo-20181122 branch from ceb7ccc to 0153531 Compare November 22, 2018 23:39
Copy link
Member

@jakirkham jakirkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for playing around with this problem as well @alimanfoo. Seems you are encountering the same subtleties I ran into. Left some comments below, but wouldn't take them too seriously. They are more intended as advice. Completely understand this is still a work in progress at this stage.

Would just highlight one point that comes up below and have generally learned by playing with this problem. Namely NumPy ndarrays end up being a great solution to use for all data in codecs. They support both buffer protocols, they support more types out-of-the-box, and they are easier for us and new users to the codebase to work with. This saves on compat code, work the same on Python 2/3, are easily handled by builtin compressors, and make things like casting, reshaping, and getting bytes trivial. In short, a lot of mileage can be gotten out of ndarrays.

@alimanfoo
Copy link
Member Author

Thanks @jakirkham for the comments, this is a tricky one to get right.

FWIW I think the key issue here is that many codecs require an input that, under Python 3, exposes a new-style buffer interface onto a C contiguous block of memory. E.g., things like ZLib, BZ2 and LZMA.

A blocker from just passing a numpy array to these functions is the datetime & timedelta datatypes which you can't take a memoryview of.

If I included a conversion from datetime or timedelta to int64 within the ensure_contiguous_ndarray() function, then it would be possible to use that function everywhere instead of either ensure_memoryview() or ensure_buffer(), and so delete those two functions. Maybe that would be a good idea.

@jakirkham
Copy link
Member

If I included a conversion from datetime or timedelta to int64 within the ensure_contiguous_ndarray() function, then it would be possible to use that function everywhere instead of either ensure_memoryview() or ensure_buffer(), and so delete those two functions. Maybe that would be a good idea.

This is my temptation as well. Let's go for it. 😄

@alimanfoo
Copy link
Member Author

OK, 'tis done! I'm using ensure_contiguous_ndarray() everywhere now. ensure_memoryview() and ensure_buffer() are gone.

There were still some quirks in the BZ2 and GZip codecs that required an explicit conversion to memoryview, but I figure that's OK.

Let me know what you think.

@alimanfoo alimanfoo changed the title WIP Use more buffers (redux) Use more buffers (redux) Nov 23, 2018
@alimanfoo alimanfoo force-pushed the use-buffers-redux-alimanfoo-20181122 branch from 047ac0a to f4f6f5d Compare November 23, 2018 12:17
@alimanfoo alimanfoo mentioned this pull request Nov 23, 2018
8 tasks
@jakirkham
Copy link
Member

OK, 'tis done! I'm using ensure_contiguous_ndarray() everywhere now. ensure_memoryview() and ensure_buffer() are gone.

Beautiful! Thanks for doing that. 😄

There were still some quirks in the BZ2 and GZip codecs that required an explicit conversion to memoryview, but I figure that's OK.

Yeah, I ran into these two as well. Agree this seems fine.

Let me know what you think.

Added a few more comments above in our existing threads to keep continuity with the existing discussion. A couple other comments above as well. For the most part these are moving to minor points. IOW this is looking pretty good.

@alimanfoo alimanfoo force-pushed the use-buffers-redux-alimanfoo-20181122 branch from 1acb560 to 5084c56 Compare November 27, 2018 01:53
@alimanfoo
Copy link
Member Author

Hi @jakirkham, after some thrashing around I think I have basically converged on something very similar to what you have in #121. There are some small differences which I'm more than happy to discuss. I've also pushed on and refactored the buffer compatibility code in the cython modules, to re-use the new compat functions. I still have to address the point you raised about dealing with unicode arrays, but apart from that, I'd be very grateful if you could take a look and see if there's anything you think I've missed, or anything you think should be done differently.

@alimanfoo
Copy link
Member Author

Latest commit disallows unicode array.array, seemed like it was not worth the effort to support it.

@alimanfoo
Copy link
Member Author

In the interests of getting to a 0.6 release asap, I'd like to suggest we move forward with this PR instead of #121, mainly for the reason that I've had the chance to add in a bit more documentation and comments, so hopefully when we or someone else needs to revisit this in future, they should be able to comprehend what is being done and why. I believe that the approaches taken here and in #121 are essentially the same, with the main differences being in function naming and some minor implementation details. Please let me know if any objections. Whatever PR we take forward, primary credit for this work should go to @jakirkham.

@jakirkham
Copy link
Member

Had some pretty minor comments above. Think this is basically done at this stage.

In the interests of getting to a 0.6 release asap, I'd like to suggest we move forward with this PR instead of #121...

Sounds good to me.

...mainly for the reason that I've had the chance to add in a bit more documentation and comments, so hopefully when we or someone else needs to revisit this in future, they should be able to comprehend what is being done and why.

I think the test coverage and function naming were also better here. The API has also improved generally.

Whatever PR we take forward, primary credit for this work should go to @jakirkham.

Thanks for working on this as well. It has come a long way since that PR.

@jakirkham jakirkham added this to the 0.6.0 milestone Nov 27, 2018
@alimanfoo
Copy link
Member Author

Thanks again @jakirkham for the review. Your suggestions are all good and I've gone with them in latest commits. Thanks also for bottoming out some really gnarly Python quirks, it is amazing how deep the rabbit hole goes!

If CI passes I propose to merge if no objections.

@alimanfoo alimanfoo merged commit df60e2f into zarr-developers:master Nov 27, 2018
@alimanfoo alimanfoo deleted the use-buffers-redux-alimanfoo-20181122 branch November 27, 2018 23:16
@jakirkham
Copy link
Member

Thanks @alimanfoo! 🎉

Agree this looks great. Thanks for your hard work here. 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants