Skip to content

New codec contribution guide #125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jkmacc-LANL opened this issue Nov 12, 2018 · 12 comments
Closed

New codec contribution guide #125

jkmacc-LANL opened this issue Nov 12, 2018 · 12 comments

Comments

@jkmacc-LANL
Copy link

Hi, all. If I were interested in contributing a new codec, is there a concise guide to follow, or should I just take a crack at imitating an existing one? I have no problems attempting the latter, but I feel like I might be missing part of the project docs that describes this explicitly, and I'd want to follow project best practices.

Many thanks!

@alimanfoo
Copy link
Member

Hi @jkmacc-LANL, thanks for getting in touch. There are no specific guideline as such, except for the documentation on the Codec abstract base class which is what you'd have to implement. Other than that, imitating any of the existing codecs is a fine way to go. If you have any questions while working on an implementation, please feel free to raise an issue to discuss.

One point of detail maybe worth mentioning, generally the encode() and decode() methods should accept any object implementing the new-style buffer protocol. There are some exceptions to this rule, where you may be more specific in what type of object can be passed to encode(). You may notice this when looking at the existing codecs. But on decode() you should be prepared to accept anything exposing the new-style buffer protocol. Hope that makes sense.

@jakirkham
Copy link
Member

...generally the encode() and decode() methods should accept any object implementing the new-style buffer protocol

FWIW we have been doing some work in PR ( #121 ), which should make this pretty trivial.

@jkmacc-LANL
Copy link
Author

Thanks, @alimanfoo ! I've not worked with the buffer protocol directly before. I may have questions, but I'll try to follow your existing examples. @jakirkham So, pay attention to opportunities to use to_buffer and ndarray_to_buffer?

@alimanfoo
Copy link
Member

alimanfoo commented Nov 12, 2018 via email

@jakirkham
Copy link
Member

So, pay attention to opportunities to use to_buffer and ndarray_to_buffer?

Definitely.

We will hopefully come up with a better name than to_buffer (maybe to_encodable_ndarray?).

Currently ndarray_to_buffer exists, but it will be generalized a bit after that PR lands. So should handle the buffer protocol details for the user returning an ndarray that views the original data with shape and type intact.

@jkmacc-LANL
Copy link
Author

@jakirkham I suspect I wouldn't get to this until after your PR, so I'll just follow examples that use it.

@alimanfoo I'd try to contribute and wrap some C code. Some of it does things like differencing before compression, so I may need some guidance about how to use what you've already done instead of adding duplicative code.

@alimanfoo
Copy link
Member

alimanfoo commented Nov 13, 2018 via email

@jkmacc-LANL
Copy link
Author

jkmacc-LANL commented Nov 13, 2018

Thanks. It's the compress/decompress functions from this C code, which I currently have wrapped using ctypes. The format is called "e1" compression, which encodes groups of int32 values into variable-sized chunks of bytes. It's not well represented online, but is a common form of compression for geophysical time series.

@jkmacc-LANL
Copy link
Author

Alternately, I'm finding a lot of similarities with existing compressors and filters; a gentle nudge towards any in particular for benchmark testing is also welcomed.

@jakirkham
Copy link
Member

Yeah, there's a lot of boilerplate that we can cutdown on. There also are copies occurring in a few places too. Those are the other thing we are trying to address with PR ( #121 ). This should keep the code in compressors light.

@jkmacc-LANL
Copy link
Author

Closing for now. Thanks, all!

@jakirkham
Copy link
Member

So PR ( #128 ) just went in, which should make this a bit easier.

The key additions are a few utility functions. These are ensure_ndarray and ensure_contiguous_ndarray. These create a NumPy ndarray from the data without copying by leveraging the buffer protocol. The latter, ensure_contiguous_ndarray, is intended for serializable data; so, it performs some checks along those lines.

Also added is ensure_bytes, which will do at most 1 copy as that is required to create a bytes object. An existing function was renamed, now called ndarray_copy, and largely reimplemented internally to copy data to an output buffer.

In the process of doing this work, codecs were revamped internally to use these functions. So there should be lots of examples. Please let us know if you have questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants