New codec contribution guide #125

jkmacc-LANL · 2018-11-12T17:40:02Z

Hi, all. If I were interested in contributing a new codec, is there a concise guide to follow, or should I just take a crack at imitating an existing one? I have no problems attempting the latter, but I feel like I might be missing part of the project docs that describes this explicitly, and I'd want to follow project best practices.

Many thanks!

alimanfoo · 2018-11-12T18:39:03Z

Hi @jkmacc-LANL, thanks for getting in touch. There are no specific guidelines as such, except for the documentation on the Codec abstract base class, which is what you'd have to implement. Other than that, imitating any of the existing codecs is a fine way to go. If you have any questions while working on an implementation, please feel free to raise an issue to discuss.

One point of detail maybe worth mentioning: generally the encode() and decode() methods should accept any object implementing the new-style buffer protocol. There are some exceptions to this rule, where you may be more specific about what type of object can be passed to encode(). You may notice this when looking at the existing codecs. But on decode() you should be prepared to accept anything exposing the new-style buffer protocol. Hope that makes sense.
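As a rough illustration of what "accept any object implementing the new-style buffer protocol" can mean in practice, here is a minimal standard-library sketch. The class name ToyZlibCodec is hypothetical and is not part of numcodecs; it only mirrors the encode()/decode() shape described above.

```python
import array
import zlib


class ToyZlibCodec:
    """Hypothetical stand-in for a codec; not a numcodecs class."""

    def encode(self, buf):
        # memoryview() works on any object exporting the buffer protocol.
        # Casting to unsigned bytes handles multi-byte item types as well.
        view = memoryview(buf).cast("B")
        return zlib.compress(view)

    def decode(self, buf):
        # Same treatment on the way back: bytes, bytearray, memoryview,
        # array.array and similar objects are all accepted.
        view = memoryview(buf).cast("B")
        return zlib.decompress(view)


codec = ToyZlibCodec()
data = array.array("i", [1, 2, 3, 4])
encoded = codec.encode(data)

# decode() is handed three different buffer-exporting types here.
for wrapped in (encoded, bytearray(encoded), memoryview(encoded)):
    decoded = codec.decode(wrapped)
    assert decoded == data.tobytes()
```

Note that memoryview.cast() requires a C-contiguous buffer, so a real codec would need to decide how to treat non-contiguous inputs.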

jakirkham · 2018-11-12T18:45:45Z

...generally the encode() and decode() methods should accept any object implementing the new-style buffer protocol

FWIW we have been doing some work in PR ( #121 ), which should make this pretty trivial.

jkmacc-LANL · 2018-11-12T18:55:10Z

Thanks, @alimanfoo ! I've not worked with the buffer protocol directly before. I may have questions, but I'll try to follow your existing examples. @jakirkham So, pay attention to opportunities to use to_buffer and ndarray_to_buffer?

alimanfoo · 2018-11-12T19:58:51Z

Out of interest, will you be wrapping an existing Python package, or wrapping some C code, or doing something else?


jakirkham · 2018-11-12T20:01:56Z

So, pay attention to opportunities to use to_buffer and ndarray_to_buffer?

Definitely.

We will hopefully come up with a better name than to_buffer (maybe to_encodable_ndarray?).

Currently ndarray_to_buffer exists, but it will be generalized a bit after that PR lands. It should then handle the buffer protocol details for the user, returning an ndarray that views the original data with shape and type intact.

jkmacc-LANL · 2018-11-12T21:23:39Z

@jakirkham I suspect I wouldn't get to this until after your PR, so I'll just follow examples that use it.

@alimanfoo I'd try to contribute and wrap some C code. Some of it does things like differencing before compression, so I may need some guidance about how to use what you've already done instead of adding duplicative code.

alimanfoo · 2018-11-13T01:17:20Z

FWIW, if you're wrapping C code, it might be worth describing the codec you'd like to add in a little more detail, including which (if any) existing C libraries you'd like to wrap and what (if any) new C code you'd like to implement. There are a couple of different options for wrapping C code in numcodecs, and I'd be happy to talk them through.


jkmacc-LANL · 2018-11-13T15:28:56Z

Thanks. It's the compress/decompress functions from this C code, which I currently have wrapped using ctypes. The format is called "e1" compression, which encodes groups of int32 values into variable-sized chunks of bytes. It's not well represented online, but is a common form of compression for geophysical time series.

jkmacc-LANL · 2018-11-14T18:50:07Z

Alternatively, I'm finding a lot of similarities with existing compressors and filters; a gentle nudge towards any in particular for benchmark testing would also be welcome.

jakirkham · 2018-11-19T15:41:42Z

Yeah, there's a lot of boilerplate that we can cut down on. There are also copies occurring in a few places. Those are the other thing we are trying to address with PR ( #121 ). This should keep the code in compressors light.

jkmacc-LANL · 2018-11-24T22:29:44Z

Closing for now. Thanks, all!

jakirkham · 2018-11-28T00:18:57Z

So PR ( #128 ) just went in, which should make this a bit easier.

The key additions are a few utility functions. These are ensure_ndarray and ensure_contiguous_ndarray. These create a NumPy ndarray from the data without copying by leveraging the buffer protocol. The latter, ensure_contiguous_ndarray, is intended for serializable data; so, it performs some checks along those lines.

Also added is ensure_bytes, which will do at most 1 copy as that is required to create a bytes object. An existing function was renamed, now called ndarray_copy, and largely reimplemented internally to copy data to an output buffer.

In the process of doing this work, codecs were revamped internally to use these functions. So there should be lots of examples. Please let us know if you have questions.

jkmacc-LANL closed this as completed Nov 24, 2018

This was referenced Nov 28, 2018

ZFP Compression #117

Open

Using JPEG2000 for chunk compression #73

Open
