
Modular encoding #175


Closed
wants to merge 3 commits into from

Conversation

akleeman
Contributor

Restructured Backends to make CF conventions handling consistent.

Among other things this includes:

  • EncodedDataStores which can wrap other stores and allow
    for modular encoding/decoding.
  • Trivial indices ds['x'] = ('x', np.arange(10)) are no longer
    stored on disk and are only created when accessed.
  • AbstractDataStore API change. Shouldn't affect external users.
  • missing_value attributes now function like _FillValue

All current tests are passing (though it could use more new ones).

Alex Kleeman added 3 commits June 25, 2014 02:37
fill_value = missing_value
# if both were given we make sure they are the same.
if fill_value is not None and missing_value is not None:
    assert fill_value == missing_value
This will fail if both are nan.... try utils.equivalent instead of ==?
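The problem with a plain `==` is that `nan == nan` is always `False` in IEEE arithmetic, so two identical NaN fill values would trip the assertion. A minimal sketch of the kind of comparison the reviewer is suggesting (the helper name `equivalent` mirrors the `utils.equivalent` mentioned above; the body here is an illustrative assumption, not xray's actual implementation):

```python
import numpy as np

def equivalent(first, second):
    # Treat two NaN values as equal, unlike plain ==.
    if first is second:
        return True
    try:
        if np.isnan(first) and np.isnan(second):
            return True
    except TypeError:
        pass  # non-numeric values can't be NaN
    return first == second

# Plain == fails for NaN, but the helper does not:
assert not (np.nan == np.nan)
assert equivalent(np.nan, np.nan)
assert equivalent(-9999, -9999)
```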

shoyer commented Jun 26, 2014

Generally this looks very nice! I'm still mulling over the API -- it seems close but not quite right yet. In particular, using a decorator + inheritance to implement custom encoding seems like too much.

def __init__(self, dict_store=None):
    self.dimensions = OrderedDict()
    self.variables = OrderedDict()
    self.attributes = OrderedDict()
How about making the signature closer to a Dataset, like this:

def __init__(self, variables=None, attributes=None):
    ...

It's pretty straightforward to use dict unpacking if you want to pass in a dict_store object like you're currently doing.
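To illustrate the dict-unpacking point: if the existing data is already a dict keyed by `variables` and `attributes`, `**` unpacking passes it straight into the Dataset-like signature. A minimal sketch (the `InMemoryDataStore` class here is hypothetical, just to show the calling convention):

```python
from collections import OrderedDict

class InMemoryDataStore:
    # Hypothetical store with the Dataset-like signature suggested above.
    def __init__(self, variables=None, attributes=None):
        self.variables = OrderedDict(variables or {})
        self.attributes = OrderedDict(attributes or {})

# An existing dict-of-dicts unpacks directly into the new signature:
dict_store = {'variables': {'x': [1, 2, 3]}, 'attributes': {'title': 'demo'}}
store = InMemoryDataStore(**dict_store)
assert store.variables['x'] == [1, 2, 3]
assert store.attributes['title'] == 'demo'
```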

shoyer commented Jun 26, 2014

Some additional thoughts on the advanced interface (we should discuss more when we're both awake):

The advanced interface should support specifying decoding or encoding only (e.g., if we only need to read some obscure format, not write it).

Instead of all these DataStore subclasses, what about having more generic Coder (singleton?) objects which have encode and/or decode methods, but that don't store any data? It is somewhat confusing to keep track of state with all these custom datastores.

A neat trick about coders rather than data stores is that they are simple enough that they could be easily composed, similar to a scikit-learn pipeline. For example, the default CFCoder could be written as something like:

CFCoder = Compose(TimeCoder, ScaleCoder, MaskCoder, DTypeCoder)

I suppose the painful part of using coders is the need to close files on disk to cleanup. Still, I would rather have a single EncodedDataStore class which is initialized with coder and underlying_store arguments (either classes or objects) rather than have the primary interface be writing custom AbstractEncodedDataStore subclasses. That feels like unnecessary complexity.

Instead of using the decorator, the interface could look something like:

CFNetCDFStore = EncodedDataStore(CFCoder, NetCDF4DataStore)

or

my_store = EncodedDataStore(CFCoder(**my_options), NetCDF4DataStore('test.nc'))
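A minimal sketch of what the stateless, composable `Coder` idea above could look like (the class names follow the comment; everything else, including using coder instances rather than classes and a toy `MaskCoder`, is an illustrative assumption):

```python
class Coder:
    # Stateless coder: transforms (variables, attributes) pairs in
    # both directions but stores no data itself.
    def encode(self, variables, attributes):
        return variables, attributes

    def decode(self, variables, attributes):
        return variables, attributes

class Compose(Coder):
    # Chain coders pipeline-style; decode runs the chain in reverse
    # so encode and decode stay inverses of each other.
    def __init__(self, *coders):
        self.coders = coders

    def encode(self, variables, attributes):
        for coder in self.coders:
            variables, attributes = coder.encode(variables, attributes)
        return variables, attributes

    def decode(self, variables, attributes):
        for coder in reversed(self.coders):
            variables, attributes = coder.decode(variables, attributes)
        return variables, attributes

class MaskCoder(Coder):
    # Toy stand-in: just records that masking was applied.
    def encode(self, variables, attributes):
        return variables, {**attributes, 'masked': True}

CFCoder = Compose(MaskCoder())
variables, attributes = CFCoder.encode({}, {})
assert attributes == {'masked': True}
```

The appeal of this shape is exactly what the comment says: each coder is a pure transformation, so composing them needs no shared state and no datastore subclassing.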

@@ -88,6 +88,7 @@ def _ensure_fill_value_valid(data, attributes):
attributes['_FillValue'] = np.string_(attributes['_FillValue'])


@cf_encoded
Do we really want to require that NetCDFs always be CF encoded? This could make it harder to extend.

shoyer commented Jun 26, 2014

Another possibility to think about which I think would be a bit cleaner if we could pull it off -- decode/encode could take and return two arguments (variables, attributes) instead of actual datastores.

def set_variable(self, k, v):
    new_var = copy.deepcopy(v)
    # we copy the variable and stuff all encodings in the
    # attributes to imitate what happens when writing to disk.
Perhaps we need two different types of stores: one that does this sort of thing (copying variable data like an on-disk store) and another for use as the return value from encoders/decoders. I don't think we want this copying behavior for the latter.

shoyer commented Jun 26, 2014

@leon-barrett any thoughts on the design here?

@shoyer shoyer mentioned this pull request Jul 15, 2014
shoyer commented Sep 23, 2014

I would like to clean up this PR and get it merged.

One useful functionality that would be good to handle at the same time is ensuring that we have an interface that lets us encode or decode a dataset in memory, not just when loading or saving to disk.

shoyer commented Sep 26, 2014

Here are some use cases for encoding:

  1. save or load a dataset to a file using conventions
  2. encode or decode a dataset to facilitate loading to or from another library (e.g., Iris or CDAT)
  3. load a dataset that doesn't quite satisfy conventions from disk, fix it up, and then decode it.
  4. directly use the dataset constructor to input encoded data, and then decode it

This patch does 1 pretty well, but not the others. I think the cleanest way to handle everything would be to separate Conventions from DataStores. That way we could also let you write something like ds.decode(conventions=CFConventions) (or even just ds.decode('CF') or ds.decode() for short) to decode a dataset into another dataset.

So the user would only need to write something that looks like this, instead of a subclass of AbstractEncodedDataStore:

class Conventions(object):
    def encode(self, arrays, attrs):
        return arrays, attrs
    def decode(self, arrays, attrs):
        return arrays, attrs

The bonus here is that Conventions doesn't need to relate to data stores at all, and there's no danger of undesirable coupling. We could even have xray.create_conventions(encoder, decoder) as a shortcut to writing the class.
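The `create_conventions` shortcut mentioned above could be sketched as a small factory; this is purely illustrative (the function body and the identity defaults are assumptions, not a proposed xray API):

```python
def create_conventions(encoder=None, decoder=None):
    # Hypothetical shortcut: build a Conventions-style object from two
    # plain functions, defaulting to identity transforms so that a
    # decode-only (or encode-only) conventions object is trivial.
    identity = lambda arrays, attrs: (arrays, attrs)

    class _Conventions:
        def encode(self, arrays, attrs):
            return (encoder or identity)(arrays, attrs)

        def decode(self, arrays, attrs):
            return (decoder or identity)(arrays, attrs)

    return _Conventions()

# Decode-only conventions, matching the "only need to read some
# obscure format" use case from the earlier comment:
conv = create_conventions(
    decoder=lambda arrays, attrs: (arrays, {**attrs, 'decoded': True}))
arrays, attrs = conv.decode({}, {})
assert attrs == {'decoded': True}
arrays, attrs = conv.encode({}, {'a': 1})
assert attrs == {'a': 1}
```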

shoyer commented Sep 30, 2014

OK, I made a brief attempt at trying to get my proposal to work, but I don't think I'm smart enough to figure out a completely general approach.

@akleeman what would be the minimal new API that would suffice for your needs? I would rather not commit ourselves to supporting a fully extensible approach to encoding/decoding in the public API at this point (mostly because I suspect we won't get it right).

shoyer commented Oct 5, 2014

@akleeman I just read over my rebased versions of this patch again, and unfortunately, although there are some useful features here (missing value support and not writing trivial indexes), overall I don't think this is the right approach.

The idea of decoding data stores into "CF decoded" data stores is clever, but (1) it adds a large amount of indirection/complexity and (2) it's not even flexible enough (e.g., it won't suffice to decode coordinates, since those only exist on datasets). Functions for CF decoding/encoding that transform a datastore to a dataset directly (and vice versa), more similar to the existing design, seem like a better option overall.

As we discussed, adding an argument like an array_hook to open_dataset and to_netcdf (patterned off of object_hook from the json module) should suffice for at least our immediate custom encoding/decoding needs.
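For reference, the object_hook pattern from the json module that the proposed array_hook is modeled on works like this (this uses the real stdlib API; the hook function itself is just a toy):

```python
import json

# json.loads calls object_hook on every decoded dict, letting the
# caller intercept and transform objects during decoding -- the same
# shape proposed for an array_hook in open_dataset / to_netcdf.
def as_upper_keys(obj):
    return {k.upper(): v for k, v in obj.items()}

result = json.loads('{"name": "t.nc", "dims": {"x": 10}}',
                    object_hook=as_upper_keys)
assert result == {'NAME': 't.nc', 'DIMS': {'X': 10}}
```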

To make things more extensible, let's add a handful of utility functions/classes to the public API (e.g., NDArrayMixin), and break down existing functions like encode_cf_variable/decode_cf_variable into more modular/extensible components.

shoyer commented Oct 8, 2014

continued in #245.

@shoyer shoyer closed this Oct 8, 2014
keewis pushed a commit to keewis/xarray that referenced this pull request Jan 17, 2024
* tests

* drop_nodes implementation

* whatsnew

* added drop_nodes to API docs page