
Modular encoding #175


Closed
wants to merge 3 commits into from

Conversation

akleeman
Contributor

Restructured Backends to make CF conventions handling consistent.

Among other things this includes:

  • EncodedDataStores which can wrap other stores and allow
    for modular encoding/decoding.
  • Trivial indices ds['x'] = ('x', np.arange(10)) are no longer
    stored on disk and are only created when accessed.
  • AbstractDataStore API change. Shouldn't affect external users.
  • missing_value attributes now function like _FillValue

All current tests are passing (though it could use more new ones).

Alex Kleeman added 3 commits June 25, 2014 02:37
fill_value = missing_value
# if both were given we make sure they are the same.
if fill_value is not None and missing_value is not None:
    assert fill_value == missing_value
This will fail if both are nan.... try utils.equivalent instead of ==?
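The problem with a plain `==` is that `nan == nan` is always `False` in IEEE arithmetic, so two identical NaN fill values would trip the assertion. A minimal sketch of the kind of comparison the reviewer is suggesting (the helper name `equivalent` mirrors the `utils.equivalent` mentioned above; the body here is an illustrative assumption, not xray's actual implementation):

```python
import numpy as np

def equivalent(first, second):
    # Treat two NaN values as equal, unlike plain ==.
    if first is second:
        return True
    try:
        if np.isnan(first) and np.isnan(second):
            return True
    except TypeError:
        pass  # non-numeric values can't be NaN
    return first == second

# Plain == fails for NaN, but the helper does not:
assert not (np.nan == np.nan)
assert equivalent(np.nan, np.nan)
assert equivalent(-9999, -9999)
```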

shoyer commented Jun 26, 2014

Generally this looks very nice! I'm still mulling over the API -- it seems close but not quite right yet. In particular, using a decorator + inheritance to implement custom encoding seems like too much.

def __init__(self, dict_store=None):
    self.dimensions = OrderedDict()
    self.variables = OrderedDict()
    self.attributes = OrderedDict()
How about making the signature closer to a Dataset, like this:

def __init__(self, variables=None, attributes=None):
    ...

It's pretty straightforward to use dict unpacking if you want to pass in a dict_store object like you're currently doing.
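To illustrate the dict-unpacking point: if the existing data is already a dict keyed by `variables` and `attributes`, `**` unpacking passes it straight into the Dataset-like signature. A minimal sketch (the `InMemoryDataStore` class here is hypothetical, just to show the calling convention):

```python
from collections import OrderedDict

class InMemoryDataStore:
    # Hypothetical store with the Dataset-like signature suggested above.
    def __init__(self, variables=None, attributes=None):
        self.variables = OrderedDict(variables or {})
        self.attributes = OrderedDict(attributes or {})

# An existing dict-of-dicts unpacks directly into the new signature:
dict_store = {'variables': {'x': [1, 2, 3]}, 'attributes': {'title': 'demo'}}
store = InMemoryDataStore(**dict_store)
assert store.variables['x'] == [1, 2, 3]
assert store.attributes['title'] == 'demo'
```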

shoyer commented Jun 26, 2014

Some additional thoughts on the advanced interface (we should discuss more when we're both awake):

The advanced interface should support specifying decoding or encoding only (e.g., if we only need to read some obscure format, not write it).

Instead of all these DataStore subclasses, what about having more generic Coder (singleton?) objects which have encode and/or decode methods, but that don't store any data? It is somewhat confusing to keep track of state with all these custom datastores.

A neat trick about coders rather than data stores is that they are simple enough that they could be easily composed, similar to a scikit-learn pipeline. For example, the default CFCoder could be written as something like:

CFCoder = Compose(TimeCoder, ScaleCoder, MaskCoder, DTypeCoder)

I suppose the painful part of using coders is the need to close files on disk to cleanup. Still, I would rather have a single EncodedDataStore class which is initialized with coder and underlying_store arguments (either classes or objects) rather than have the primary interface be writing custom AbstractEncodedDataStore subclasses. That feels like unnecessary complexity.

Instead of using the decorator, the interface could look something like:

CFNetCDFStore = EncodedDataStore(CFCoder, NetCDF4DataStore)

or

my_store = EncodedDataStore(CFCoder(**my_options), NetCDF4DataStore('test.nc'))
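A minimal sketch of what the stateless, composable `Coder` idea above could look like (the class names follow the comment; everything else, including using coder instances rather than classes and a toy `MaskCoder`, is an illustrative assumption):

```python
class Coder:
    # Stateless coder: transforms (variables, attributes) pairs in
    # both directions but stores no data itself.
    def encode(self, variables, attributes):
        return variables, attributes

    def decode(self, variables, attributes):
        return variables, attributes

class Compose(Coder):
    # Chain coders pipeline-style; decode runs the chain in reverse
    # so encode and decode stay inverses of each other.
    def __init__(self, *coders):
        self.coders = coders

    def encode(self, variables, attributes):
        for coder in self.coders:
            variables, attributes = coder.encode(variables, attributes)
        return variables, attributes

    def decode(self, variables, attributes):
        for coder in reversed(self.coders):
            variables, attributes = coder.decode(variables, attributes)
        return variables, attributes

class MaskCoder(Coder):
    # Toy stand-in: just records that masking was applied.
    def encode(self, variables, attributes):
        return variables, {**attributes, 'masked': True}

CFCoder = Compose(MaskCoder())
variables, attributes = CFCoder.encode({}, {})
assert attributes == {'masked': True}
```

The appeal of this shape is exactly what the comment says: each coder is a pure transformation, so composing them needs no shared state and no datastore subclassing.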

@@ -88,6 +88,7 @@ def _ensure_fill_value_valid(data, attributes):
attributes['_FillValue'] = np.string_(attributes['_FillValue'])


@cf_encoded
Do we really want to require that NetCDFs always be CF encoded? This could make it harder to extend.

shoyer commented Jun 26, 2014

Another possibility to think about which I think would be a bit cleaner if we could pull it off -- decode/encode could take and return two arguments (variables, attributes) instead of actual datastores.

def set_variable(self, k, v):
    new_var = copy.deepcopy(v)
    # we copy the variable and stuff all encodings in the
    # attributes to imitate what happens when writing to disk.
Perhaps we need two different types of stores: one that does this sort of thing (copying variable data like an on-disk store) and another for use as the return value from encoders/decoders. I don't think we want this copying behavior for the latter.

shoyer commented Jun 26, 2014

@leon-barrett any thoughts on the design here?

@shoyer shoyer mentioned this pull request Jul 15, 2014
shoyer commented Sep 23, 2014

I would like to clean up this PR and get it merged.

One useful functionality that would be good to handle at the same time is ensuring that we have an interface that lets us encode or decode a dataset in memory, not just when loading or saving to disk.

shoyer commented Sep 26, 2014

Here are some use cases for encoding:

  1. save or load a dataset to a file using conventions
  2. encode or decode a dataset to facilitate loading to or from another library (e.g., Iris or CDAT)
  3. load a dataset that doesn't quite satisfy conventions from disk, fix it up, and then decode it.
  4. directly use the dataset constructor to input encoded data, and then decode it

This patch does 1 pretty well, but not the others. I think the cleanest way to handle everything would be to separate Conventions from DataStores. That way we could also let you write something like ds.decode(conventions=CFConventions) (or even just ds.decode('CF') or ds.decode() for short) to decode a dataset into another dataset.

So the user would only need to write something that looks like this, instead of a subclass of AbstractEncodedDataStore:

class Conventions(object):
    def encode(self, arrays, attrs):
        return arrays, attrs
    def decode(self, arrays, attrs):
        return arrays, attrs

The bonus here is that Conventions doesn't need to relate to data stores at all, and there's no danger of undesirable coupling. We could even have xray.create_conventions(encoder, decoder) as a shortcut to writing the class.
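The `create_conventions` shortcut mentioned above could be sketched as a small factory; this is purely illustrative (the function body and the identity defaults are assumptions, not a proposed xray API):

```python
def create_conventions(encoder=None, decoder=None):
    # Hypothetical shortcut: build a Conventions-style object from two
    # plain functions, defaulting to identity transforms so that a
    # decode-only (or encode-only) conventions object is trivial.
    identity = lambda arrays, attrs: (arrays, attrs)

    class _Conventions:
        def encode(self, arrays, attrs):
            return (encoder or identity)(arrays, attrs)

        def decode(self, arrays, attrs):
            return (decoder or identity)(arrays, attrs)

    return _Conventions()

# Decode-only conventions, matching the "only need to read some
# obscure format" use case from the earlier comment:
conv = create_conventions(
    decoder=lambda arrays, attrs: (arrays, {**attrs, 'decoded': True}))
arrays, attrs = conv.decode({}, {})
assert attrs == {'decoded': True}
arrays, attrs = conv.encode({}, {'a': 1})
assert attrs == {'a': 1}
```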

shoyer commented Sep 30, 2014

OK, I made a brief attempt at trying to get my proposal to work, but I don't think I'm smart enough to figure out a completely general approach.

@akleeman what would be the minimal new API that would suffice for your needs? I would rather not commit ourselves to supporting a fully extensible approach to encoding/decoding in the public API at this point (mostly because I suspect we won't get it right).

shoyer commented Oct 5, 2014

@akleeman I just read over my rebased versions of this patch again, and unfortunately, although there are some useful features here (missing value support and not writing trivial indexes), overall I don't think this is the right approach.

The idea of decoding data stores into "CF decoded" data stores is clever, but (1) it adds a large amount of indirection/complexity and (2) it's not even flexible enough (e.g., it won't suffice to decode coordinates, since those only exist on datasets). Functions for CF decoding/encoding that transform a datastore to a dataset directly (and vice versa), more similar to the existing design, seem like a better option overall.

As we discussed, adding an argument like an array_hook to open_dataset and to_netcdf (patterned off of object_hook from the json module) should suffice for at least our immediate custom encoding/decoding needs.
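For reference, the object_hook pattern from the json module that the proposed array_hook is modeled on works like this (this uses the real stdlib API; the hook function itself is just a toy):

```python
import json

# json.loads calls object_hook on every decoded dict, letting the
# caller intercept and transform objects during decoding -- the same
# shape proposed for an array_hook in open_dataset / to_netcdf.
def as_upper_keys(obj):
    return {k.upper(): v for k, v in obj.items()}

result = json.loads('{"name": "t.nc", "dims": {"x": 10}}',
                    object_hook=as_upper_keys)
assert result == {'NAME': 't.nc', 'DIMS': {'X': 10}}
```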

To make things more extensible, let's add a handful of utility functions/classes to the public API (e.g., NDArrayMixin), and break down existing functions like encode_cf_variable/decode_cf_variable into more modular/extensible components.

shoyer commented Oct 8, 2014

continued in #245.

@shoyer shoyer closed this Oct 8, 2014
keewis pushed a commit to keewis/xarray that referenced this pull request Jan 17, 2024
* tests

* drop_nodes implementation

* whatsnew

* added drop_nodes to API docs page