Modular encoding #175
Conversation
Restructured Backends to make CF conventions handling consistent. Among other things this includes:

- EncodedDataStores, which can wrap other stores and allow for modular encoding/decoding.
- Trivial indices (`ds['x'] = ('x', np.arange(10))`) are no longer stored on disk and are only created when accessed.
- AbstractDataStore API change. Shouldn't affect external users.
- `missing_value` attributes now function like `_FillValue`.

All current tests are passing (though it could use more new ones).
```python
fill_value = missing_value
# if both were given we make sure they are the same.
if fill_value is not None and missing_value is not None:
    assert fill_value == missing_value
```
This will fail if both are NaN... try `utils.equivalent` instead of `==`?
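For context, `==` on two NaNs is always False, which is why the assert above would fail. A stdlib-only sketch of a NaN-tolerant check in the spirit of `utils.equivalent` (the real helper also handles arrays; this scalar version is only illustrative):

```python
import math

def equivalent(a, b):
    # Unlike `==`, treat two NaNs as equal, so a fill_value and
    # missing_value that are both NaN pass the consistency check.
    both_nan = (isinstance(a, float) and isinstance(b, float)
                and math.isnan(a) and math.isnan(b))
    return both_nan or a == b
```

With this, `assert equivalent(fill_value, missing_value)` would pass when both values are NaN.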
Generally this looks very nice! I'm still mulling over the API -- it seems close but not quite right yet. In particular, using a decorator + inheritance to implement custom encoding seems like too much.
```python
def __init__(self, dict_store=None):
    self.dimensions = OrderedDict()
    self.variables = OrderedDict()
    self.attributes = OrderedDict()
```
How about making the signature closer to a Dataset, like this:

```python
def __init__(self, variables=None, attributes=None):
    ...
```
It's pretty straightforward to use dict unpacking if you want to pass in a dict_store object like you're currently doing.
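For illustration, the dict-unpacking pattern mentioned here could look like the following (`InMemoryDataStore` is a toy stand-in for the class under discussion, not its actual implementation):

```python
from collections import OrderedDict

class InMemoryDataStore:
    # Proposed signature, mirroring Dataset(variables, attributes).
    def __init__(self, variables=None, attributes=None):
        self.variables = OrderedDict(variables or {})
        self.attributes = OrderedDict(attributes or {})

# An existing dict with matching keys can be passed with ** unpacking:
state = {'variables': {'x': [1, 2, 3]},
         'attributes': {'title': 'example'}}
store = InMemoryDataStore(**state)
```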
Some additional thoughts on the advanced interface (we should discuss more when we're both awake):

The advanced interface should support specifying decoding or encoding only (e.g., if we only need to read some obscure format, not write it). Instead of all these DataStore subclasses, what about having more generic

A neat trick about coders rather than data stores is that they are simple enough that they could be easily composed, similar to a scikit-learn pipeline. For example, the default CFCoder could be written as something like:
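The code block that followed in the original comment did not survive extraction. A hypothetical sketch of what composable coders could look like (every name here -- `Coder`, `ScaleCoder`, `Pipeline` -- is invented for illustration):

```python
class Coder:
    """An invertible transformation on a variable's (data, attrs) pair."""
    def encode(self, data, attrs):
        return data, attrs
    def decode(self, data, attrs):
        return data, attrs

class ScaleCoder(Coder):
    # CF-style scale_factor handling: divide on encode, multiply on decode.
    def __init__(self, scale):
        self.scale = scale
    def encode(self, data, attrs):
        attrs = dict(attrs, scale_factor=self.scale)
        return [x / self.scale for x in data], attrs
    def decode(self, data, attrs):
        attrs = dict(attrs)
        scale = attrs.pop('scale_factor', 1)
        return [x * scale for x in data], attrs

class Pipeline(Coder):
    # Compose coders like a scikit-learn pipeline: encode runs them in
    # order, decode runs the inverses in reverse order.
    def __init__(self, coders):
        self.coders = coders
    def encode(self, data, attrs):
        for coder in self.coders:
            data, attrs = coder.encode(data, attrs)
        return data, attrs
    def decode(self, data, attrs):
        for coder in reversed(self.coders):
            data, attrs = coder.decode(data, attrs)
        return data, attrs
```

Composed this way, `Pipeline([ScaleCoder(10.0)])` round-trips: decoding the output of encode recovers the original data and attributes.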
I suppose the painful part of using coders is the need to close files on disk to cleanup. Still, I would rather have a single

Instead of using the decorator, the interface could look something like:

or
```diff
@@ -88,6 +88,7 @@ def _ensure_fill_value_valid(data, attributes):
         attributes['_FillValue'] = np.string_(attributes['_FillValue'])


+@cf_encoded
```
Do we really want to require that NetCDFs always be CF encoded? This could make it harder to extend.
Another possibility to think about, which I think would be a bit cleaner if we could pull it off: decode/encode could take and return two arguments
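The comment is cut off before naming the two arguments; a plausible reading is that decode/encode would take and return `(variables, attributes)` together. A purely illustrative sketch of that shape (the `Conventions` attribute handling here is invented for the example):

```python
def decode(variables, attributes):
    # Hypothetical two-argument form: decoding sees the global
    # attributes alongside the variables and can consume some of them.
    attributes = dict(attributes)
    attributes.pop('Conventions', None)  # consumed during decoding
    return variables, attributes

def encode(variables, attributes):
    # Encoding can add conventions metadata back in.
    attributes = dict(attributes, Conventions='CF-1.0')
    return variables, attributes
```

The appeal of this shape is that both directions are plain functions over the same pair of mappings, so they compose and invert naturally.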
```python
def set_variable(self, k, v):
    # we copy the variable and stuff all encodings in the
    # attributes to imitate what happens when writing to disk.
    new_var = copy.deepcopy(v)
```
Perhaps we need two different types of stores: one which does this sort of thing (copying variable data like an on-disk store) and another for use as the return value from encoders/decoders. I don't think we want this copying behavior for the latter.

@leon-barrett any thoughts on the design here?
I would like to clean up this PR and get it merged. One piece of functionality that would be good to handle at the same time is ensuring that we have an interface that lets us encode or decode a dataset in memory, not just when loading from or saving to disk.
Here are some use cases for encoding:
This patch does 1 pretty well, but not the others. I think the cleanest way to handle everything would be to separate Conventions from DataStores. That way we could also let you write something like

So the user would only need to write something that looks like this, instead of a subclass of
The bonus here is that
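The code that followed in this comment was also lost. One hedged guess at what "separating Conventions from DataStores" could look like (all names here -- `CFConventions`, `open_dataset`, the sentinel fill value -- are invented for illustration):

```python
class CFConventions:
    """Hypothetical conventions object: it owns encoding/decoding,
    while a data store only reads and writes raw values."""
    FILL = -999

    def decode_variable(self, data, attrs):
        attrs = dict(attrs)
        fill = attrs.pop('_FillValue', None)
        return [None if x == fill else x for x in data], attrs

    def encode_variable(self, data, attrs):
        attrs = dict(attrs, _FillValue=self.FILL)
        return [self.FILL if x is None else x for x in data], attrs

def open_dataset(store, conventions=None):
    # The store is a plain mapping of name -> (data, attrs); the
    # conventions object, supplied separately, does all the decoding.
    conventions = conventions or CFConventions()
    return {name: conventions.decode_variable(data, attrs)
            for name, (data, attrs) in store.items()}
```

Under this split, a user needing custom behavior would write a small conventions object rather than subclassing a data store.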
OK, I made a brief attempt at trying to get my proposal to work, but I don't think I'm smart enough to figure out a completely general approach. @akleeman what would be the minimal new API that would suffice for your needs? I would rather not commit ourselves to supporting a fully extensible approach to encoding/decoding in the public API at this point (mostly because I suspect we won't get it right).
@akleeman I just read over my rebased versions of this patch again, and unfortunately, although there are some useful features here (missing value support and not writing trivial indexes), overall I don't think this is the right approach.

The idea of decoding data stores into "CF decoded" data stores is clever, but (1) it adds a large amount of indirection/complexity and (2) it's not even flexible enough (e.g., it won't suffice to decode coordinates, since those only exist on datasets). Functions for CF decoding/encoding that transform a data store to a dataset directly (and vice versa), more similar to the existing design, seem like a better option overall.

As we discussed, adding an argument like an

To make things more extensible, let's add a handful of utility functions/classes to the public API (e.g.,
Continued in #245.