Skip to content

per-variable fill values #4237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Aug 24, 2020
Merged

per-variable fill values #4237

merged 21 commits into from
Aug 24, 2020

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Jul 17, 2020

This allows specifying different fill values per variable, defaulting to dtypes.NA. There's no documentation updates, yet: I'll work on that once I'm sure I found every function for which this change makes sense.

Here's a demo on how this works:
In [3]: ds = xr.Dataset( 
   ...:     {"a": ("x", [2, 3]), "b": ("x", [-9, 4])}, 
   ...:     coords={"x": [0, 1], "u": ("x", ["a", "b"])}, 
   ...: ) 
   ...: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 0 1
    u        (x) <U1 'a' 'b'
Data variables:
    a        (x) int64 2 3
    b        (x) int64 -9 4

In [4]: ds.reindex(x=[-1, 0])
Out[4]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 -1 0
    u        (x) object nan 'a'
Data variables:
    a        (x) float64 nan 2.0
    b        (x) float64 nan -9.0

In [5]: ds.reindex(x=[-1, 0], fill_value={"u": "z", "a": 10})
Out[5]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 -1 0
    u        (x) <U1 'z' 'a'
Data variables:
    a        (x) int64 10 2
    b        (x) float64 nan -9.0

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice to me.

Should we add more tests for higher level functions: for e.g. I think we may need to modify unstack also (I found this by searching for fill_value here: https://xarray.pydata.org/en/stable/api.html)

@keewis
Copy link
Collaborator Author

keewis commented Jul 17, 2020

Should we add more tests for higher level functions

I have been focusing on functions that use reindex, but I agree that others like fillna / bfill / ffill and unstack should allow mappings, too.

@keewis
Copy link
Collaborator Author

keewis commented Jul 23, 2020

Here's the list of functions I'm considering to extend:

  • concat / merge / align / combine_* (but aligning DataArray objects and combine_by_coords have no tests for different fill values)
  • apply_ufunc (?)
  • full_like (for Dataset args)
  • Dataset.reindex / Dataset.reindex_like
  • DataArray.reindex / DataArray.reindex_like (for non-dimension coordinates)
  • Dataset.unstack
  • DataArray.unstack (works since it delegates to Dataset.unstack, but doesn't have tests)
  • Dataset.merge
  • Dataset.shift
  • DataArray.shift (probably doesn't make sense, though)
  • Dataset.pad (already works)
  • DataArray.pad (already works)
  • Dataset.idxmin / Dataset.idxmax: uses Dataset.map and is quite difficult to extend
  • DataArray.idxmin / DataArray.idxmax: doesn't make much sense, I think
  • Dataset.fillna (already works for data variables, but difficult to extend to coordinates)
  • DataArray.fillna
  • Dataset.rolling.construct (not sure if that makes sense)
  • Dataset.interp / DataArray.interp (for extrapolation, using the kwargs parameter): we probably shouldn't support this, but instead recommend ds.interp(...).fillna({"a": ..., "b": ...}).

@dcherian
Copy link
Contributor

  1. I am unsure about apply_ufunc
  2. I think we can safely avoid interp and rolling.construct . It is easy to add a .fillna call
  3. Does shift not shift non-dimensional coordinates? If not, then I agree it doesn't make sense.

@keewis
Copy link
Collaborator Author

keewis commented Jul 23, 2020

  1. me neither, that's why I added the question mark (I'll investigate a bit more)
  2. agreed
  3. yep, the docs say:

Only data variables are moved; coordinates stay in place. This is consistent with the behavior of shift in pandas.

and the code doesn't touch the coordinates (not even non-dimension coordinates).

@keewis
Copy link
Collaborator Author

keewis commented Jul 25, 2020

apply_ufunc turns out to be way too complicated for me to change it without understanding most of the code, so I guess we should postpone that until someone needs multiple fill values on dataset objects?

I think fillna should allow filling coordinates (both dimension and non-dimension coordinates), but the current implementation simply uses apply_ufunc with duck_array_ops.fillna, so I'm not sure how to support that.

Same for idxmin and idxmax: they use Dataset.map or call the computation function directly, so we'd need to rewrite those functions to get fill values.

@keewis keewis force-pushed the fill-value-mapping branch from d420eb7 to e887777 Compare August 5, 2020 09:03
Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @keewis

This is looking great. I think it's OK to skip apply_ufunc, idxmin, idxmax etc.

However, it would be nice to include where since this is a common issue: #3390. But this can also go in a future PR. This is a big step forward already.

@keewis
Copy link
Collaborator Author

keewis commented Aug 19, 2020

does anyone have any advice on how to add multiple fill value support to fillna? It is currently using ops.fillna which uses apply_ufunc with duck_array_ops.fillna:

xarray/xarray/core/ops.py

Lines 137 to 171 in d9ebcaf

def fillna(data, other, join="left", dataset_join="left"):
"""Fill missing values in this object with data from the other object.
Follows normal broadcasting and alignment rules.
Parameters
----------
join : {"outer", "inner", "left", "right"}, optional
Method for joining the indexes of the passed objects along each
dimension
- "outer": use the union of object indexes
- "inner": use the intersection of object indexes
- "left": use indexes from the first object with each dimension
- "right": use indexes from the last object with each dimension
- "exact": raise `ValueError` instead of aligning when indexes to be
aligned are not equal
dataset_join : {"outer", "inner", "left", "right"}, optional
Method for joining variables of Dataset objects with mismatched
data variables.
- "outer": take variables from both Dataset objects
- "inner": take only overlapped variables
- "left": take only variables from the first object
- "right": take only variables from the last object
"""
from .computation import apply_ufunc
return apply_ufunc(
duck_array_ops.fillna,
data,
other,
join=join,
dask="allowed",
dataset_join=dataset_join,
dataset_fill_value=np.nan,
keep_attrs=True,
)
but that obviously only works for data variables.

To fix that, I would manually apply duck_array_ops.fillna to data variables and coordinates and then reassemble the dataset (using _to_temp_dataset / _from_temp_dataset for DataArray), but since I don't really understand apply_ufunc I'm not sure what that would break.

@shoyer
Copy link
Member

shoyer commented Aug 19, 2020

To fix that, I would manually apply duck_array_ops.fillna to data variables and coordinates and then reassemble the dataset (using _to_temp_dataset / _from_temp_dataset for DataArray), but since I don't really understand apply_ufunc, I'm not sure what that would break.

I think this would be a fine way to it, though it does feel rather complicated. Per variable fill-values doesn't quite fit the model of apply_ufunc when applied to entire Dataset/DataArray objects.

@dcherian
Copy link
Contributor

Ya I can't think of a better way than looping through the variables.

@dcherian
Copy link
Contributor

shall we merge and leave the rest to future PRs?

@keewis
Copy link
Collaborator Author

keewis commented Aug 24, 2020

👍. I tried modifying fillna, but that turned out to be harder than I expected.

As a summary, per-variable fill values for fillna and where are still left for other PRs.

@dcherian
Copy link
Contributor

Thanks @keewis

@dcherian dcherian merged commit a36d0a1 into pydata:master Aug 24, 2020
@keewis keewis deleted the fill-value-mapping branch August 24, 2020 22:04
@dcherian
Copy link
Contributor

oops this is missing a whats-new entry.

@keewis
Copy link
Collaborator Author

keewis commented Aug 24, 2020

I'll add one in one of the follow-up PRs

@keewis keewis mentioned this pull request Sep 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allow specifying a fill value per variable
3 participants