per-variable fill values #4237

keewis · 2020-07-17T12:52:17Z

This allows specifying different fill values per variable, defaulting to dtypes.NA. There are no documentation updates yet: I'll work on those once I'm sure I have found every function for which this change makes sense.

Here's a demo on how this works:

In [3]: ds = xr.Dataset( 
   ...:     {"a": ("x", [2, 3]), "b": ("x", [-9, 4])}, 
   ...:     coords={"x": [0, 1], "u": ("x", ["a", "b"])}, 
   ...: ) 
   ...: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 0 1
    u        (x) <U1 'a' 'b'
Data variables:
    a        (x) int64 2 3
    b        (x) int64 -9 4

In [4]: ds.reindex(x=[-1, 0])
Out[4]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 -1 0
    u        (x) object nan 'a'
Data variables:
    a        (x) float64 nan 2.0
    b        (x) float64 nan -9.0

In [5]: ds.reindex(x=[-1, 0], fill_value={"u": "z", "a": 10})
Out[5]: 
<xarray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 -1 0
    u        (x) <U1 'z' 'a'
Data variables:
    a        (x) int64 10 2
    b        (x) float64 nan -9.0
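For readers who want to picture the lookup: the following is a minimal pure-Python sketch, not xarray's implementation. It assumes variables are plain lists on a single shared index, and the names NA, resolve_fill_value and reindex_variables are hypothetical. It only shows how a mapping could fall back to a default for variables it does not mention (here "b"); it does not model dtype promotion.

```python
import math

# Hypothetical stand-in for xarray's dtypes.NA sentinel.
NA = object()


def resolve_fill_value(fill_value, name):
    """Return the fill value for one variable, or NA if none was given."""
    if isinstance(fill_value, dict):
        return fill_value.get(name, NA)
    # A scalar applies to every variable.
    return fill_value


def reindex_variables(variables, old_index, new_index, fill_value=NA):
    """Reindex plain-list variables that share a single 1-D index."""
    result = {}
    for name, values in variables.items():
        fill = resolve_fill_value(fill_value, name)
        if fill is NA:
            fill = math.nan  # default missing-value marker in this sketch
        lookup = dict(zip(old_index, values))
        result[name] = [lookup.get(label, fill) for label in new_index]
    return result


variables = {"a": [2, 3], "b": [-9, 4], "u": ["a", "b"]}
out = reindex_variables(
    variables, old_index=[0, 1], new_index=[-1, 0], fill_value={"u": "z", "a": 10}
)
print(out)
```

As in the demo, "a" and "u" pick up their own fill values while "b" falls back to the default.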

Closes allow specifying a fill value per variable #4165
Tests added
Passes isort . && black . && mypy . && flake8
User visible changes (including notable bug fixes) are documented in whats-new.rst
New functions/methods are listed in api.rst

dcherian

Looks nice to me.

Should we add more tests for higher-level functions? For example, I think we may need to modify unstack as well (I found this by searching for fill_value here: https://xarray.pydata.org/en/stable/api.html).

keewis · 2020-07-17T22:08:31Z

Should we add more tests for higher level functions

I have been focusing on functions that use reindex, but I agree that others like fillna / bfill / ffill and unstack should allow mappings, too.

keewis · 2020-07-23T13:58:43Z

dcherian · 2020-07-23T15:13:11Z

I am unsure about apply_ufunc
I think we can safely skip interp and rolling.construct; it is easy to add a .fillna call.
Does shift not shift non-dimensional coordinates? If not, then I agree it doesn't make sense.

keewis · 2020-07-23T15:24:19Z

me neither, that's why I added the question mark (I'll investigate a bit more)
agreed
yep, the docs say:

Only data variables are moved; coordinates stay in place. This is consistent with the behavior of shift in pandas.

and the code doesn't touch the coordinates (not even non-dimension coordinates).
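To illustrate the behaviour described here, a small pure-Python sketch (not xarray code; shift_values and shift_dataset are hypothetical names, and variables are assumed to be plain 1-D lists): data variables are shifted and padded with their own fill value, while all coordinates are passed through unchanged.

```python
import math


def shift_values(values, offset, fill):
    """Shift a list by `offset` positions, padding the gap with `fill`."""
    n = len(values)
    k = min(abs(offset), n)
    if offset >= 0:
        return [fill] * k + values[: n - k]
    return values[k:] + [fill] * k


def shift_dataset(data_vars, coords, offset, fill_value=math.nan):
    """Shift data variables only; coordinates are copied as they are."""
    shifted = {}
    for name, values in data_vars.items():
        if isinstance(fill_value, dict):
            # Variables missing from the mapping get the default (NaN here).
            fill = fill_value.get(name, math.nan)
        else:
            fill = fill_value
        shifted[name] = shift_values(values, offset, fill)
    return shifted, dict(coords)


data_vars = {"a": [2, 3, 5], "b": [-9, 4, 1]}
coords = {"x": [0, 1, 2], "u": ["a", "b", "c"]}
shifted, kept = shift_dataset(data_vars, coords, 1, fill_value={"a": 0, "b": -1})
print(shifted, kept)
```

Because coordinates never move, a fill value for a coordinate such as "u" would have nothing to fill, which is why a mapping only makes sense for data variables in this case.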

keewis · 2020-07-25T00:03:11Z

apply_ufunc turns out to be way too complicated for me to change it without understanding most of the code, so I guess we should postpone that until someone needs multiple fill values on dataset objects?

I think fillna should allow filling coordinates (both dimension and non-dimension coordinates), but the current implementation simply uses apply_ufunc with duck_array_ops.fillna, so I'm not sure how to support that.

Same for idxmin and idxmax: they use Dataset.map or call the computation function directly, so we'd need to rewrite those functions to get fill values.

dcherian

Thanks @keewis

This is looking great. I think it's OK to skip apply_ufunc, idxmin, idxmax etc.

However, it would be nice to include where since this is a common issue: #3390. But this can also go in a future PR. This is a big step forward already.

keewis · 2020-08-19T17:02:55Z

does anyone have any advice on how to add multiple fill value support to fillna? It is currently using ops.fillna which uses apply_ufunc with duck_array_ops.fillna:

xarray/xarray/core/ops.py, lines 137 to 171 in d9ebcaf:

def fillna(data, other, join="left", dataset_join="left"):
    """Fill missing values in this object with data from the other object.
    Follows normal broadcasting and alignment rules.

    Parameters
    ----------
    join : {"outer", "inner", "left", "right"}, optional
        Method for joining the indexes of the passed objects along each
        dimension
        - "outer": use the union of object indexes
        - "inner": use the intersection of object indexes
        - "left": use indexes from the first object with each dimension
        - "right": use indexes from the last object with each dimension
        - "exact": raise `ValueError` instead of aligning when indexes to be
          aligned are not equal
    dataset_join : {"outer", "inner", "left", "right"}, optional
        Method for joining variables of Dataset objects with mismatched
        data variables.
        - "outer": take variables from both Dataset objects
        - "inner": take only overlapped variables
        - "left": take only variables from the first object
        - "right": take only variables from the last object
    """
    from .computation import apply_ufunc

    return apply_ufunc(
        duck_array_ops.fillna,
        data,
        other,
        join=join,
        dask="allowed",
        dataset_join=dataset_join,
        dataset_fill_value=np.nan,
        keep_attrs=True,
    )

but that obviously only works for data variables.

To fix that, I would manually apply duck_array_ops.fillna to data variables and coordinates and then reassemble the dataset (using _to_temp_dataset / _from_temp_dataset for DataArray), but since I don't really understand apply_ufunc I'm not sure what that would break.
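The loop proposed above might look roughly like this. It is a pure-Python sketch under simplifying assumptions, not a patch against xarray: variables are plain lists, missing values are None or NaN, and is_missing, fillna_values and fillna_per_variable are hypothetical names. It deliberately ignores alignment, broadcasting and dask, which is what apply_ufunc handles in the real code.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fillna_values(values, fill):
    """Replace missing entries of one variable with `fill`."""
    return [fill if is_missing(v) else v for v in values]


def fillna_per_variable(data_vars, coords, fill_values):
    """Fill data variables and coordinates separately, then return both."""

    def fill_group(group):
        return {
            name: fillna_values(values, fill_values[name])
            if name in fill_values
            else list(values)  # no fill value given: leave untouched
            for name, values in group.items()
        }

    return fill_group(data_vars), fill_group(coords)


data_vars = {"a": [math.nan, 2.0], "b": [math.nan, -9.0]}
coords = {"x": [-1, 0], "u": [None, "a"]}
new_vars, new_coords = fillna_per_variable(data_vars, coords, {"a": 10.0, "u": "z"})
print(new_vars, new_coords)
```

Here the non-dimension coordinate "u" is filled alongside the data variable "a", while "b" and "x" are left as they were.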

shoyer · 2020-08-19T20:02:23Z

To fix that, I would manually apply duck_array_ops.fillna to data variables and coordinates and then reassemble the dataset (using _to_temp_dataset / _from_temp_dataset for DataArray), but since I don't really understand apply_ufunc, I'm not sure what that would break.

I think this would be a fine way to do it, though it does feel rather complicated. Per-variable fill values don't quite fit the model of apply_ufunc when applied to entire Dataset/DataArray objects.

dcherian · 2020-08-20T18:40:33Z

Ya I can't think of a better way than looping through the variables.

dcherian · 2020-08-24T15:59:40Z

shall we merge and leave the rest to future PRs?

keewis · 2020-08-24T16:03:38Z

👍. I tried modifying fillna, but that turned out to be harder than I expected.

As a summary, per-variable fill values for fillna and where are still left for other PRs.

dcherian · 2020-08-24T22:03:09Z

Thanks @keewis

dcherian · 2020-08-24T22:10:41Z

oops this is missing a whats-new entry.

keewis · 2020-08-24T22:14:27Z

I'll add one in one of the follow-up PRs

keewis added 2 commits July 17, 2020 14:34

implement the fill_value mapping

c978ecc

get per-variable fill_values to work in DataArray.reindex

c789953

dcherian reviewed Jul 17, 2020

xarray/core/alignment.py

xarray/tests/test_dataset.py

xarray/core/dataarray.py (outdated)

shoyer reviewed Jul 17, 2020

xarray/core/dataarray.py (outdated)

xarray/core/dataarray.py

keewis and others added 2 commits July 17, 2020 23:46

Update xarray/core/dataarray.py

9b13e0f

Co-authored-by: Stephan Hoyer

check that the default value is used

e27fdf2

keewis added 6 commits July 23, 2020 15:01

check that merge works with multiple fill values

b725a0f

check that concat works with multiple fill values

cac6dde

check that combine_nested works with multiple fill values

95a3824

check that Dataset.reindex and DataArray.reindex work

f58624f

check that aligning Datasets works

5361214

check that Dataset.unstack works

a9ec54c

Merge branch 'master' into fill-value-mapping

58246c6

allow passing multiple fill values to full_like with datasets

ba8f77c

also allow overriding the dtype by variable

e887777

keewis force-pushed the fill-value-mapping branch from d420eb7 to e887777 Compare August 5, 2020 09:03

keewis added 7 commits August 5, 2020 11:14

document the dict fill values in Dataset.reindex

923c417

document the changes to DataArray.reindex

e9b191c

document the changes to unstack

060626e

document the changes to align

47b61d3

document the changes to concat and merge

42d7001

document the changes to Dataset.shift

0c3b17b

document the changes to combine_*

268783a

dcherian approved these changes Aug 14, 2020

xarray/core/merge.py

Merge branch 'master' into fill-value-mapping

bcaf16e

dcherian merged commit a36d0a1 into pydata:master Aug 24, 2020

keewis deleted the fill-value-mapping branch August 24, 2020 22:04

keewis mentioned this pull request Sep 18, 2020

Release notes for 0.16.1 #4435

Merged
