xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

yt87 · 2019-04-05T20:06:35Z

Here is the original (much truncated) file.

> ncdump ak.nc
netcdf ak {
dimensions:
        npts = UNLIMITED ; // (2 currently)
        ntimes = 4 ;
variables:
        short tmpk(npts, ntimes) ;
                tmpk:description = "2 m Temperature - closest to top of hour" ;
                tmpk:units = "K" ;
                tmpk:level = "2 m" ;
                tmpk:period_variable = "ntimes1" ;
                tmpk:missing_value = -9999s ;
                tmpk:scale_factor = 0.01 ;

// global attributes:
                :source = "ak-obs" ;
data:

 tmpk =
  26915, 27755, -9999, 27705,
  25595, -9999, 28315, -9999 ;
}

Python code:

ds = xr.open_dataset('ak.nc')
ds.to_netcdf('akbad.nc')
ds
ds['tmpk']

<xarray.Dataset>
Dimensions:  (npts: 2, ntimes: 4)
Dimensions without coordinates: npts, ntimes
Data variables:
    tmpk     (npts, ntimes) float32 ...
Attributes:
    source:   ak-obs

<xarray.DataArray 'tmpk' (npts: 2, ntimes: 4)>
array([[269.15, 277.55,    nan, 277.05],
       [255.95,    nan, 283.15,    nan]], dtype=float32)
Dimensions without coordinates: npts, ntimes
Attributes:
    description:      2 m Temperature - closest to top of hour
    units:            K
    level:            2 m
    period_variable:  ntimes1

File written to disk:

> ncdump akbad.nc
netcdf akbad {
dimensions:
        npts = UNLIMITED ; // (2 currently)
        ntimes = 4 ;
variables:
        short tmpk(npts, ntimes) ;
                tmpk:description = "2 m Temperature - closest to top of hour" ;
                tmpk:units = "K" ;
                tmpk:level = "2 m" ;
                tmpk:period_variable = "ntimes1" ;
                tmpk:scale_factor = 0.01 ;

// global attributes:
                :source = "ak-obs" ;
data:

 tmpk =
  26915, 27755, 0, 27705,
  25595, 0, 28315, 0 ;
}

To confuse matter more, I am getting a warning:

SerializationWarning: saving variable tmpk with floating point data as an integer dtype without any _FillValue to use for NaNs

This might make sense, since tmpk after decoding has datatype float32, however somehow the original variable type and scale_factor are preserved, but missing_value attribute disappears and value written back is wrong: 0.
I want to write back tmpk as float number and let zlib worry about disk space.

xr.show_versions()
INSTALLED VERSIONS

commit: None
python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.5.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.15.4
scipy: 1.2.1
netCDF4: 1.5.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.8.0
pip: 19.0.3
conda: 4.6.10
pytest: 4.3.1
IPython: 7.4.0
sphinx: 1.8.5

The text was updated successfully, but these errors were encountered:

shoyer · 2019-04-06T00:49:39Z

I think this could be fixed by making CFMaskCoder handle missing_value and _FillValue in a symmetric way in encoding/decoding:

xarray/xarray/coding/variables.py

Line 142 in 23d54a8

class CFMaskCoder(VariableCoder):

Right now, it only uses missing_value in the decode() method.

yt87 · 2019-04-06T05:24:52Z

Indeed it works. Thanks. My quick fix:

$ diff variables.py variables.py.orig
152,155d151
<         elif encoding.get('missing_value') is not None:
<             fill_value = pop_to(encoding, attrs, 'missing_value', name=name)
<             if not pd.isnull(fill_value):
<                 data = duck_array_ops.fillna(data, fill_value)

I also figured out how to write back floating point values: encoding=None means use existing values,
so specifying encoding={'tmpk': {}} in to_netcdf() did the trick. Should there be an option for this? What you see on the screen is not what you get in the file.

shoyer · 2019-04-07T20:11:05Z

Indeed it works. Thanks. My quick fix:
$ diff variables.py variables.py.orig
152,155d151
<         elif encoding.get('missing_value') is not None:
<             fill_value = pop_to(encoding, attrs, 'missing_value', name=name)
<             if not pd.isnull(fill_value):
<                 data = duck_array_ops.fillna(data, fill_value)
I also figured out how to write back floating point values: encoding=None means use existing values,
so specifying encoding={'tmpk': {}} in to_netcdf() did the trick. Should there be an option for this? What you see on the screen is not what you get in the file.

Yes, this is a challenging design problem and the current state of affairs is definitely not ideal. If you have ideas about how to reveal the presence of an encoding dict associated with a variable without overwhelming the user with information those would be very welcome. The alternative would be to not keep around encoding at all, but I think users do appreciate that reading/writing a netCDF file generally returns the exact same thing.

yt87 · 2019-04-08T22:35:10Z

After rethinking the issue, I would drop it: one can simply pass ds.fromkeys(ds.data_vars.keys(), {}) as the encoding attribute.
Going back to the original problem. The fix above is not enough, the SerializationWarning is still present. An alternative, provided that missing_value attribute is still considered deprecated: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.1/build/cf-conventions.html#missing-data, would be to replace it by _FillValue on decoding:

$ diff variables.py variables.py.orig 
179,180d178
<             if 'FillValue' not in encoding:
<                 encoding['_FillValue'] = encoding.pop('missing_value')``

Fixes pydata#2871

dopplershift · 2019-06-07T15:06:29Z

Just to correct something here, missing_value is no longer considered deprecated in the current version of the standard: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html

Near the bottom is a specific note about removing the deprecation.

* More support for missing_value. Fixes #2871 * lint fixes. * Use not equivalent instead of not equals check. * lint fix. * if → elif so we don't call fillna twice * Better fix.

shoyer added topic-CF conventions enhancement contrib-good-first-issue labels Apr 7, 2019

dcherian added a commit to dcherian/xarray that referenced this issue May 19, 2019

More support for missing_value.

a936205

Fixes pydata#2871

dcherian added a commit to dcherian/xarray that referenced this issue May 19, 2019

More support for missing_value.

c96578f

Fixes pydata#2871

dcherian mentioned this issue May 19, 2019

More support for missing_value. #2973

Merged

3 tasks

dcherian closed this as completed in #2973 Jun 12, 2019

dcherian added a commit that referenced this issue Jun 12, 2019

More support for missing_value. (#2973)

2357851

* More support for missing_value. Fixes #2871 * lint fixes. * Use not equivalent instead of not equals check. * lint fix. * if → elif so we don't call fillna twice * Better fix.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

yt87 commented Apr 5, 2019

shoyer commented Apr 6, 2019

yt87 commented Apr 6, 2019

shoyer commented Apr 7, 2019

yt87 commented Apr 8, 2019

dopplershift commented Jun 7, 2019

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

Comments

yt87 commented Apr 5, 2019

xr.show_versions() INSTALLED VERSIONS

shoyer commented Apr 6, 2019

yt87 commented Apr 6, 2019

shoyer commented Apr 7, 2019

yt87 commented Apr 8, 2019

dopplershift commented Jun 7, 2019

xr.show_versions()
INSTALLED VERSIONS