Skip to content

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
yt87 opened this issue Apr 5, 2019 · 5 comments · Fixed by #2973
Closed

xr.open_dataset(f1).to_netcdf(file2) is not idempotent #2871

yt87 opened this issue Apr 5, 2019 · 5 comments · Fixed by #2973

Comments

@yt87
Copy link

yt87 commented Apr 5, 2019

Here is the original (much truncated) file.

> ncdump ak.nc
netcdf ak {
dimensions:
        npts = UNLIMITED ; // (2 currently)
        ntimes = 4 ;
variables:
        short tmpk(npts, ntimes) ;
                tmpk:description = "2 m Temperature - closest to top of hour" ;
                tmpk:units = "K" ;
                tmpk:level = "2 m" ;
                tmpk:period_variable = "ntimes1" ;
                tmpk:missing_value = -9999s ;
                tmpk:scale_factor = 0.01 ;

// global attributes:
                :source = "ak-obs" ;
data:

 tmpk =
  26915, 27755, -9999, 27705,
  25595, -9999, 28315, -9999 ;
}

Python code:

ds = xr.open_dataset('ak.nc')
ds.to_netcdf('akbad.nc')
ds
ds['tmpk']

<xarray.Dataset>
Dimensions:  (npts: 2, ntimes: 4)
Dimensions without coordinates: npts, ntimes
Data variables:
    tmpk     (npts, ntimes) float32 ...
Attributes:
    source:   ak-obs

<xarray.DataArray 'tmpk' (npts: 2, ntimes: 4)>
array([[269.15, 277.55,    nan, 277.05],
       [255.95,    nan, 283.15,    nan]], dtype=float32)
Dimensions without coordinates: npts, ntimes
Attributes:
    description:      2 m Temperature - closest to top of hour
    units:            K
    level:            2 m
    period_variable:  ntimes1

File written to disk:

> ncdump akbad.nc
netcdf akbad {
dimensions:
        npts = UNLIMITED ; // (2 currently)
        ntimes = 4 ;
variables:
        short tmpk(npts, ntimes) ;
                tmpk:description = "2 m Temperature - closest to top of hour" ;
                tmpk:units = "K" ;
                tmpk:level = "2 m" ;
                tmpk:period_variable = "ntimes1" ;
                tmpk:scale_factor = 0.01 ;

// global attributes:
                :source = "ak-obs" ;
data:

 tmpk =
  26915, 27755, 0, 27705,
  25595, 0, 28315, 0 ;
}

To confuse matter more, I am getting a warning:

SerializationWarning: saving variable tmpk with floating point data as an integer dtype without any _FillValue to use for NaNs

This might make sense, since tmpk after decoding has datatype float32, however somehow the original variable type and scale_factor are preserved, but missing_value attribute disappears and value written back is wrong: 0.
I want to write back tmpk as float number and let zlib worry about disk space.

xr.show_versions()
INSTALLED VERSIONS

commit: None
python: 3.7.1 | packaged by conda-forge | (default, Feb 18 2019, 01:42:00)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 3.10.0-957.5.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.4
libnetcdf: 4.6.2

xarray: 0.12.0
pandas: 0.24.2
numpy: 1.15.4
scipy: 1.2.1
netCDF4: 1.5.0
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.0.3.4
nc_time_axis: None
PseudonetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 1.1.4
distributed: 1.26.0
matplotlib: 3.0.3
cartopy: 0.17.0
seaborn: 0.9.0
setuptools: 40.8.0
pip: 19.0.3
conda: 4.6.10
pytest: 4.3.1
IPython: 7.4.0
sphinx: 1.8.5

@shoyer
Copy link
Member

shoyer commented Apr 6, 2019

I think this could be fixed by making CFMaskCoder handle missing_value and _FillValue in a symmetric way in encoding/decoding:

class CFMaskCoder(VariableCoder):

Right now, it only uses missing_value in the decode() method.

@yt87
Copy link
Author

yt87 commented Apr 6, 2019

Indeed it works. Thanks. My quick fix:

$ diff variables.py variables.py.orig
152,155d151
<         elif encoding.get('missing_value') is not None:
<             fill_value = pop_to(encoding, attrs, 'missing_value', name=name)
<             if not pd.isnull(fill_value):
<                 data = duck_array_ops.fillna(data, fill_value)

I also figured out how to write back floating point values: encoding=None means use existing values,
so specifying encoding={'tmpk': {}} in to_netcdf() did the trick. Should there be an option for this? What you see on the screen is not what you get in the file.

@shoyer
Copy link
Member

shoyer commented Apr 7, 2019

Indeed it works. Thanks. My quick fix:

$ diff variables.py variables.py.orig
152,155d151
<         elif encoding.get('missing_value') is not None:
<             fill_value = pop_to(encoding, attrs, 'missing_value', name=name)
<             if not pd.isnull(fill_value):
<                 data = duck_array_ops.fillna(data, fill_value)

I also figured out how to write back floating point values: encoding=None means use existing values,
so specifying encoding={'tmpk': {}} in to_netcdf() did the trick. Should there be an option for this? What you see on the screen is not what you get in the file.

Yes, this is a challenging design problem and the current state of affairs is definitely not ideal. If you have ideas about how to reveal the presence of an encoding dict associated with a variable without overwhelming the user with information those would be very welcome. The alternative would be to not keep around encoding at all, but I think users do appreciate that reading/writing a netCDF file generally returns the exact same thing.

@yt87
Copy link
Author

yt87 commented Apr 8, 2019

After rethinking the issue, I would drop it: one can simply pass ds.fromkeys(ds.data_vars.keys(), {}) as the encoding attribute.
Going back to the original problem. The fix above is not enough, the SerializationWarning is still present. An alternative, provided that missing_value attribute is still considered deprecated: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.1/build/cf-conventions.html#missing-data, would be to replace it by _FillValue on decoding:

$ diff variables.py variables.py.orig 
179,180d178
<             if 'FillValue' not in encoding:
<                 encoding['_FillValue'] = encoding.pop('missing_value')``

dcherian added a commit to dcherian/xarray that referenced this issue May 19, 2019
dcherian added a commit to dcherian/xarray that referenced this issue May 19, 2019
@dopplershift
Copy link
Contributor

Just to correct something here, missing_value is no longer considered deprecated in the current version of the standard: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html

Near the bottom is a specific note about removing the deprecation.

dcherian added a commit that referenced this issue Jun 12, 2019
* More support for missing_value.

Fixes #2871

* lint fixes.

* Use not equivalent instead of not equals check.

* lint fix.

* if → elif so we don't call fillna twice

* Better fix.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants