Skip to content

Unexpected type conversion in variables with _FillValue #6055

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jp-dark opened this issue Dec 9, 2021 · 4 comments
Closed

Unexpected type conversion in variables with _FillValue #6055

jp-dark opened this issue Dec 9, 2021 · 4 comments

Comments

@jp-dark
Copy link
Contributor

jp-dark commented Dec 9, 2021

What happened:
When opening a dataset with an int16 variable with the _FillValue attribute, the variable is converted from type int16 to float32. This was originally reported to the TileDB-CF-Py Git repo that contains a TileDB backend for xarray. See TileDB-CF-Py issue #117.

What you expected to happen:
I would expect the type to remain the same when applying the _FillValue.

Minimal Complete Verifiable Example:

Original example from TileDB-CF-Py issue #117 using the TileDB backend.

import tiledb
import xarray as xr
import numpy as np

index = tiledb.Dim(name='index', domain=(0, 3))
domain = tiledb.Domain(index)
var = tiledb.Attr(name='var', dtype=np.int16)
schema = tiledb.ArraySchema(domain=domain, attrs=[var], sparse=False)
tiledb.Array.create('dense_array0', schema)

with tiledb.open('dense_array0', 'w') as A:
    A[:] = np.array([5, 6, 7, 8], dtype=np.int16)

ds = xr.open_dataset('dense_array0', engine='tiledb')
ds['var'].dtype

NetCDF example with the same behavior:

import netCDF4
import xarray  as xr
import numpy as np

filename = 'temp_file.nc'
with netCDF4.Dataset(filename, mode="w") as group:
    group.createDimension("index", 4)
    var = group.createVariable("var", np.int16, ("index",), fill_value=-1)
    var[:] = np.array([5, 6, 7, 8], dtype=np.int16)
dataset = xr.open_dataset(filename)
dataset["var"].dtype

Anything else we need to know?:

  • I was able to verify the type conversion from int16 to float32 occurs in the conventions.decode_cf_variables call in the open_dataset method of StoreBackendEntrypoint.
  • I was able to verify the conversion does not happen if mask_and_scale=False.
  • Note that TileDB is automatically setting a fill value for all dense numerical arrays, and so we are always setting the _FillValue attribute for variables from the TileDB backend.

Environment:
I was able to reproduce this with both xarray 0.19.0 and 0.20.1

@dcherian
Copy link
Contributor

dcherian commented Dec 9, 2021

I think this might be unfixable.

The problem is that xarray represents both missing_value and _FillValue by np.nan. So if _FillValue is present, we promote to a floating type and then apply a mask

def _apply_mask(
data: np.ndarray, encoded_fill_values: list, decoded_fill_value: Any, dtype: Any
) -> np.ndarray:
"""Mask all matching values in a NumPy arrays."""
data = np.asarray(data, dtype=dtype)
condition = False
for fv in encoded_fill_values:
condition |= data == fv
return np.where(condition, decoded_fill_value, data)

Setting mask_and_scale=False seems like the best option, though maybe we should split that up since masking and scaling are independent.

@itcarroll
Copy link
Contributor

itcarroll commented Dec 10, 2021

@dcherian Chiming in as the author of TileDB-Inc/TileDB-CF-Py#117. To help ensure the tiledb backend matches the behavior of xr.open_dataset for netCDF files, can you help me understand why the promotion to float does NOT occur in the following case:

import netCDF4
import xarray  as xr
import numpy as np

filename = 'temp_file.nc'
with netCDF4.Dataset(filename, mode="w") as group:
    group.createDimension("index", 4)
    var = group.createVariable("var", np.int16, ("index",))
    var[0:3] = np.array([5, 6, 7], dtype=np.int16)
dataset = xr.open_dataset(filename)
dataset["var"].dtype

Note that netCDF.default_fillvalues['i2'] is the value found at dataset["var"][3], which was never explicitly written. Here XArray seems to ignore the fill value.

Update

Okay, I think I understand why. In createVariable whether I use the default fill_value=None or set fill_value=False, XArray does not include _FillValue in dataset["var"].encoding. That, I think, is a bug. If you print(var) with fill_value=None, you'll see that NetCDF is setting the _FIllValue attribute. I don't think XArray should ignore it.

@itcarroll
Copy link
Contributor

For future searchers: @jp-dark just added a feature to the upcoming release of the tiledb backend introducing an argument encode_fill to set (or not) the _FillValue metadata in XArray. Then the XArray mask_and_scale works as documented.

@kmuehlbauer
Copy link
Contributor

Well aged issue here, adding some more details.

Okay, I think I understand why. In createVariable whether I use the default fill_value=None or set fill_value=False, XArray does not include _FillValue in dataset["var"].encoding. That, I think, is a bug. If you print(var) with fill_value=None, you'll see that NetCDF is setting the _FIllValue attribute. I don't think XArray should ignore it.

  1. createVariable with fill_value=None, will activate default _FillValue for unwritten parts of the data, but no _FillValue attribute is attached
  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
  unlimited dimensions: 
  current shape = (4,)
  filling on, default _FillValue of -32767 used
  [5 6 7 --]
  1. createVariable with fill_value=False, will deactive filling, unwritten parts of the data are set to 0, no _FillValue attribute is attached
  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
  unlimited dimensions: 
  current shape = (4,)
  filling off
  [5 6 7 0]
  1. createVariable with fill_value=20, will activate filling, unwritten parts of the data are set to 20, _FillValue attribute is attached
  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
      _FillValue: 20
  unlimited dimensions: 
  current shape = (4,)
  filling on
  [5 6 7 --]

Only in case 3 xarray applies the CF-Masking as the _FillValue-attribute is present. There have been discussions if default _FillValues should be taken into account.

#2478
#2374
#2742
#5680

Closing this issue, please comment on #2742.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants