Unexpected type conversion in variables with _FillValue #6055

jp-dark · 2021-12-09T16:26:54Z

What happened:
When opening a dataset with an int16 variable with the _FillValue attribute, the variable is converted from type int16 to float32. This was originally reported to the TileDB-CF-Py Git repo that contains a TileDB backend for xarray. See TileDB-CF-Py issue #117.

What you expected to happen:
I would expect the type to remain the same when applying the _FillValue.

Minimal Complete Verifiable Example:

Original example from TileDB-CF-Py issue #117 using the TileDB backend.

import tiledb
import xarray as xr
import numpy as np

index = tiledb.Dim(name='index', domain=(0, 3))
domain = tiledb.Domain(index)
var = tiledb.Attr(name='var', dtype=np.int16)
schema = tiledb.ArraySchema(domain=domain, attrs=[var], sparse=False)
tiledb.Array.create('dense_array0', schema)

with tiledb.open('dense_array0', 'w') as A:
    A[:] = np.array([5, 6, 7, 8], dtype=np.int16)

ds = xr.open_dataset('dense_array0', engine='tiledb')
ds['var'].dtype

NetCDF example with the same behavior:

import netCDF4
import xarray as xr
import numpy as np

filename = 'temp_file.nc'
with netCDF4.Dataset(filename, mode="w") as group:
    group.createDimension("index", 4)
    var = group.createVariable("var", np.int16, ("index",), fill_value=-1)
    var[:] = np.array([5, 6, 7, 8], dtype=np.int16)
dataset = xr.open_dataset(filename)
dataset["var"].dtype

Anything else we need to know?:

I was able to verify the type conversion from int16 to float32 occurs in the conventions.decode_cf_variables call in the open_dataset method of StoreBackendEntrypoint.
I was able to verify the conversion does not happen if mask_and_scale=False.
Note that TileDB is automatically setting a fill value for all dense numerical arrays, and so we are always setting the _FillValue attribute for variables from the TileDB backend.
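
The mask_and_scale observation can also be checked without any file or backend. The following is a minimal sketch, assuming an in-memory Dataset is a fair stand-in for what a backend hands to the CF decoder; the fill value of -1 is only illustrative and mirrors the netCDF example above.

```python
import numpy as np
import xarray as xr

# In-memory stand-in for a variable read from a backend: int16 data plus a
# _FillValue attribute (assumption: -1, as in the netCDF example above).
raw = xr.Dataset(
    {
        "var": (
            "index",
            np.array([5, 6, 7, 8], dtype=np.int16),
            {"_FillValue": np.int16(-1)},
        )
    }
)

masked = xr.decode_cf(raw)                          # default: mask_and_scale=True
unmasked = xr.decode_cf(raw, mask_and_scale=False)  # skip masking and scaling

print(masked["var"].dtype)    # a floating dtype (float32 in this report)
print(unmasked["var"].dtype)  # int16
```

With mask_and_scale=False the _FillValue stays in the variable's attrs, so the caller is responsible for handling fill values.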

Environment:
I was able to reproduce this with both xarray 0.19.0 and 0.20.1


dcherian · 2021-12-09T19:53:12Z

I think this might be unfixable.

The problem is that xarray represents both missing_value and _FillValue as np.nan, which integer dtypes cannot store. So if _FillValue is present, we promote to a floating type and then apply a mask:

xarray/xarray/coding/variables.py, lines 133 to 141 in a923833:

def _apply_mask(
    data: np.ndarray, encoded_fill_values: list, decoded_fill_value: Any, dtype: Any
) -> np.ndarray:
    """Mask all matching values in a NumPy arrays."""
    data = np.asarray(data, dtype=dtype)
    condition = False
    for fv in encoded_fill_values:
        condition |= data == fv
    return np.where(condition, decoded_fill_value, data)

Setting mask_and_scale=False seems like the best option, though maybe we should split that up since masking and scaling are independent.
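
To make the promote-then-mask step concrete, here is a NumPy-only sketch. It is not xarray's code: promote_for_nan and mask_fill_values are hypothetical helpers, and the promotion rule (integers of at most 2 bytes go to float32, wider integers to float64) is an assumption chosen to match the int16 to float32 result reported in this issue.

```python
import numpy as np


def promote_for_nan(dtype):
    """Hypothetical helper: pick a dtype that can hold NaN.

    Assumption: integers of at most 2 bytes go to float32 and wider integers
    to float64, which matches the int16 -> float32 result reported here.
    """
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.integer):
        return np.dtype(np.float32) if dtype.itemsize <= 2 else np.dtype(np.float64)
    return dtype


def mask_fill_values(data, fill_value):
    """Replace entries equal to fill_value with NaN after promoting the dtype."""
    condition = data == fill_value                    # find fill values first
    out = data.astype(promote_for_nan(data.dtype))    # integers cannot store NaN
    out[condition] = np.nan
    return out


encoded = np.array([5, 6, -1, 8], dtype=np.int16)
decoded = mask_fill_values(encoded, fill_value=-1)
print(decoded.dtype)  # float32
```

The original integer dtype cannot survive this step, because there is no integer bit pattern that means "missing".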

itcarroll · 2021-12-10T14:59:29Z

@dcherian Chiming in as the author of TileDB-Inc/TileDB-CF-Py#117. To help ensure the tiledb backend matches the behavior of xr.open_dataset for netCDF files, can you help me understand why the promotion to float does NOT occur in the following case:

import netCDF4
import xarray as xr
import numpy as np

filename = 'temp_file.nc'
with netCDF4.Dataset(filename, mode="w") as group:
    group.createDimension("index", 4)
    var = group.createVariable("var", np.int16, ("index",))
    var[0:3] = np.array([5, 6, 7], dtype=np.int16)
dataset = xr.open_dataset(filename)
dataset["var"].dtype

Note that netCDF4.default_fillvalues['i2'] is the value found at dataset["var"][3], which was never explicitly written. Here xarray seems to ignore the fill value.

Update

Okay, I think I understand why. In createVariable, whether I use the default fill_value=None or set fill_value=False, xarray does not include _FillValue in dataset["var"].encoding. That, I think, is a bug. If you print(var) with fill_value=None, you'll see that netCDF is setting the _FillValue attribute. I don't think xarray should ignore it.

itcarroll · 2022-01-22T16:45:52Z

For future searchers: @jp-dark just added a feature to the upcoming release of the tiledb backend, an encode_fill argument that controls whether the _FillValue metadata is set in xarray. With that, xarray's mask_and_scale works as documented.

kmuehlbauer · 2023-09-13T12:40:13Z

Well aged issue here, adding some more details.

> Okay, I think I understand why. In createVariable, whether I use the default fill_value=None or set fill_value=False, xarray does not include _FillValue in dataset["var"].encoding. That, I think, is a bug. If you print(var) with fill_value=None, you'll see that netCDF is setting the _FillValue attribute. I don't think xarray should ignore it.

createVariable with fill_value=None activates the default _FillValue for unwritten parts of the data, but no _FillValue attribute is attached:

  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
  unlimited dimensions: 
  current shape = (4,)
  filling on, default _FillValue of -32767 used
  [5 6 7 --]

createVariable with fill_value=False deactivates filling; unwritten parts of the data are set to 0, and no _FillValue attribute is attached:

  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
  unlimited dimensions: 
  current shape = (4,)
  filling off
  [5 6 7 0]

createVariable with fill_value=20 activates filling; unwritten parts of the data are set to 20, and a _FillValue attribute is attached:

  <class 'netCDF4._netCDF4.Variable'>
  int16 var(index)
      _FillValue: 20
  unlimited dimensions: 
  current shape = (4,)
  filling on
  [5 6 7 --]

Only in case 3 does xarray apply CF masking, because the _FillValue attribute is present. There have been discussions about whether default _FillValues should be taken into account:

#2478
#2374
#2742
#5680

Closing this issue, please comment on #2742.
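
The difference between an implicit default fill and an explicit _FillValue attribute can be reproduced in memory. This is a rough sketch, assuming xr.decode_cf on a hand-built Dataset behaves like the decoding step of open_dataset; for simplicity it reuses the int16 default fill of -32767 for the explicit-attribute case rather than the value 20 used above.

```python
import numpy as np
import xarray as xr

# -32767 is the netCDF default fill value for int16 shown in the first printout.
values = np.array([5, 6, 7, -32767], dtype=np.int16)

# No _FillValue attribute attached (as with fill_value=None or fill_value=False).
no_attr = xr.decode_cf(xr.Dataset({"var": ("index", values)}))

# Explicit _FillValue attribute attached (as with fill_value=<number>).
with_attr = xr.decode_cf(
    xr.Dataset({"var": ("index", values, {"_FillValue": np.int16(-32767)})})
)

print(no_attr["var"].dtype)                  # int16, the fill is kept as data
print(int(with_attr["var"].isnull().sum()))  # 1
```

Without the attribute, the default fill value passes through as an ordinary integer and the dtype is preserved; with it, the value is masked and the variable is promoted to a float.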

jp-dark mentioned this issue Dec 9, 2021

dtype not preserved on round trip with xarray TileDB-Inc/TileDB-CF-Py#117

Closed

dcherian added topic-backends topic-CF conventions labels Dec 9, 2021

kmuehlbauer closed this as completed Sep 13, 2023
