multiple arrays with common nan-shaped dimension #5168

Open
chrisroat opened this issue Apr 16, 2021 · 6 comments

@chrisroat (Contributor)

What happened:

When creating a dataset from two variables with a common dimension, a TypeError is thrown when that dimension's size is nan.

What you expected to happen:

A dataset should be created. I believe dask has an allow_unknown_chunksizes parameter for cases like this -- would that be something that could work here? (Assuming I'm not making a mistake myself.)
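For reference, dask's own `da.concatenate` does take an `allow_unknown_chunksizes` flag; a minimal sketch (using the same `foo` as the example below) of how dask opts in to nan-sized chunks:

```python
import dask
import dask.array as da
import numpy as np

def foo():
    return np.zeros(3)

# Two arrays whose only dimension has unknown (nan) size until computed.
arr0 = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)
arr1 = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)

# Without allow_unknown_chunksizes=True this concatenation raises;
# with it, dask defers the size check until the chunks are computed.
combined = da.concatenate([arr0, arr1], allow_unknown_chunksizes=True)
result = combined.compute()  # a length-6 array of zeros
```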

Minimal Complete Verifiable Example:

import dask
import dask.array as da
import xarray as xr
import numpy as np

def foo():
    return np.zeros(3)

arr0 = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)
arr0_xr = xr.DataArray(arr0, dims=('z',))

arr1 = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)
arr1_xr = xr.DataArray(arr1, dims=('z',))

ds = xr.Dataset({'arr0': arr0_xr, 'arr1': arr0_xr})

Stack trace:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/kitchen_sync/xarray/xarray/core/dataarray.py in _getitem_coord(self, key)
    692         try:
--> 693             var = self._coords[key]
    694         except KeyError:

KeyError: 'z'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-06b01b94eab3> in <module>
      8 arr1_xr = xr.DataArray(arr1, dims=('z',))
      9 
---> 10 ds = xr.Dataset({'arr0': arr0_xr, 'arr1': arr0_xr})

~/kitchen_sync/xarray/xarray/core/dataset.py in __init__(self, data_vars, coords, attrs)
    739             coords = coords.variables
    740 
--> 741         variables, coord_names, dims, indexes, _ = merge_data_and_coords(
    742             data_vars, coords, compat="broadcast_equals"
    743         )

~/kitchen_sync/xarray/xarray/core/merge.py in merge_data_and_coords(data, coords, compat, join)
    465     explicit_coords = coords.keys()
    466     indexes = dict(_extract_indexes_from_coords(coords))
--> 467     return merge_core(
    468         objects, compat, join, explicit_coords=explicit_coords, indexes=indexes
    469     )

~/kitchen_sync/xarray/xarray/core/merge.py in merge_core(objects, compat, join, combine_attrs, priority_arg, explicit_coords, indexes, fill_value)
    608 
    609     coerced = coerce_pandas_values(objects)
--> 610     aligned = deep_align(
    611         coerced, join=join, copy=False, indexes=indexes, fill_value=fill_value
    612     )

~/kitchen_sync/xarray/xarray/core/alignment.py in deep_align(objects, join, copy, indexes, exclude, raise_on_invalid, fill_value)
    422             out.append(variables)
    423 
--> 424     aligned = align(
    425         *targets,
    426         join=join,

~/kitchen_sync/xarray/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    283         for dim in obj.dims:
    284             if dim not in exclude:
--> 285                 all_coords[dim].append(obj.coords[dim])
    286                 try:
    287                     index = obj.indexes[dim]

~/kitchen_sync/xarray/xarray/core/coordinates.py in __getitem__(self, key)
    326 
    327     def __getitem__(self, key: Hashable) -> "DataArray":
--> 328         return self._data._getitem_coord(key)
    329 
    330     def _update_coords(

~/kitchen_sync/xarray/xarray/core/dataarray.py in _getitem_coord(self, key)
    694         except KeyError:
    695             dim_sizes = dict(zip(self.dims, self.shape))
--> 696             _, key, var = _get_virtual_variable(
    697                 self._coords, key, self._level_coords, dim_sizes
    698             )

~/kitchen_sync/xarray/xarray/core/dataset.py in _get_virtual_variable(variables, key, level_vars, dim_sizes)
    146 
    147     if key in dim_sizes:
--> 148         data = pd.Index(range(dim_sizes[key]), name=key)
    149         variable = IndexVariable((key,), data)
    150         return key, key, variable

TypeError: 'float' object cannot be interpreted as an integer

Anything else we need to know?:

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:12:38)
[Clang 11.0.1 ]
python-bits: 64
OS: Darwin
OS-release: 20.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.17.1.dev66+g18ed29e4
pandas: 1.2.4
numpy: 1.20.2
scipy: 1.6.2
netCDF4: 1.5.6
pydap: installed
h5netcdf: 0.10.0
h5py: 3.1.0
Nio: None
zarr: 2.7.0
cftime: 1.4.1
nc_time_axis: 1.2.0
PseudoNetCDF: installed
rasterio: None
cfgrib: 0.9.9.0
iris: 2.4.0
bottleneck: 1.3.2
dask: 2021.04.0
distributed: 2021.04.0
matplotlib: 3.4.1
cartopy: 0.18.0
seaborn: 0.11.1
numbagg: installed
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 20.2.4
conda: None
pytest: 6.2.3
IPython: 7.22.0
sphinx: None

@max-sixty (Collaborator) commented Apr 16, 2021

Currently xarray requires known dimension sizes. Unless anyone has insight into the interaction with dask that I'm not familiar with? Edit: better-informed views below

@keewis (Collaborator) commented Apr 16, 2021

This also came up in #4659 and dask/dask#6058. In #4659 we settled on computing the chunk sizes for now, since supporting unknown chunk sizes seems like a bigger change.
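For context, `dask.array.Array.compute_chunk_sizes` is the "compute the chunk sizes" route; a quick sketch:

```python
import dask
import dask.array as da
import numpy as np

def foo():
    return np.zeros(3)

arr = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)
assert np.isnan(arr.shape[0])  # size is unknown up front

# compute_chunk_sizes evaluates the graph once to learn the real sizes
arr = arr.compute_chunk_sizes()
assert arr.shape == (3,)  # shape is now concrete
```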

@chrisroat (Contributor, Author)

There seems to be some support, but now you have me worried. I have used xarray mainly for labelling, not for much computation -- I'm dropping into dask because I need map_overlap.

FWIW, calling dask.compute(arr) works with unknown chunk sizes, but now I see arr.compute() does not. This fooled me into thinking I could use unknown chunk sizes. Now I see that writing to zarr does not work, either. This might torpedo my current design.

I see the compute_chunk_sizes method, but that seems to trigger computation. I'm running on a dask cluster -- is there anything I can do to salvage the pattern arr_with_nan_shape.to_dataset().to_zarr(compute=False) (with or without xarray)?
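If triggering one computation is acceptable, a sketch of resolving the sizes before the to_dataset()/to_zarr() step (the zarr write itself is left commented out; `store` is a hypothetical target):

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

def foo():
    return np.zeros(3)

arr = da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float)

# Resolve the nan sizes first (this does evaluate the graph once) ...
arr = arr.compute_chunk_sizes()

# ... after which the usual xarray pattern has concrete dimension sizes.
ds = xr.DataArray(arr, dims=('z',), name='arr').to_dataset()
# ds.to_zarr(store, compute=False)  # 'store' is a hypothetical zarr target
```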

@dcherian (Contributor)

I'm not sure about writing to zarr but it seems possible to support nan-sized dimensions when unindexed. We could skip alignment when the dimension is nan-sized for all variables in an Xarray object.

~/kitchen_sync/xarray/xarray/core/alignment.py in align(join, copy, indexes, exclude, fill_value, *objects)
    283         for dim in obj.dims:
    284             if dim not in exclude:
--> 285                 all_coords[dim].append(obj.coords[dim])
    286                 try:
    287                     index = obj.indexes[dim]

For alignment, it may be as easy as adding the name of the nan-sized dimension to exclude.
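At the public API level this corresponds to passing the dimension via xr.align's exclude parameter; a sketch, under the assumption that excluding 'z' skips the failing coordinate lookup entirely:

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

def foo():
    return np.zeros(3)

arr0_xr = xr.DataArray(
    da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float), dims=('z',)
)
arr1_xr = xr.DataArray(
    da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float), dims=('z',)
)

# With 'z' excluded, align never looks up coords along the nan-sized
# dimension, so the TypeError from the report is sidestepped.
aligned0, aligned1 = xr.align(arr0_xr, arr1_xr, exclude=('z',))
```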

@chrisroat (Contributor, Author)

It may run even deeper -- there seem to be several checks on dimension sizes that would need special casing. Even simply doing a variable[dim] lookup fails!
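The variable[dim] failure is reproducible directly; on the version in the traceback above it hits the same pd.Index(range(nan)) call:

```python
import dask
import dask.array as da
import numpy as np
import xarray as xr

def foo():
    return np.zeros(3)

arr = xr.DataArray(
    da.from_delayed(dask.delayed(foo)(), shape=(np.nan,), dtype=float), dims=('z',)
)

# Looking up the dimension coordinate fails because xarray tries to build
# a default index for 'z', and range(nan) is not valid.
err = None
try:
    arr['z']
except Exception as exc:  # TypeError on the version in the report
    err = exc
```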

@dcherian (Contributor)

Related: #2801
