Skip to content

DataSet.chunk and DataArray.chunk handling object coordinates differently #9040

Open
@jeremiah-corrado

Description

@jeremiah-corrado

What happened?

I've been working on creating a ChunkManagerEntrypoint class for Arkouda to interoperate with XArray.

In experimenting with this, I've reimplemented this example, applying .chunk(chunked_array_type="arkouda") both when creating the initial dataset:

ds = xr.tutorial.open_dataset("rasm").load().chunk(chunked_array_type="arkouda")

and doing the step to "calculate the weights by grouping by 'time.season'":

weights = (
  month_length.groupby("time.season") / month_length.groupby("time.season").sum()
).chunk(chunked_array_type="arkouda")

My intention is for the remainder of the code to be exactly the same as the example.

For ds, the chunking operation worked as I expected, where all the numerical arrays were converted to Arkouda Arrays, and the time array was left in its original (numpy array of objects) format:

image

However, when trying to do the same chunking operation with weights (a DataArray), XArray attempts to create an Arkouda array from the time coordinate, resulting in an error.

As a workaround, I can manually create the DataArray, building an arkouda array from weights.data, and copying the coords unchanged:

from arkouda import array_api as xp

weights = month_length.groupby("time.season") / month_length.groupby("time.season").sum()

weights = xr.DataArray(
  xp.asarray(weights.data),
  coords=weights.coords,
)

What did you expect to happen?

I'm not sure whether should be a bug or a feature request, but it would be an improvement for this use-case if DataArray.chunk was able to leave object coordinates in their original format instead of attempting to convert them to the chunked array type. I.e., essentially doing what the manual workaround above does.

Minimal Complete Verifiable Example

No response

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[6], line 16
      6 weights
      8 # # Test that the sum of the weights for each season is 1.0
      9 # np.testing.assert_allclose(weights.groupby("time.season").sum().values, np.ones(4))
     10 
   (...)
     13 # #     coords=weights.coords,
     14 # # )
---> 16 weights = weights.chunk(chunked_array_type="arkouda")
     18 # print(weights.data)
     19 
     20 # ds_weighted = (ds * weightsAk).groupby("time.season").sum(dim="time")
     21 # ds_weighted

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/util/deprecation_helpers.py:115, in _deprecate_positional_args.<locals>._decorator.<locals>.inner(*args, **kwargs)
    111     kwargs.update({name: arg for name, arg in zip_args})
    113     return func(*args[:-n_extra_args], **kwargs)
--> 115 return func(*args, **kwargs)

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/core/dataarray.py:1419, in DataArray.chunk(self, chunks, name_prefix, token, lock, inline_array, chunked_array_type, from_array_kwargs, **chunks_kwargs)
   1416 else:
   1417     chunks = either_dict_or_kwargs(chunks, chunks_kwargs, "chunk")
-> 1419 ds = self._to_temp_dataset().chunk(
   1420     chunks,
   1421     name_prefix=name_prefix,
   1422     token=token,
   1423     lock=lock,
   1424     inline_array=inline_array,
   1425     chunked_array_type=chunked_array_type,
   1426     from_array_kwargs=from_array_kwargs,
   1427 )
   1428 return self._from_temp_dataset(ds)

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/core/dataset.py:2733, in Dataset.chunk(self, chunks, name_prefix, token, lock, inline_array, chunked_array_type, from_array_kwargs, **chunks_kwargs)
   2730 if from_array_kwargs is None:
   2731     from_array_kwargs = {}
-> 2733 variables = {
   2734     k: _maybe_chunk(
   2735         k,
   2736         v,
   2737         chunks_mapping,
   2738         token,
   2739         lock,
   2740         name_prefix,
   2741         inline_array=inline_array,
   2742         chunked_array_type=chunkmanager,
   2743         from_array_kwargs=from_array_kwargs.copy(),
   2744     )
   2745     for k, v in self.variables.items()
   2746 }
   2747 return self._replace(variables)

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/core/dataset.py:2734, in <dictcomp>(.0)
   2730 if from_array_kwargs is None:
   2731     from_array_kwargs = {}
   2733 variables = {
-> 2734     k: _maybe_chunk(
   2735         k,
   2736         v,
   2737         chunks_mapping,
   2738         token,
   2739         lock,
   2740         name_prefix,
   2741         inline_array=inline_array,
   2742         chunked_array_type=chunkmanager,
   2743         from_array_kwargs=from_array_kwargs.copy(),
   2744     )
   2745     for k, v in self.variables.items()
   2746 }
   2747 return self._replace(variables)

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/core/dataset.py:321, in _maybe_chunk(name, var, chunks, token, lock, name_prefix, overwrite_encoded_chunks, inline_array, chunked_array_type, from_array_kwargs)
    312     name2 = f"{name_prefix}{name}-{token2}"
    314     from_array_kwargs = utils.consolidate_dask_from_array_kwargs(
    315         from_array_kwargs,
    316         name=name2,
    317         lock=lock,
    318         inline_array=inline_array,
    319     )
--> 321 var = var.chunk(
    322     chunks,
    323     chunked_array_type=chunked_array_type,
    324     from_array_kwargs=from_array_kwargs,
    325 )
    327 if overwrite_encoded_chunks and var.chunks is not None:
    328     var.encoding["chunks"] = tuple(x[0] for x in var.chunks)

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/core/variable.py:2598, in Variable.chunk(self, chunks, name, lock, inline_array, chunked_array_type, from_array_kwargs, **chunks_kwargs)
   2590 # TODO deprecate passing these dask-specific arguments explicitly. In future just pass everything via from_array_kwargs
   2591 _from_array_kwargs = consolidate_dask_from_array_kwargs(
   2592     from_array_kwargs,
   2593     name=name,
   2594     lock=lock,
   2595     inline_array=inline_array,
   2596 )
-> 2598 return super().chunk(
   2599     chunks=chunks,
   2600     chunked_array_type=chunked_array_type,
   2601     from_array_kwargs=_from_array_kwargs,
   2602     **chunks_kwargs,
   2603 )

File ~/anaconda3/envs/xarray-tests/lib/python3.10/site-packages/xarray/namedarray/core.py:821, in NamedArray.chunk(self, chunks, chunked_array_type, from_array_kwargs, **chunks_kwargs)
    818     if is_dict_like(chunks):
    819         chunks = tuple(chunks.get(n, s) for n, s in enumerate(ndata.shape))  # type: ignore[assignment]
--> 821     data_chunked = chunkmanager.from_array(ndata, chunks, **from_array_kwargs)  # type: ignore[arg-type]
    823 return self._replace(data=data_chunked)

File ~/.../arkouda_xarray/arkoudamanager.py:55, in ArkoudaManager.from_array(self, data, chunks, **kwargs)
     53 print("first elem: ", data[0])
     54 print("type of first elem: ", type(data[0]))
---> 55 arr = aa.zeros(data.shape, dtype=data.dtype)
     57 # copy data into array one chunk at a time
     58 n_chunks_per_dim = [data.shape[i] // chunk_sizes[i] for i in range(data.ndim)]



File ~/.../arkouda/array_api/_creation_functions.py:330, in zeros(shape, dtype, device)
    327 dtype = akdtype(dtype)  # normalize dtype
    328 dtype_name = cast(np.dtype, dtype).name
--> 330 repMsg = generic_msg(
    331     cmd=f"create{ndim}D",
    332     args={
    333         "dtype": dtype_name,
    334         "shape": shape,
    335         "value": 0,
    336     },
    337 )
    339 return Array._new(create_pdarray(repMsg))

File ~/.../arkouda/client.py:1006, in generic_msg(cmd, args, payload, send_binary, recv_binary)
   1004     else:
   1005         assert payload is None
-> 1006         return cast(Channel, channel).send_string_message(
   1007             cmd=cmd, args=msg_args, size=size, recv_binary=recv_binary
   1008         )
   1009 except KeyboardInterrupt as e:
   1010     # if the user interrupts during command execution, the socket gets out
   1011     # of sync reset the socket before raising the interrupt exception
   1012     cast(Channel, channel).connect(timeout=0)

File ~/.../arkouda/client.py:529, in ZmqChannel.send_string_message(self, cmd, recv_binary, args, size, request_id)
    527 # raise errors or warnings sent back from the server
    528 if return_message.msgType == MessageType.ERROR:
--> 529     raise RuntimeError(return_message.msg)
    530 elif return_message.msgType == MessageType.WARNING:
    531     warnings.warn(return_message.msg)

RuntimeError: Error: server not configured to support 'undef' (in createMsg). Please update the configuration and recompile.

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 22.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.5.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.0
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 2.18.0
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: 3.9.0
bottleneck: 1.3.8
dask: 2024.5.0
distributed: 2024.5.0
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.3.1
cupy: None
pint: None
sparse: 0.15.1
flox: 0.9.7
numpy_groupies: 0.11.1
setuptools: 69.5.1
pip: 24.0
conda: None
pytest: 8.2.0
mypy: None
IPython: 8.24.0
sphinx: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugtopic-chunked-arraysManaging different chunked backends, e.g. dask

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions