reindex doesn't preserve chunks #2745

Closed
davidbrochart (Contributor) opened this issue Feb 5, 2019 · 2 comments

The following code creates a small (100x100) chunked DataArray, and then re-indexes it into a huge one (100000x100000):

import xarray as xr
import numpy as np

n = 100
x = np.arange(n)
y = np.arange(n)
da = xr.DataArray(np.zeros(n*n).reshape(n, n), coords=[x, y], dims=['x', 'y']).chunk({'x': n, 'y': n})

n2 = 100000
x2 = np.arange(n2)
y2 = np.arange(n2)
da2 = da.reindex({'x': x2, 'y': y2})
da2

But the re-indexed DataArray has chunksize=(100000, 100000) instead of chunksize=(100, 100):

<xarray.DataArray (x: 100000, y: 100000)>
dask.array<shape=(100000, 100000), dtype=float64, chunksize=(100000, 100000)>
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999
  * y        (y) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999

This immediately leads to a memory error when trying to, for example, store it to a zarr archive:

ds2 = da2.to_dataset(name='foo')
ds2.to_zarr(store='foo', mode='w')

Re-chunking to 100x100 before storing doesn't help either; it just takes much longer before crashing with a memory error:

da3 = da2.chunk({'x': n, 'y': n})
ds3 = da3.to_dataset(name='foo')
ds3.to_zarr(store='foo', mode='w')
@shoyer (Member) commented Feb 5, 2019

To understand what's happening here, it may be helpful to look at what's going on inside dask:

In [16]: x = np.arange(5)

In [17]: da = xr.DataArray(np.ones(5), coords=[('x', x)]).chunk(-1)

In [18]: da
Out[18]:
<xarray.DataArray (x: 5)>
dask.array<shape=(5,), dtype=float64, chunksize=(5,)>
Coordinates:
  * x        (x) int64 0 1 2 3 4

In [19]: da.reindex({'x': np.arange(20)})
Out[19]:
<xarray.DataArray (x: 20)>
dask.array<shape=(20,), dtype=float64, chunksize=(20,)>
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

In [20]: da.reindex({'x': np.arange(20)}).data.dask
Out[20]: <dask.sharedict.ShareDict at 0x3201d72e8>

In [21]: dict(da.reindex({'x': np.arange(20)}).data.dask)
Out[21]:
{('where-8e0018fae0773d202c09fde132189347', 0): (subgraph_callable,
  ('eq-8167293bb8136be2934a8bf111095d8f', 0),
  array(nan),
  ('getitem-0eab360ba0dee5a5c3fbded0fdfd70e3', 0)),
 ('eq-8167293bb8136be2934a8bf111095d8f', 0): (subgraph_callable,
  ('array-5ddc8bae2e6cf87c0bac846c6da4d27f', 0),
  -1),
 ('array-5ddc8bae2e6cf87c0bac846c6da4d27f',
  0): array([ 0,  1,  2,  3,  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
        -1, -1, -1]),
 ('xarray-<this-array>-765734894ab8f05a57335ea1064da549',
  0): (<function dask.array.core.getter(a, b, asarray=True, lock=None)>, 'xarray-<this-array>-765734894ab8f05a57335ea1064da549', (slice(0, 5, None),)),
 'xarray-<this-array>-765734894ab8f05a57335ea1064da549': ImplicitToExplicitIndexingAdapter(array=NumpyIndexingAdapter(array=array([1., 1., 1., 1., 1.]))),
 ('getitem-0eab360ba0dee5a5c3fbded0fdfd70e3',
  0): (<function _operator.getitem(a, b, /)>, ('xarray-<this-array>-765734894ab8f05a57335ea1064da549',
   0), (array([0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]),))}

Xarray isn't controlling chunk sizes directly, but it turns reindex({'x': x2}) into an indexing operation like data[np.array([0, 1, 2, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4])] under the covers, with the last element repeated -- in your case 100000 times. See xarray.Variable._getitem_with_mask for the implementation on the xarray side.
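
Here is a minimal sketch of that clamp-then-mask indexing using plain dask.array (purely illustrative, not xarray's actual code; the variable names are made up):

import dask.array as dsa
import numpy as np

data = dsa.ones(5, chunks=5)

# reindexing onto np.arange(20) builds an integer indexer in which missing
# labels are marked with -1
indexer = np.array([0, 1, 2, 3, 4] + [-1] * 15)

# the -1 entries are clamped to the last valid position for the indexing step...
clamped = np.where(indexer == -1, data.shape[0] - 1, indexer)
# ...and masked with the fill value (NaN) afterwards
result = dsa.where(indexer == -1, np.nan, data[clamped])

print(result.chunks)  # ((20,),): a single chunk, mirroring the huge chunk above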

The alternative design would be to append an array of all NaNs along one axis, but on average I think the current implementation is faster and results in more contiguous chunks -- it's quite common to intersperse missing indices with reindex(), and alternating indexed/missing values can result in tiny chunks. Even then I think you would probably run into performance issues -- I don't think dask.array.full() uses a default chunk size.
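
For comparison, a rough sketch of that pad-with-NaN alternative in plain dask.array (again illustrative; chunks are passed to full() explicitly here to sidestep the default-chunking question):

import dask.array as dsa
import numpy as np

data = dsa.ones(5, chunks=5)
n_missing = 15

# build the block of fill values and append it along the axis
padding = dsa.full(n_missing, np.nan, chunks=n_missing)
padded = dsa.concatenate([data, padding])

print(padded.chunks)  # ((5, 15),): contiguous here, but interspersed missing
                      # labels would instead produce many tiny chunks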

We could also conceivably put some heuristics to control chunking for this in xarray, but I'd rather do it upstream in dask.array, if possible (xarray tries to avoid thinking about chunks).

@phofl (Contributor) commented Aug 14, 2024

This was fixed with the latest Dask release 🥳
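
As a quick check against a recent dask/xarray install (illustrative; the exact chunk sizes depend on the dask version and its chunk-size configuration), rerunning the original example should no longer produce a single (100000, 100000) chunk:

import numpy as np
import xarray as xr

n, n2 = 100, 100000
da = xr.DataArray(np.zeros((n, n)), coords=[np.arange(n), np.arange(n)],
                  dims=['x', 'y']).chunk({'x': n, 'y': n})
da2 = da.reindex({'x': np.arange(n2), 'y': np.arange(n2)})
print(da2.chunksizes)  # expect many moderately sized chunks rather than one huge one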
