
reindex doesn't preserve chunks #2745

Closed
@davidbrochart

Description

The following code creates a small (100x100) chunked DataArray, then reindexes it to a huge one (100000x100000):

import xarray as xr
import numpy as np

n = 100
x = np.arange(n)
y = np.arange(n)
da = xr.DataArray(np.zeros((n, n)), coords=[x, y], dims=['x', 'y']).chunk({'x': n, 'y': n})

n2 = 100000
x2 = np.arange(n2)
y2 = np.arange(n2)
da2 = da.reindex({'x': x2, 'y': y2})
da2

But the re-indexed DataArray has chunksize=(100000, 100000) instead of chunksize=(100, 100):

<xarray.DataArray (x: 100000, y: 100000)>
dask.array<shape=(100000, 100000), dtype=float64, chunksize=(100000, 100000)>
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999
  * y        (y) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999

This immediately leads to a memory error when trying to, e.g., store it to a zarr archive:

ds2 = da2.to_dataset(name='foo')
ds2.to_zarr(store='foo', mode='w')

Re-chunking to 100x100 before storing doesn't help either; it just takes much longer before crashing with a memory error:

da3 = da2.chunk({'x': n, 'y': n})
ds3 = da3.to_dataset(name='foo')
ds3.to_zarr(store='foo', mode='w')
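As a possible workaround (a sketch, not what `reindex` does internally), one can pad the underlying dask array directly with `dask.array.pad` and rechunk it lazily, so the enlarged array never materializes a single giant chunk. The smaller target size here is only to keep the example quick:

```python
import numpy as np
import dask.array as dsa
import xarray as xr

n, n2 = 100, 1000  # smaller target than the report, just for speed

da = xr.DataArray(
    np.zeros((n, n)), coords=[np.arange(n), np.arange(n)], dims=['x', 'y']
).chunk({'x': n, 'y': n})

# Pad at the dask level (NaN fill, like reindex), then rechunk to (n, n).
# Both operations are lazy, so no 1000x1000 block is ever held in memory.
padded = dsa.pad(
    da.data,
    ((0, n2 - n), (0, n2 - n)),
    mode='constant',
    constant_values=np.nan,
).rechunk((n, n))

da2 = xr.DataArray(padded, coords=[np.arange(n2), np.arange(n2)], dims=['x', 'y'])
print(da2.data.chunksize)  # (100, 100) rather than one giant chunk
```

This only covers the simple case where the new index extends the old one at the end; a general reindex (reordering or interleaving labels) would need more than a pad.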
