Description
The following code creates a small (100x100) chunked DataArray, and then re-indexes it into a huge one (100000x100000):
import xarray as xr
import numpy as np
n = 100
x = np.arange(n)
y = np.arange(n)
da = xr.DataArray(np.zeros((n, n)), coords=[x, y], dims=['x', 'y']).chunk({'x': n, 'y': n})
n2 = 100000
x2 = np.arange(n2)
y2 = np.arange(n2)
da2 = da.reindex({'x': x2, 'y': y2})
da2
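For reference, the original array's chunking can be verified before re-indexing; .chunks is a standard attribute on dask-backed DataArrays:
print(da.chunks)  # ((100,), (100,)) -- a single 100x100 chunk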
But the re-indexed DataArray has chunksize=(100000, 100000) instead of chunksize=(100, 100):
<xarray.DataArray (x: 100000, y: 100000)>
dask.array<shape=(100000, 100000), dtype=float64, chunksize=(100000, 100000)>
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999
* y (y) int64 0 1 2 3 4 5 6 ... 99994 99995 99996 99997 99998 99999
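A back-of-the-envelope calculation (plain arithmetic, nothing library-specific) shows why a single chunk of that size cannot fit in memory:
chunk_bytes = 100_000 * 100_000 * 8  # one 100000x100000 chunk of float64 values
print(chunk_bytes / 1e9)  # 80.0 -- about 80 GB for a single chunk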
This immediately leads to a memory error when trying to store it, e.g., to a zarr archive:
ds2 = da2.to_dataset(name='foo')
ds2.to_zarr(store='foo', mode='w')
Re-chunking to 100x100 before storing doesn't help either, presumably because the single huge chunk still has to be materialized before it can be split; it just takes a lot longer before crashing with a memory error:
da3 = da2.chunk({'x': n, 'y': n})
ds3 = da3.to_dataset(name='foo')
ds3.to_zarr(store='foo', mode='w')
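A possible workaround, sketched here under the assumption that the new coordinates simply extend the old ones (so re-indexing amounts to padding with NaN), is to build the enlarged array directly from chunked dask pieces instead of calling reindex:
import numpy as np
import dask.array as dsa
import xarray as xr
n, n2 = 100, 100000
x = np.arange(n)
y = np.arange(n)
da = xr.DataArray(np.zeros((n, n)), coords=[x, y], dims=['x', 'y']).chunk({'x': n, 'y': n})
# Pad the original block with NaN-filled dask arrays that are chunked from the
# start, instead of letting reindex create one gigantic chunk.
right = dsa.full((n, n2 - n), np.nan, chunks=(n, n))
top = dsa.concatenate([da.data, right], axis=1)         # (100, 100000)
bottom = dsa.full((n2 - n, n2), np.nan, chunks=(n, n))  # (99900, 100000)
full = dsa.concatenate([top, bottom], axis=0)           # (100000, 100000)
da2_alt = xr.DataArray(full, coords=[np.arange(n2), np.arange(n2)], dims=['x', 'y'])
print(da2_alt.data.chunksize)  # (100, 100)
Storing da2_alt to zarr should then proceed chunk by chunk rather than materializing the whole array at once.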