reindex doesn't preserve chunks #2745
Comments
To understand what's going on here, it may be helpful to look at what's going on inside dask:
Xarray isn't controlling chunk sizes directly here; the chunking you see falls out of how dask.array executes the indexing that reindex() performs under the hood. The alternative design would be to append an array of all NaNs along one axis, but on average I think the current implementation is faster and results in more contiguous chunks -- it's quite common to intersperse missing indices with reindex(), and alternating indexed/missing values can result in tiny chunks. Even then, I think you would probably run into performance issues. We could also conceivably put some heuristics to control chunking for this in xarray, but I'd rather do it upstream in dask.array, if possible (xarray tries to avoid thinking about chunks).
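To make the dask side concrete, here is a small sketch (that reindexing with dask-backed data boils down to integer "fancy" indexing plus masking is an inference from this comment, and the single-big-chunk output is the behavior reported under older dask releases; exact chunking depends on the dask version):

```python
import numpy as np
import dask.array as da

# One 100x100 source chunk, analogous to the small DataArray below.
x = da.ones((100, 100), chunks=(100, 100))

# All 100000 output rows are drawn from the single source chunk.
indexer = np.zeros(100_000, dtype=int)
y = x[indexer]

# Older dask releases reported one huge chunk along axis 0 here;
# newer releases split the output into smaller chunks.
print(y.chunks)
```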
This was fixed with the latest Dask release 🥳
The original issue description:

The following code creates a small (100x100) chunked `DataArray`, and then re-indexes it into a huge one (100000x100000):
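A minimal sketch of the reproducer described above (the dimension names, coordinate values, and use of random data are assumptions, not the original code):

```python
import numpy as np
import xarray as xr

# Small 100x100 DataArray backed by a single 100x100 dask chunk.
small = xr.DataArray(
    np.random.rand(100, 100),
    dims=("x", "y"),
    coords={"x": np.arange(100), "y": np.arange(100)},
).chunk({"x": 100, "y": 100})

# Reindex onto much larger coordinates; everything outside the
# original 100x100 block is filled with NaN.
big = small.reindex(x=np.arange(100_000), y=np.arange(100_000))
```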
But the re-indexed `DataArray` has `chunksize=(100000, 100000)` instead of `chunksize=(100, 100)`:
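Inspecting the chunks (variable names follow the sketch above) would show something like:

```python
# As reported in this issue, the result comes back as one single
# 100000x100000 chunk instead of inheriting the 100x100 chunking.
print(big.data.chunksize)  # (100000, 100000)
```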
Which immediately leads to a memory error when trying to, e.g., store it to a `zarr` archive:
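The store step might look like this (the store path and variable name are made up; the array is wrapped in a `Dataset` for `to_zarr`):

```python
# A single 100000x100000 float64 chunk is ~80 GB, so dask tries to
# materialize far more than fits in memory for the one write task.
big.to_dataset(name="data").to_zarr("big.zarr")  # MemoryError
```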
Trying to re-chunk to 100x100 before storing doesn't help either; this time it just takes a lot more time before crashing with a memory error:
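And the rechunking attempt (again a sketch, reusing the names above):

```python
# The rechunk is scheduled on top of the single huge source chunk,
# which still has to be materialized first, so this also runs out
# of memory -- it just does more work before failing.
big.chunk({"x": 100, "y": 100}).to_dataset(name="data").to_zarr("big2.zarr")
```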