
The method to_netcdf does not preserve chunks #8385


Open
yt87 opened this issue Oct 27, 2023 · 4 comments

Comments


yt87 commented Oct 27, 2023

What happened?

The methods to_zarr and to_netcdf behave inconsistently for a chunked dataset: the latter does not preserve the existing chunk information; the chunks must be specified explicitly in the encoding dictionary.

What did you expect to happen?

I expected the behaviour to be consistent for all to_XXX() methods.

Minimal Complete Verifiable Example

import xarray as xr
import dask.array as da

rng = da.random.RandomState()
shape = (20, 20)
chunks = [10, 10]
dims = ["x", "y"]
z = rng.standard_normal(shape, chunks=chunks)
ds = xr.DataArray(z, dims=dims, name="z").to_dataset()
ds.chunks
# This one is rechunked
ds.to_netcdf("/tmp/test1.nc", encoding={"z": {"chunksizes": (5, 5)}})
# This one is not rechunked, also original chunks are lost
ds.chunk({"x": 5, "y": 5}).to_netcdf("/tmp/test2.nc")
# This one is rechunked
ds.chunk({"x": 5, "y": 5}).to_zarr("/tmp/test2", mode="w")

Frozen({'x': (10, 10), 'y': (10, 10)})
<xarray.backends.zarr.ZarrStore at 0x7f3669f1af80>

xr.open_mfdataset("/tmp/test1.nc").chunks
xr.open_mfdataset("/tmp/test2.nc").chunks
xr.open_mfdataset("/tmp/test2", engine="zarr").chunks

Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})
Frozen({'x': (20,), 'y': (20,)})
Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

I got the same results with the h5netcdf and scipy backends, so I am not sure whether this is a bug or not.
The above code is a modified version of #2198.
A suggestion: the documentation provides only examples of encoding styles; it would be helpful to add links to a full specification.
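
For what it's worth, one way to check the chunking actually written to disk, independently of how xarray re-chunks on open, is netCDF4's Variable.chunking() (a minimal sketch, assuming the files from the example above):

import netCDF4

# Inspect the HDF5 chunking written to disk.
# Variable.chunking() returns a list of chunk sizes, or 'contiguous'.
for path in ("/tmp/test1.nc", "/tmp/test2.nc"):
    with netCDF4.Dataset(path) as nc:
        print(path, nc.variables["z"].chunking())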

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.5-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.0
distributed: 2023.10.0
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: None
numbagg: 0.5.1
fsspec: 2023.10.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None

@yt87 added the bug and needs triage labels on Oct 27, 2023

NikosAlexandris commented Oct 28, 2023

I have been experimenting with this for the last few hours as well. Here is a confirmation:

In [43]: xr.open_dataset('SDUds202001010000004231000101MA.nc')['SDU'].encoding
Out[43]:
{'zlib': True,
 'szip': False,
 ..
 'contiguous': False,
 'chunksizes': (1, 1, 2600),
 ..
 'original_shape': (1, 2600, 2600),
 ..}

Rechunking and writing out

sdu20200101_rechunked = xr.open_dataset(
    'SDUds202001010000004231000101MA.nc',
    decode_cf = True,  # useful at this step ?
    chunks = {'time': 1, 'lat': 10, 'lon': 10}
)
sdu20200101_rechunked.to_netcdf('sdu20200101_rechunked.nc')
sdu20200101_rechunked.to_zarr('sdu20200101_rechunked.zarr')

Diagnosing

In [48]: xr.open_dataset('sdu20200101_rechunked.nc').chunksizes
Out[48]: Frozen({})

In [49]: xr.open_zarr('sdu20200101_rechunked.zarr')['SDU'].chunksizes
Out[49]: Frozen({'time': (1,), 'lat': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), 'lon': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10)})


yt87 commented Oct 28, 2023

Maybe this should be rephrased as a usage question: why don't the netCDF backends pull the chunk information from the variable? At least for my example, I can make it work by adding

if 'chunksizes' not in encoding:
    encoding['chunksizes'] = tuple(max(c) for c in variable.chunks)

after the else: statement on line 509 in https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py#L509
I suspect there must be a reason why it can't be done.
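
A user-side alternative, rather than patching the backend, is to build the encoding from the dask chunks before writing. A minimal sketch (build_chunk_encoding is just an illustrative helper name, not an xarray API):

def build_chunk_encoding(ds):
    # Derive a per-variable 'chunksizes' encoding from the dask chunks,
    # mirroring what the suggested backend change would do.
    return {
        name: {"chunksizes": tuple(max(c) for c in var.chunks)}
        for name, var in ds.data_vars.items()
        if var.chunks is not None
    }

chunked = ds.chunk({"x": 5, "y": 5})
chunked.to_netcdf("/tmp/test3.nc", encoding=build_chunk_encoding(chunked))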


NikosAlexandris commented Oct 31, 2023

By the way, I still cannot find a practical way using Python to rechunk data in a NetCDF file. I can do so using the standard nccopy utility.

RondeauG added a commit to Ouranosinc/xscen that referenced this issue Apr 11, 2024
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #xyz
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features).
- [ ] (If applicable) Tests have been added.
- [x] This PR does not seem to break the templates.
- [x] CHANGES.rst has been updated (with summary of main changes).
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added.

### What kind of change does this PR introduce?

* `original_shape` and `chunksizes` don't play well together. This PR
makes sure that `original_shape` is always removed before saving a
dataset.
* Also (maybe new in the latest version of `xarray` and the `netcdf4`
engine?), it appears that dropping `chunksizes` leads to unexpected
behaviour, such as bloated file sizes and incorrect chunking on disk.
Thus, the `chunksizes` encoding was made more explicit.

### Does this PR introduce a breaking change?

* No.


### Other information:

Related Issues:
pydata/xarray#8385
pydata/xarray#8062
@TomNicholas added the topic-backends label and removed the needs triage label on Dec 17, 2024
@doblerone

> By the way, I still cannot find a practical way using Python to rechunk data in a NetCDF file. I can do so using the standard nccopy utility.

Hi. I recently rechunked a bunch of 25 GB NetCDF files from chunks (time, lat, lon) = (1, 500, 1000) to (time, lat, lon) = (8760, 1, 1) using:

import xarray as xr

ds = xr.open_dataset('field_access_opt.nc')  # chunks (time, lat, lon): 1, 500, 1000

# re-chunk variable 'GHI'
ds.to_netcdf(
    "point_access_opt.nc",
    encoding={
        'lat': {'zlib': False, '_FillValue': None},
        'lon': {'zlib': False, '_FillValue': None},
        'time': {'zlib': False, '_FillValue': None, 'dtype': 'double'},
        'GHI': {'chunksizes': [len(ds['GHI'].time), 1, 1],
                'zlib': True, 'complevel': 1},
    },
)

Original files:
https://thredds.met.no/thredds/catalog/sunpoint/ML-Optimized-Maps/hourly/field_access/catalog.html

Rechunked files:
https://thredds.met.no/thredds/catalog/sunpoint/ML-Optimized-Maps/hourly/point_access/catalog.html

(MEM usage was approx. 2x file size)
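
If memory is a concern, opening the source lazily with dask chunks before writing might keep the footprint lower. An untested sketch (the dask chunk sizes below are placeholders, and the on-disk chunking still has to be set via encoding, as discussed above):

import xarray as xr

# Lazy open: dask loads the data block by block instead of all at once.
ds = xr.open_dataset("field_access_opt.nc",
                     chunks={"time": -1, "lat": 50, "lon": 50})

ds.to_netcdf(
    "point_access_opt.nc",
    encoding={"GHI": {"chunksizes": (ds.sizes["time"], 1, 1),
                      "zlib": True, "complevel": 1}},
)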

Feel free to remove this comment if it is not helpful.
