
The method to_netcdf does not preserve chunks #8385


Open
yt87 opened this issue Oct 27, 2023 · 4 comments

Comments


yt87 commented Oct 27, 2023

What happened?

The methods to_zarr and to_netcdf behave inconsistently for a chunked dataset: the latter does not preserve the existing chunk information; the chunks must be specified explicitly in the encoding dictionary.

What did you expect to happen?

I expected the behaviour to be consistent for all to_XXX() methods.

Minimal Complete Verifiable Example

import xarray as xr
import dask.array as da

rng = da.random.RandomState()
shape = (20, 20)
chunks = [10, 10]
dims = ["x", "y"]
z = rng.standard_normal(shape, chunks=chunks)
ds = xr.DataArray(z, dims=dims, name="z").to_dataset()
ds.chunks
# This one is rechunked
ds.to_netcdf("/tmp/test1.nc", encoding={"z": {"chunksizes": (5, 5)}})
# This one is not rechunked, also original chunks are lost
ds.chunk({"x": 5, "y": 5}).to_netcdf("/tmp/test2.nc")
# This one is rechunked
ds.chunk({"x": 5, "y": 5}).to_zarr("/tmp/test2", mode="w")

Frozen({'x': (10, 10), 'y': (10, 10)})
<xarray.backends.zarr.ZarrStore at 0x7f3669f1af80>

xr.open_mfdataset("/tmp/test1.nc").chunks
xr.open_mfdataset("/tmp/test2.nc").chunks
xr.open_mfdataset("/tmp/test2", engine="zarr").chunks

Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})
Frozen({'x': (20,), 'y': (20,)})
Frozen({'x': (5, 5, 5, 5), 'y': (5, 5, 5, 5)})

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

I got the same results with the h5netcdf and scipy backends, so I am not sure whether this is a bug or not.
The above code is a modified version of #2198.
A suggestion: the documentation provides only examples of encoding styles; it would be helpful to add links to a full specification.
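
For what it's worth, one way to check the chunking actually written to disk, independently of how xarray re-chunks on open, is netCDF4's Variable.chunking() (a minimal sketch, assuming the files from the example above):

import netCDF4

# Inspect the HDF5 chunking written to disk.
# Variable.chunking() returns a list of chunk sizes, or 'contiguous'.
for path in ("/tmp/test1.nc", "/tmp/test2.nc"):
    with netCDF4.Dataset(path) as nc:
        print(path, nc.variables["z"].chunking())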

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.6 | packaged by conda-forge | (main, Oct 3 2023, 10:40:35) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 6.5.5-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.14.2
libnetcdf: 4.9.2

xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.0
distributed: 2023.10.0
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: None
numbagg: 0.5.1
fsspec: 2023.10.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None

@yt87 added the bug and needs triage labels on Oct 27, 2023

NikosAlexandris commented Oct 28, 2023

I have been experimenting with this for the last few hours as well. Here is a confirmation:

In [43]: xr.open_dataset('SDUds202001010000004231000101MA.nc')['SDU'].encoding
Out[43]:
{'zlib': True,
 'szip': False,
 ..
 'contiguous': False,
 'chunksizes': (1, 1, 2600),
 ..
 'original_shape': (1, 2600, 2600),
 ..}

Rechunking and writing out

sdu20200101_rechunked = xr.open_dataset(
    'SDUds202001010000004231000101MA.nc',
    decode_cf = True,  # useful at this step ?
    chunks = {'time': 1, 'lat': 10, 'lon': 10}
)
sdu20200101_rechunked.to_netcdf('sdu20200101_rechunked.nc')
sdu20200101_rechunked.to_zarr('sdu20200101_rechunked.zarr')

Diagnosing

In [48]: xr.open_dataset('sdu20200101_rechunked.nc').chunksizes
Out[48]: Frozen({})

In [49]: xr.open_zarr('sdu20200101_rechunked.zarr')['SDU'].chunksizes
Out[49]: Frozen({'time': (1,), 'lat': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10), 'lon': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10)})


yt87 commented Oct 28, 2023

Maybe this should be rephrased as a usage question: why don't the netCDF backends pull the chunk information from the variable? At least for my example, I can make it work by adding

if 'chunksizes' not in encoding:
    encoding['chunksizes'] = tuple(max(c) for c in variable.chunks)

after the else: statement on line 509 in https://github.com/pydata/xarray/blob/main/xarray/backends/netCDF4_.py#L509
I suspect there must be a reason why it can't be done.
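
A user-side alternative, rather than patching the backend, is to build the encoding from the dask chunks before writing. A minimal sketch (build_chunk_encoding is just an illustrative helper name, not an xarray API):

def build_chunk_encoding(ds):
    # Derive a per-variable 'chunksizes' encoding from the dask chunks,
    # mirroring what the suggested backend change would do.
    return {
        name: {"chunksizes": tuple(max(c) for c in var.chunks)}
        for name, var in ds.data_vars.items()
        if var.chunks is not None
    }

chunked = ds.chunk({"x": 5, "y": 5})
chunked.to_netcdf("/tmp/test3.nc", encoding=build_chunk_encoding(chunked))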


NikosAlexandris commented Oct 31, 2023

By the way, I still cannot find a practical way using Python to rechunk data in a NetCDF file. I can do so using the standard nccopy utility.

RondeauG added a commit to Ouranosinc/xscen that referenced this issue Apr 11, 2024
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes /
features)
    - This PR fixes #xyz
- [x] (If applicable) Documentation has been added / updated (for bug
fixes / features).
- [ ] (If applicable) Tests have been added.
- [x] This PR does not seem to break the templates.
- [x] CHANGES.rst has been updated (with summary of main changes).
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`)
has been added.

### What kind of change does this PR introduce?

* `original_shape` and `chunksizes` don't play well together. This PR
makes sure that `original_shape` is always removed before saving a
dataset.
* Also (maybe new in the latest version of `xarray` and the `netcdf4`
engine?), it appears that dropping `chunksizes` leads to unexpected
behaviour, such as bloated file sizes and incorrect chunking on disk.
Thus, the `chunksizes` encoding was made more explicit.

### Does this PR introduce a breaking change?

* No.


### Other information:

Related Issues:
pydata/xarray#8385
pydata/xarray#8062
@TomNicholas added the topic-backends label and removed the needs triage label on Dec 17, 2024
@doblerone

> By the way, I still cannot find a practical way using Python to rechunk data in a NetCDF file. I can do so using the standard nccopy utility.

Hi. I recently rechunked a bunch of 25 GB NetCDF files from chunks (time, lat, lon) = (1, 500, 1000) to (time, lat, lon) = (8760, 1, 1) using:

import xarray as xr

ds = xr.open_dataset('field_access_opt.nc')  # chunks (time, lat, lon): 1, 500, 1000

# re-chunk variable 'GHI'
ds.to_netcdf(
    "point_access_opt.nc",
    encoding={
        'lat': {'zlib': False, '_FillValue': None},
        'lon': {'zlib': False, '_FillValue': None},
        'time': {'zlib': False, '_FillValue': None, 'dtype': 'double'},
        'GHI': {'chunksizes': [len(ds['GHI'].time), 1, 1],
                'zlib': True, 'complevel': 1},
    },
)

Original files:
https://thredds.met.no/thredds/catalog/sunpoint/ML-Optimized-Maps/hourly/field_access/catalog.html

Rechunked files:
https://thredds.met.no/thredds/catalog/sunpoint/ML-Optimized-Maps/hourly/point_access/catalog.html

(MEM usage was approx. 2x file size)
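
If memory is a concern, opening the source lazily with dask chunks before writing might keep the footprint lower. An untested sketch (the dask chunk sizes below are placeholders, and the on-disk chunking still has to be set via encoding, as discussed above):

import xarray as xr

# Lazy open: dask loads the data block by block instead of all at once.
ds = xr.open_dataset("field_access_opt.nc",
                     chunks={"time": -1, "lat": 50, "lon": 50})

ds.to_netcdf(
    "point_access_opt.nc",
    encoding={"GHI": {"chunksizes": (ds.sizes["time"], 1, 1),
                      "zlib": True, "complevel": 1}},
)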

Feel free to remove this comment if it is not helpful.
