The method to_netcdf does not preserve chunks #8385
Comments
I have been experimenting with this for the last few hours too. Here is a confirmation:

```python
In [43]: xr.open_dataset('SDUds202001010000004231000101MA.nc')['SDU'].encoding
Out[43]:
{'zlib': True,
 'szip': False,
 ..
 'contiguous': False,
 'chunksizes': (1, 1, 2600),
 ..
 'original_shape': (1, 2600, 2600),
 ..}
```

Rechunking and writing out:

```python
sdu20200101_rechunked = xr.open_dataset(
    'SDUds202001010000004231000101MA.nc',
    decode_cf=True,  # useful at this step?
    chunks={'time': 1, 'lat': 10, 'lon': 10},
)
sdu20200101_rechunked.to_netcdf('sdu20200101_rechunked.nc')
sdu20200101_rechunked.to_zarr('sdu20200101_rechunked.zarr')
```

Diagnosing:

```python
In [48]: xr.open_dataset('sdu20200101_rechunked.nc').chunksizes
Out[48]: Frozen({})

In [49]: xr.open_zarr('sdu20200101_rechunked.zarr')['SDU'].chunksizes
Out[49]: Frozen({'time': (1,), 'lat': (10, 10, ..., 10), 'lon': (10, 10, ..., 10)})
# lat and lon each consist of 260 chunks of size 10
```
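Note that `Dataset.chunksizes` reports in-memory dask chunking rather than on-disk storage chunking, so it is empty whenever a file is opened without `chunks=`. A small sketch (file and variable names taken from the example above) of checking the on-disk chunking directly with the `netCDF4` library:

```python
import netCDF4

# netCDF4 reports the storage chunking actually written to disk,
# independent of any dask chunking used in memory by xarray.
with netCDF4.Dataset('sdu20200101_rechunked.nc') as nc:
    print(nc.variables['SDU'].chunking())  # 'contiguous', or a list like [1, 10, 10]
```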
Maybe this should be rephrased as a usage question: why don't the netCDF backends pull the chunk information from the variable? At least for my example, I can make it work by adding the chunk information to the variable's `encoding` before writing.
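A minimal sketch of that kind of workaround, reusing the file, variable name, and chunk sizes from the example above (an illustration of the idea, not the commenter's exact snippet):

```python
import xarray as xr

# Open with dask chunks, then spell the desired on-disk chunking out explicitly
# in the per-variable encoding so to_netcdf writes chunked storage.
ds = xr.open_dataset(
    'SDUds202001010000004231000101MA.nc',
    chunks={'time': 1, 'lat': 10, 'lon': 10},
)
ds.to_netcdf(
    'sdu20200101_rechunked.nc',
    encoding={'SDU': {'chunksizes': (1, 10, 10)}},
)
```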
By the way, I still cannot find a practical way using Python to rechunk data in a NetCDF file. I can do so using the standard …
### Pull Request Checklist:
- [x] This PR addresses an already opened issue (for bug fixes / features)
    - This PR fixes #xyz
- [x] (If applicable) Documentation has been added / updated (for bug fixes / features).
- [ ] (If applicable) Tests have been added.
- [x] This PR does not seem to break the templates.
- [x] CHANGES.rst has been updated (with summary of main changes).
- [x] Link to issue (:issue:`number`) and pull request (:pull:`number`) has been added.

### What kind of change does this PR introduce?
* `original_shape` and `chunksizes` don't play well together. This PR makes sure that `original_shape` is always removed before saving a dataset.
* Also (maybe new in the latest version of `xarray` and engine `netcdf4`?), it appears that dropping `chunksizes` leads to unexpected behaviours, such as bloated file size and incorrect chunking on disk. Thus, the `chunksizes` encoding was made more explicit.

### Does this PR introduce a breaking change?
* No.

### Other information:
Related issues: pydata/xarray#8385, pydata/xarray#8062
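A rough sketch of the kind of encoding cleanup described there (a generic helper under the assumption of uniform dask chunks; not the actual PR code):

```python
import xarray as xr

def clean_encoding(ds: xr.Dataset) -> xr.Dataset:
    """Drop stale `original_shape` and make `chunksizes` explicit before writing."""
    for var in ds.variables.values():
        # `original_shape` describes the source file and can conflict with the
        # data's shape or chunking after slicing or rechunking, so remove it.
        var.encoding.pop('original_shape', None)
        # For dask-backed variables, record the current chunking explicitly so
        # the netCDF backend writes matching on-disk chunks (assumes uniform
        # chunk sizes along each dimension).
        if var.chunks is not None:
            var.encoding['chunksizes'] = tuple(c[0] for c in var.chunks)
    return ds

# clean_encoding(ds).to_netcdf('out.nc')
```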
Hi. I did recently rechunk a bunch of 25 GB NetCDF files from chunks (time, lat, lon): 1, 500, 1000 to …

```python
import xarray as xr

ds = xr.open_dataset('field_access_opt.nc')  # chunks (time, lat, lon): 1, 500, 1000
# re-chunk variable 'GHI'
```

Original files: …
Rechunked files: …
(MEM usage was approx. 2x file size)

Feel free to remove this comment if it is not helpful.
What happened?

The methods `to_zarr` and `to_netcdf` behave inconsistently for chunked datasets. The latter does not preserve existing chunk information; the chunks must be specified within the `encoding` dictionary.

What did you expect to happen?

I expected the behaviour to be consistent for all `to_XXX()` methods.

Minimal Complete Verifiable Example
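A minimal sketch of the inconsistency described above, with synthetic data and hypothetical file names (not necessarily the original example):

```python
import numpy as np
import xarray as xr

# A small dask-chunked dataset.
ds = xr.Dataset({"var": (("x", "y"), np.zeros((100, 100)))}).chunk({"x": 10, "y": 10})

ds.to_zarr("test.zarr", mode="w")
ds.to_netcdf("test.nc")

print(xr.open_zarr("test.zarr")["var"].encoding.get("chunks"))
# (10, 10): the zarr store kept the dask chunks
print(xr.open_dataset("test.nc")["var"].encoding.get("chunksizes"))
# typically None: the dask chunks were not carried through to the netCDF file
```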
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
I did get the same results for the `h5netcdf` and `scipy` backends, so I am not sure whether this is a bug or not. The above code is a modified version of #2198.
A suggestion: the documentation provides only examples of encoding styles. It would be helpful to provide links to a full specification.
Environment
xarray: 2023.10.1
pandas: 2.1.1
numpy: 1.24.4
scipy: 1.11.3
netCDF4: 1.6.4
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.10.0
distributed: 2023.10.0
matplotlib: 3.8.0
cartopy: 0.22.0
seaborn: None
numbagg: 0.5.1
fsspec: 2023.10.0
cupy: None
pint: None
sparse: 0.14.0
flox: 0.8.1
numpy_groupies: 0.10.2
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: None
mypy: None
IPython: 8.16.1
sphinx: None