Writing Datasets to netCDF4 with "inconsistent" chunks #2254

Closed
neishm opened this issue Jun 27, 2018 · 3 comments

neishm commented Jun 27, 2018

Code Sample

import xarray as xr
from dask.array import zeros, ones

# Construct two variables with the same dimensions, but different chunking
x = zeros((100, 100), dtype='f4', chunks=(50, 100))
x = xr.DataArray(data=x, dims=('lat', 'lon'), name='x')
y = ones((100, 100), dtype='f4', chunks=(100, 50))
y = xr.DataArray(data=y, dims=('lat', 'lon'), name='y')

# Put them both into the same dataset
dset = xr.Dataset({'x': x, 'y': y})

# Save to a netCDF4 file.
dset.to_netcdf("test.nc")

The last line results in

ValueError: inconsistent chunks

Problem description

This error is triggered by xarray.backends.api.to_netcdf's use of the dataset.chunks property in two places:

if (dataset.chunks and scheduler in ['distributed', 'multiprocessing'] and

autoclose = (dataset.chunks and

I'm assuming to_netcdf only needs to know whether chunks are being used, not whether they're consistent?

If I define a more general check

have_chunks = any(v.chunks for v in dataset.variables.values())

and replace the instances of dataset.chunks with have_chunks, then the netCDF4 file gets written without any problems (although the data seems to be stored contiguously instead of chunked).

Is this change as straightforward as I think, or is there something intrinsic about xarray.Dataset objects or writing to netCDF4 that requires consistent chunks?

Output of xr.show_versions()

commit: bb581ca
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.5
scipy: None
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.17.5
distributed: None
matplotlib: None
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 10.0.1
conda: None
pytest: None
IPython: None
sphinx: None

shoyer commented Jun 27, 2018

For reference, here's the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-6a835b914234> in <module>()
     12
     13 # Save to a netCDF4 file.
---> 14 dset.to_netcdf("test.nc")

~/dev/xarray/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute)
   1148                          engine=engine, encoding=encoding,
   1149                          unlimited_dims=unlimited_dims,
-> 1150                          compute=compute)
   1151
   1152     def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,

~/dev/xarray/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, writer, encoding, unlimited_dims, compute)
    701     # handle scheduler specific logic
    702     scheduler = get_scheduler()
--> 703     if (dataset.chunks and scheduler in ['distributed', 'multiprocessing'] and
    704             engine != 'netcdf4'):
    705         raise NotImplementedError("Writing netCDF files with the %s backend "

~/dev/xarray/xarray/core/dataset.py in chunks(self)
   1237                 for dim, c in zip(v.dims, v.chunks):
   1238                     if dim in chunks and c != chunks[dim]:
-> 1239                         raise ValueError('inconsistent chunks')
   1240                     chunks[dim] = c
   1241         return Frozen(SortedKeysDict(chunks))

ValueError: inconsistent chunks
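The raising logic visible in the traceback above can be sketched in plain Python: per-dimension chunk sizes are merged across all variables, and any mismatch raises. (This is an illustrative stand-alone sketch; the function and data-structure names are not xarray's API.)

```python
# Minimal sketch of the per-dimension chunk merge that Dataset.chunks
# performs.  variables: mapping name -> (dims tuple, chunks tuple or None).
def merged_chunks(variables):
    chunks = {}
    for dims, var_chunks in variables.values():
        if var_chunks is None:  # non-dask variables contribute nothing
            continue
        for dim, c in zip(dims, var_chunks):
            if dim in chunks and c != chunks[dim]:
                raise ValueError('inconsistent chunks')
            chunks[dim] = c
    return chunks

# 'x' chunked (50, 100) and 'y' chunked (100, 50) over the same dims:
vars_ = {'x': (('lat', 'lon'), ((50, 50), (100,))),
         'y': (('lat', 'lon'), ((100,), (50, 50)))}
# merged_chunks(vars_) raises ValueError: inconsistent chunks
```

Checking `v.chunks` per variable, as suggested above, sidesteps this merge entirely.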

So yes, it looks like we could fix this by checking chunks on each array independently like you suggest. There's no reason why all dask arrays need to have the same chunking for storing with to_netcdf().

> and replace the instances of dataset.chunks with have_chunks, then the netCDF4 file gets written without any problems (although the data seems to be stored contiguously instead of chunked).

This is because you need to indicate chunks for variables separately, via encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data
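For the example in this issue, a per-variable encoding would look something like this (a sketch; the chunksizes simply mirror the dask chunks from the original code sample):

```python
# Per-variable on-disk chunking is requested through `encoding`,
# keyed by variable name; 'chunksizes' matches each array's dask chunks.
encoding = {
    'x': {'chunksizes': (50, 100)},
    'y': {'chunksizes': (100, 50)},
}
# With the Dataset `dset` from the code sample above, the call would be:
#   dset.to_netcdf("test.nc", encoding=encoding)
```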

neishm commented Jun 27, 2018

> So yes, it looks like we could fix this by checking chunks on each array independently like you suggest. There's no reason why all dask arrays need to have the same chunking for storing with to_netcdf().

I could throw together a pull request if that's all that's involved.

> This is because you need to indicate chunks for variables separately, via encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data

Thanks! I was able to write chunked output to the netCDF file by adding chunksizes to the encoding attribute of the variables. I found I also had to specify original_shape as a workaround for #2198.
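Concretely, the workaround amounts to an encoding dict like the following (a sketch based on the example arrays in this issue; the original_shape entries are the workaround for #2198 mentioned above):

```python
# Per-variable encoding: 'chunksizes' sets the on-disk netCDF4 chunking,
# and 'original_shape' is set explicitly as a workaround for issue #2198.
encoding = {
    'x': {'chunksizes': (50, 100), 'original_shape': (100, 100)},
    'y': {'chunksizes': (100, 50), 'original_shape': (100, 100)},
}
# Applied to the Dataset from the code sample:
#   dset.to_netcdf("test.nc", encoding=encoding)
```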

shoyer commented Jun 27, 2018 via email

neishm pushed a commit to neishm/xarray that referenced this issue Jun 28, 2018
neishm pushed a commit to neishm/xarray that referenced this issue Jun 29, 2018
shoyer pushed a commit that referenced this issue Jun 29, 2018
* Test case for writing Datasets to netCDF4 where each DataArray has different chunk sizes.

* When writing Datasets to netCDF4, don't need the chunk sizes to be consistent over all arrays.

Closes #2254.

* Added a note about the bugfix for #2254.