Writing Datasets to netCDF4 with "inconsistent" chunks #2254

Closed
neishm opened this issue Jun 27, 2018 · 3 comments

neishm commented Jun 27, 2018

Code Sample

import xarray as xr
from dask.array import zeros, ones

# Construct two variables with the same dimensions, but different chunking
x = zeros((100, 100), dtype='f4', chunks=(50, 100))
x = xr.DataArray(data=x, dims=('lat', 'lon'), name='x')
y = ones((100, 100), dtype='f4', chunks=(100, 50))
y = xr.DataArray(data=y, dims=('lat', 'lon'), name='y')

# Put them both into the same dataset
dset = xr.Dataset({'x': x, 'y': y})

# Save to a netCDF4 file.
dset.to_netcdf("test.nc")

The last line results in

ValueError: inconsistent chunks

Problem description

This error is triggered by xarray.backends.api.to_netcdf's use of the dataset.chunks property in two places:

if (dataset.chunks and scheduler in ['distributed', 'multiprocessing'] and

autoclose = (dataset.chunks and

I'm assuming to_netcdf only needs to know whether chunks are being used, not whether they're consistent?

If I define a more general check

have_chunks = any(v.chunks for v in dataset.variables.values())

and replace the instances of dataset.chunks with have_chunks, then the netCDF4 file gets written without any problems (although the data seems to be stored contiguously instead of chunked).

Is this change as straightforward as I think, or is there something intrinsic about xarray.Dataset objects or writing to netCDF4 that requires consistent chunks?

Output of xr.show_versions()

commit: bb581ca
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-128-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.5
scipy: None
netCDF4: 1.4.0
h5netcdf: None
h5py: None
Nio: None
zarr: None
bottleneck: None
cyordereddict: None
dask: 0.17.5
distributed: None
matplotlib: None
cartopy: None
seaborn: None
setuptools: 39.2.0
pip: 10.0.1
conda: None
pytest: None
IPython: None
sphinx: None

shoyer commented Jun 27, 2018

For reference, here's the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-6a835b914234> in <module>()
     12
     13 # Save to a netCDF4 file.
---> 14 dset.to_netcdf("test.nc")

~/dev/xarray/xarray/core/dataset.py in to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute)
   1148                          engine=engine, encoding=encoding,
   1149                          unlimited_dims=unlimited_dims,
-> 1150                          compute=compute)
   1151
   1152     def to_zarr(self, store=None, mode='w-', synchronizer=None, group=None,

~/dev/xarray/xarray/backends/api.py in to_netcdf(dataset, path_or_file, mode, format, group, engine, writer, encoding, unlimited_dims, compute)
    701     # handle scheduler specific logic
    702     scheduler = get_scheduler()
--> 703     if (dataset.chunks and scheduler in ['distributed', 'multiprocessing'] and
    704             engine != 'netcdf4'):
    705         raise NotImplementedError("Writing netCDF files with the %s backend "

~/dev/xarray/xarray/core/dataset.py in chunks(self)
   1237                 for dim, c in zip(v.dims, v.chunks):
   1238                     if dim in chunks and c != chunks[dim]:
-> 1239                         raise ValueError('inconsistent chunks')
   1240                     chunks[dim] = c
   1241         return Frozen(SortedKeysDict(chunks))

ValueError: inconsistent chunks
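The raising logic visible in the traceback above can be sketched in plain Python: per-dimension chunk sizes are merged across all variables, and any mismatch raises. (This is an illustrative stand-alone sketch; the function and data-structure names are not xarray's API.)

```python
# Minimal sketch of the per-dimension chunk merge that Dataset.chunks
# performs.  variables: mapping name -> (dims tuple, chunks tuple or None).
def merged_chunks(variables):
    chunks = {}
    for dims, var_chunks in variables.values():
        if var_chunks is None:  # non-dask variables contribute nothing
            continue
        for dim, c in zip(dims, var_chunks):
            if dim in chunks and c != chunks[dim]:
                raise ValueError('inconsistent chunks')
            chunks[dim] = c
    return chunks

# 'x' chunked (50, 100) and 'y' chunked (100, 50) over the same dims:
vars_ = {'x': (('lat', 'lon'), ((50, 50), (100,))),
         'y': (('lat', 'lon'), ((100,), (50, 50)))}
# merged_chunks(vars_) raises ValueError: inconsistent chunks
```

Checking `v.chunks` per variable, as suggested above, sidesteps this merge entirely.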

So yes, it looks like we could fix this by checking chunks on each array independently like you suggest. There's no reason why all dask arrays need to have the same chunking for storing with to_netcdf().

> and replace the instances of dataset.chunks with have_chunks, then the netCDF4 file gets written without any problems (although the data seems to be stored contiguously instead of chunked).

This is because you need to indicate chunks for variables separately, via encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data
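For the example in this issue, a per-variable encoding would look something like this (a sketch; the chunksizes simply mirror the dask chunks from the original code sample):

```python
# Per-variable on-disk chunking is requested through `encoding`,
# keyed by variable name; 'chunksizes' matches each array's dask chunks.
encoding = {
    'x': {'chunksizes': (50, 100)},
    'y': {'chunksizes': (100, 50)},
}
# With the Dataset `dset` from the code sample above, the call would be:
#   dset.to_netcdf("test.nc", encoding=encoding)
```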

neishm commented Jun 27, 2018

> So yes, it looks like we could fix this by checking chunks on each array independently like you suggest. There's no reason why all dask arrays need to have the same chunking for storing with to_netcdf().

I could throw together a pull request if that's all that's involved.

> This is because you need to indicate chunks for variables separately, via encoding: http://xarray.pydata.org/en/stable/io.html#writing-encoded-data

Thanks! I was able to write chunked output to the netCDF file by adding chunksizes to the encoding attribute of the variables. I found I also had to specify original_shape as a workaround for #2198.
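Concretely, the workaround amounts to an encoding dict like the following (a sketch based on the example arrays in this issue; the original_shape entries are the workaround for #2198 mentioned above):

```python
# Per-variable encoding: 'chunksizes' sets the on-disk netCDF4 chunking,
# and 'original_shape' is set explicitly as a workaround for issue #2198.
encoding = {
    'x': {'chunksizes': (50, 100), 'original_shape': (100, 100)},
    'y': {'chunksizes': (100, 50), 'original_shape': (100, 100)},
}
# Applied to the Dataset from the code sample:
#   dset.to_netcdf("test.nc", encoding=encoding)
```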

shoyer commented Jun 27, 2018 via email

neishm pushed a commit to neishm/xarray that referenced this issue Jun 28, 2018
neishm pushed a commit to neishm/xarray that referenced this issue Jun 29, 2018
shoyer pushed a commit that referenced this issue Jun 29, 2018
* Test case for writing Datasets to netCDF4 where each DataArray has different chunk sizes.

* When writing Datasets to netCDF4, don't need the chunk sizes to be consistent over all arrays.

Closes #2254.

* Added a note about the bugfix for #2254.