RFC: Zarr as the default storage #131

Open
markpayneatwork opened this issue Jan 27, 2025 · 6 comments

@markpayneatwork
Contributor

Proposal:

Use .zarr files as the default storage for climate variables. Indicators will still be stored as .nc

Background
.zarr is a newer file format that does many of the same things that NetCDF does, but with a more modern design. In particular, .zarr is designed to handle very large datasets better than NetCDF, especially in cases where file sizes start to exceed available memory. In recent KAPy development we have started hitting exactly these problems, with file sizes in excess of 100 GB; .zarr may be a solution.

The proposal is to switch storage for climate variables to .zarr, potentially replacing the pickling option currently used to store xarray objects. NetCDF would be retained for storing indicators, as these are generally smaller and fit comfortably in memory.
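To make the split concrete, here is a minimal sketch of what the two storage paths could look like with xarray. The file names, variable names and chunk sizes are purely illustrative, not KAPy's actual API.

```python
import xarray as xr

# Climate variable: potentially bigger than memory, so open lazily with dask
# chunks and write to a chunked Zarr store (illustrative names throughout).
ds = xr.open_dataset("tas_daily.nc", chunks={"time": 365})
ds.to_zarr("tas_daily.zarr", mode="w")

# Indicator: much smaller and fits in memory, so NetCDF is retained.
indicator = ds["tas"].resample(time="YS").mean().to_dataset(name="tas_annual_mean")
indicator.to_netcdf("tas_annual_mean.nc")
```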

Advantages

  • Enable processing chains where RAM limitations become a problem

Disadvantages

  • New format, unknown to many (most?)
  • Doesn't play well with many of the standard tools, e.g. CDO, ncview, NCO, ncdump
@doblerone
Collaborator

Some thoughts from my side:

  • I wouldn't agree that Zarr works better with very large datasets than NetCDF. That's mainly a Python point of view. But since we are in the Python world here, I guess it's valid ;-)
  • there is this NetCDF Zarr implementation:
    https://docs.unidata.ucar.edu/netcdf-c/4.9.2/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html
    I have never tried it and do not completely understand the description there, but it MAY be possible that Zarr (or NCZarr?) files can be handled by standard tools like CDO, NCO etc. if they are compiled with the right flags, libraries etc.
  • we are doing bias-adjustment for Norway on a 1 km grid, i.e. 1100 × 1500 grid points. The only way to do this is to read a 30+ yr time series for a point, or a longitude band (1500 grid points), into memory at a time (see the sketch below). NetCDF allows sub-section access to files (which can be improved significantly by correct chunking), i.e. you don't need to read/write the whole file into memory as long as you don't include spatial correlations etc., i.e. you do the bias-adjustment pointwise.

Bottom line: if reading the data point- or longitude-wise is not an option for you, going for Zarr might be a good choice. I guess we (I) need to find out how to handle Zarr files with our common tools (read: CDO) anyway, since it's emerging everywhere.
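For reference, a minimal sketch of the sub-section access pattern described above, assuming an xarray/netCDF4 stack; the file, variable and dimension names (tas, y, x) are placeholders.

```python
import xarray as xr

# Placeholder names: a 1100 x 1500 grid with a 30+ yr daily time series.
ds = xr.open_dataset("model_1km_norway.nc")

# Read the full time series for a single grid point into memory ...
point = ds["tas"].isel(y=550, x=750).load()

# ... or for one longitude band of 1500 grid points at a time.
band = ds["tas"].isel(y=550).load()

# If the on-disk chunking matches this access pattern (long in time, small in
# space), each .load() only touches the chunks it needs, not the whole file.
```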

@markpayneatwork
Contributor Author

You've hit the nail on the head here Andreas - chunking is the issue. NetCDF files can be chunked to work well with time-axis oriented problems (e.g. bias-correcting a time series at a given pixel), but unfortunately Python/xarray doesn't seem to support writing chunks properly at the moment - see e.g.:
pydata/xarray#8385
Zarr is one solution to this problem - chunking seems to work well there - but as you say, the loss of the standard tools is a major inconvenience.
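For the record, the mechanism in question is xarray's `encoding` argument to `to_netcdf()`. A rough sketch is below, with placeholder names and chunk sizes; whether the requested `chunksizes` are actually honoured on disk is exactly what pydata/xarray#8385 is about.

```python
import xarray as xr

# Placeholder names and chunk sizes. The chunksizes tuple must match the
# variable's dimension order (assumed here to be time, y, x).
ds = xr.open_dataset("tas_daily.nc")

encoding = {
    "tas": {
        "chunksizes": (ds.sizes["time"], 10, 10),  # long in time, small in space
        "zlib": True,
        "complevel": 4,
    }
}
ds.to_netcdf("tas_daily_rechunked.nc", encoding=encoding)
```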

@doblerone
Collaborator

I just put a comment there :-)
Had to rechunk a bunch of files recently and found Python to outperform nccopy etc.

@doblerone
Collaborator

PS: sitting in Malmö and waiting for the (delayed) night train to Hamburg. Thus reading and answering emails :-)

@markpayneatwork
Contributor Author

Funnily enough, I also found this Stack Overflow question last night that looks suspiciously similar to the example you gave in the xarray issue:
https://stackoverflow.com/questions/72893340/change-chunk-block-shape-in-netcdf-file

:-)

I have tried this out today and it appears to be working. If I can get NetCDF chunking to behave itself, it should, in theory, allow dask to solve the rest of the problems with bigger-than-memory datasets and therefore remove the need for Zarr. Let's see if it works...
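A rough sketch of what that dask-based workflow could look like: open the (correctly chunked) NetCDF with matching dask chunks so the computation stays out-of-core. Names, chunk sizes and the stand-in calculation are all illustrative.

```python
import xarray as xr

# Illustrative only: open a sensibly chunked NetCDF with matching dask chunks
# so the computation stays lazy and out-of-core.
ds = xr.open_dataset(
    "tas_daily_rechunked.nc",
    chunks={"time": -1, "y": 10, "x": 10},  # full time series per spatial block
)

# Stand-in for the real per-pixel processing: a monthly climatology computed
# block by block under dask, without loading the whole dataset into memory.
clim = ds["tas"].groupby("time.month").mean("time")
clim.to_netcdf("tas_monthly_climatology.nc")
```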

@doblerone
Collaborator

> Funnily enough, I also found this Stack Overflow question last night that looks suspiciously similar to the example you gave in the xarray issue: https://stackoverflow.com/questions/72893340/change-chunk-block-shape-in-netcdf-file
>
> :-)

It's also the same username ;-)
