RFC: Zarr as the default storage #131

Open
markpayneatwork opened this issue Jan 27, 2025 · 6 comments

@markpayneatwork
Contributor

Proposal:

Use .zarr files as the default storage for climate variables. Indicators will still be stored as .nc

Background
.zarr is a newer file format that does many of the same things that NetCDF does, but with a more modern design. In particular, .zarr is designed to handle very large datasets better than NetCDF, especially in cases where file sizes start to exceed available memory. In recent KAPy development we have started hitting exactly these problems, with file sizes in excess of 100 GB; .zarr may be a solution.

The proposal is to switch storage for climate variables to .zarr, potentially replacing the pickling option currently used to store xarray objects. NetCDF would be retained for storing indicators, as these are generally smaller and fit comfortably in memory.
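To make the split concrete, here is a minimal sketch of what the two storage paths could look like with xarray. The file names, variable names and chunk sizes are purely illustrative, not KAPy's actual API.

```python
import xarray as xr

# Climate variable: potentially bigger than memory, so open lazily with dask
# chunks and write to a chunked Zarr store (illustrative names throughout).
ds = xr.open_dataset("tas_daily.nc", chunks={"time": 365})
ds.to_zarr("tas_daily.zarr", mode="w")

# Indicator: much smaller and fits in memory, so NetCDF is retained.
indicator = ds["tas"].resample(time="YS").mean().to_dataset(name="tas_annual_mean")
indicator.to_netcdf("tas_annual_mean.nc")
```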

Advantages

  • Enable processing chains where RAM limitations become a problem

Disadvantages

  • New format, unknown to many (most?)
  • Doesn't play well with many of the standard tools, e.g. CDO, ncview, NCO, ncdump
@doblerone
Collaborator

Some thoughts from my side:

  • I wouldn't agree that Zarr works better with very large datasets than NetCDF. That's mainly a Python point of view. But since we are in the Python world here, I guess it's valid ;-)
  • there is this NetCDF Zarr implementation:
    https://docs.unidata.ucar.edu/netcdf-c/4.9.2/md__media_psf_Home_Desktop_netcdf_releases_v4_9_2_release_netcdf_c_docs_nczarr.html
    I have never tried it and do not completely understand the description there, but it MAY be possible that Zarr (or NCZarr?) files can be handled by standard tools like CDO, NCO etc. if they are compiled with the right flags, libraries etc.
  • we are doing bias-adjustment for Norway on a 1 km grid, i.e. 1100 × 1500 grid points. The only way to do this is to read a 30+ yr time series for a point, or a longitude band (1500 grid points), into memory at a time (see the sketch below). NetCDF allows sub-section access to files (which can be improved significantly by correct chunking), i.e. you don't need to read/write the whole file into memory as long as you don't include spatial correlations etc., i.e. you do the bias-adjustment pointwise.

Bottom line: if reading the data point- or longitude-wise is not an option for you, going for Zarr might be a good choice. I guess we (I) need to find out how to handle Zarr files with our common tools (read: CDO) anyway, since it's emerging everywhere.
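For reference, a minimal sketch of the sub-section access pattern described above, assuming an xarray/netCDF4 stack; the file, variable and dimension names (tas, y, x) are placeholders.

```python
import xarray as xr

# Placeholder names: a 1100 x 1500 grid with a 30+ yr daily time series.
ds = xr.open_dataset("model_1km_norway.nc")

# Read the full time series for a single grid point into memory ...
point = ds["tas"].isel(y=550, x=750).load()

# ... or for one longitude band of 1500 grid points at a time.
band = ds["tas"].isel(y=550).load()

# If the on-disk chunking matches this access pattern (long in time, small in
# space), each .load() only touches the chunks it needs, not the whole file.
```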

@markpayneatwork
Contributor Author

You've hit the nail on the head here Andreas - chunking is the issue. NetCDF files can be chunked to work well with time-axis oriented problems (e.g. bias-correcting a time series at a given pixel), but unfortunately Python/xarray doesn't seem to support writing chunks properly at the moment - see e.g.:
pydata/xarray#8385
Zarr is one solution to this problem - chunking seems to work well there - but as you say, the loss of the standard tools is a major inconvenience.
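For the record, the mechanism in question is xarray's `encoding` argument to `to_netcdf()`. A rough sketch is below, with placeholder names and chunk sizes; whether the requested `chunksizes` are actually honoured on disk is exactly what pydata/xarray#8385 is about.

```python
import xarray as xr

# Placeholder names and chunk sizes. The chunksizes tuple must match the
# variable's dimension order (assumed here to be time, y, x).
ds = xr.open_dataset("tas_daily.nc")

encoding = {
    "tas": {
        "chunksizes": (ds.sizes["time"], 10, 10),  # long in time, small in space
        "zlib": True,
        "complevel": 4,
    }
}
ds.to_netcdf("tas_daily_rechunked.nc", encoding=encoding)
```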

@doblerone
Collaborator

I just put a comment there :-)
Had to rechunk a bunch of files recently and found Python to outperform nccopy etc.

@doblerone
Collaborator

PS: sitting in Malmö and waiting for the (delayed) night train to Hamburg. Thus reading and answering emails :-)

@markpayneatwork
Contributor Author

Funnily enough, I also found this Stack Overflow question last night that looks suspiciously similar to the example you gave in the xarray issue:
https://stackoverflow.com/questions/72893340/change-chunk-block-shape-in-netcdf-file

:-)

I have tried this out today and it appears to be working. If I can get NetCDF chunking to behave itself, it should, in theory, allow dask to solve the rest of the problems with bigger-than-memory datasets and therefore remove the need for Zarr. Let's see if it works...
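A rough sketch of what that dask-based workflow could look like: open the (correctly chunked) NetCDF with matching dask chunks so the computation stays out-of-core. Names, chunk sizes and the stand-in calculation are all illustrative.

```python
import xarray as xr

# Illustrative only: open a sensibly chunked NetCDF with matching dask chunks
# so the computation stays lazy and out-of-core.
ds = xr.open_dataset(
    "tas_daily_rechunked.nc",
    chunks={"time": -1, "y": 10, "x": 10},  # full time series per spatial block
)

# Stand-in for the real per-pixel processing: a monthly climatology computed
# block by block under dask, without loading the whole dataset into memory.
clim = ds["tas"].groupby("time.month").mean("time")
clim.to_netcdf("tas_monthly_climatology.nc")
```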

@doblerone
Collaborator

> Funnily enough, I also found this Stack Overflow question last night that looks suspiciously similar to the example you gave in the xarray issue: https://stackoverflow.com/questions/72893340/change-chunk-block-shape-in-netcdf-file
>
> :-)

It's also the same username ;-)
