RFC: Zarr as the default storage #131
Comments
Some thoughts from my side:
Bottom line: if reading the data point- or longitude-wise is not a requirement for you, going for zarr might be a good option. I guess we (I) need to find out how to handle zarr files with our common tools (read: CDO) anyway, since it's emerging everywhere.
You've hit the nail on the head here Andreas - chunking is the issue. NetCDF files can be chunked to work well with time-axis oriented problems (e.g. bias-correcting a time series at a given pixel), but unfortunately Python/xarray doesn't seem to support writing chunks properly at the moment - see e.g.:
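For reference, a minimal sketch of how time-oriented chunking can be requested from xarray's netCDF4 backend, assuming the variable's dimensions are ordered (time, lat, lon); the file paths and the variable name "tas" are hypothetical:

```python
import xarray as xr

# Hypothetical file and variable name, for illustration only.
ds = xr.open_dataset("input.nc")

# Ask the netCDF4 backend to store each pixel's full time series in one
# chunk: chunk length = whole time axis, 1 x 1 in space.
encoding = {"tas": {"chunksizes": (ds.sizes["time"], 1, 1)}}
ds.to_netcdf("output_chunked.nc", encoding=encoding)
```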
I just put a comment there :-)
PS: sitting in Malmö and waiting for the (delayed) night train to Hamburg. Thus reading and answering emails :-)
Funnily enough, I also found this stackoverflow question last night that looks suspiciously similar to the example you gave in the xarray issue: :-) I have tried this out today and it appears to be working. If I can get NetCDF chunking to behave itself, it should, in theory, allow dask to solve the rest of the problems with bigger-than-memory datasets and therefore remove the need for Zarr. Let's see if it works...
It's also the same username ;-)
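As an aside, a sketch of the dask route discussed above: opening the file lazily in chunks lets computations stream through the data rather than loading it all into memory. The file name and chunk sizes here are illustrative assumptions, not tuned values.

```python
import xarray as xr

# Illustrative file name and chunk sizes.
# chunks={"time": -1, ...} keeps the full time axis in one chunk per tile,
# which suits time-series operations like bias correction.
ds = xr.open_dataset("big_file.nc", chunks={"time": -1, "lat": 100, "lon": 100})

# Operations build a lazy dask graph; .compute() processes chunk by chunk.
climatology = ds.groupby("time.month").mean("time").compute()
```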
Proposal:
Use .zarr files as the default storage for climate variables. Indicators will still be stored as .nc.
Background
.zarr is a new file format that does many of the same things that NetCDF does, but is modernised. In particular, .zarr is designed to work better with very large datasets than NetCDF, particularly in cases where file size starts to exceed available memory. In recent developments in KAPy we have started hitting these problems, with file sizes in excess of 100 GB. .zarr may be a solution.
The proposal is to switch storage for climate variables to .zarr, potentially replacing the pickling option currently used to store xarray objects. NetCDF would be retained for storing indicators, as these are generally smaller and fit in memory.
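To make the proposal concrete, a sketch of what the switch might look like in practice; the paths and chunking scheme are assumptions for illustration, not settled choices.

```python
import xarray as xr

# Illustrative paths and chunking; not KAPy's actual configuration.
ds = xr.open_dataset("variable.nc")

# Rechunk and write to a Zarr store. Each chunk is stored as its own
# compressed object on disk, so writes and reads never need the whole
# array in memory at once.
ds.chunk({"time": 120}).to_zarr("variable.zarr", mode="w")

# Reading back is lazy by default: only the chunks touched by a
# computation are loaded.
ds_zarr = xr.open_zarr("variable.zarr")
```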
Advantages
Disadvantages