
Writing a netCDF file is slow #6920

Closed
lassiterdc opened this issue Aug 16, 2022 · 3 comments


lassiterdc commented Aug 16, 2022

What is your issue?

This has been discussed in another thread, but the proposed solution there (first calling .load() to bring the dataset into memory before running to_netcdf) does not work for me since my dataset is too large to fit in memory. The following code takes around 8 hours to run. You'll notice that I tried both xr.open_mfdataset and xr.concat in case it made a difference, but it doesn't. I also tried profiling the code following this example. The results are in this html (dropbox link), but I'm not really sure what I'm looking at.
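For context, the suggestion from that thread boils down to something like the following sketch (my paraphrase, with illustrative names); it fails here because .load() requires the entire combined dataset to fit in RAM:

import xarray as xr

ds = xr.open_mfdataset("data/*.nc", combine="nested", concat_dim="time")
ds.load()                    # pulls every dask chunk into memory at once
ds.to_netcdf("combined.nc")  # the write is fast once the data is in memory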

Data: dropbox link to 717 netCDF files containing radar rainfall data for 6/28/2014 over the United States, around 1 GB in total.

Code:

#%% Import libraries
import xarray as xr
from glob import glob
import pandas as pd
import time
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})

files = glob("data/*.nc")
#%% functions
def extract_file_timestep(fname):
    """Parse the timestamp embedded in a filename like *_YYYYmmddHHMM.nc."""
    fname = fname.split('/')[-1]      # strip the directory
    fname = fname.split(".")
    ftype = fname.pop(-1)             # file extension: "nc" or "grib2"
    fname = ''.join(fname)
    str_tstep = fname.split("_")[-1]  # timestamp is the last underscore-separated field
    if ftype == "nc":
        date_format = '%Y%m%d%H%M'
    elif ftype == "grib2":
        date_format = '%Y%m%d-%H%M%S'

    tstep = pd.to_datetime(str_tstep, format=date_format)

    return tstep

def ds_preprocessing(ds):
    # attach the file's timestamp as a length-1 "time" dimension,
    # standardize variable names, and align the dask chunks
    tstamp = extract_file_timestep(ds.encoding['source'])
    ds.coords["time"] = tstamp
    ds = ds.expand_dims({"time":1})
    ds = ds.rename({"lon":"longitude", "lat":"latitude", "mrms_a2m":"rainrate"})
    ds = ds.chunk(chunks={"latitude":3500, "longitude":7000, "time":1})
    return ds

#%% Loading and formatting data
lst_ds = []
start_time = time.time()
for f in files:
    ds = xr.open_dataset(f, chunks={"latitude":3500, "longitude":7000})
    ds = ds_preprocessing(ds)
    lst_ds.append(ds)

ds_comb_frm_lst = xr.concat(lst_ds, dim="time")
print("Time to load dataset using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset = xr.open_mfdataset(files, chunks={"latitude":3500, "longitude":7000},
                                               concat_dim="time", preprocess=ds_preprocessing, combine="nested")
print("Time to load dataset using open_mfdataset: {}".format(time.time() - start_time))
#%% exporting to netcdf
start_time = time.time()
ds_comb_frm_lst.to_netcdf("ds_comb_frm_lst.nc", encoding={"rainrate":{"zlib":True}})
print("Time to export dataset created using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset.to_netcdf("ds_comb_frm_open_mfdataset.nc", encoding={"rainrate":{"zlib":True}})
print("Time to export dataset created using open_mfdataset: {}".format(time.time() - start_time))
lassiterdc added the needs triage label Aug 16, 2022
andersy005 (Member) commented Aug 16, 2022

@lassiterdc, writing a large, chunked xarray dataset to a netCDF file is always a challenge and quite slow, since the write is serial. However, you could take advantage of the xr.save_mfdataset() function to write to multiple netCDF files. Here's a good example that showcases how to achieve this: https://ncar.github.io/esds/posts/2020/writing-multiple-netcdf-files-in-parallel-with-xarray-and-dask
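A rough sketch of what that can look like (untested; the per-timestep split and output paths are just for illustration, reusing the names from your snippet):

import xarray as xr

ds = ds_comb_frm_open_mfdataset  # the lazily combined dataset from above

# one output file per timestep; any other split along "time" works too
datasets = [ds.isel(time=slice(i, i + 1)) for i in range(ds.sizes["time"])]
paths = ["out/rainrate_{:04d}.nc".format(i) for i in range(ds.sizes["time"])]

# each file becomes an independent write task that dask can run in parallel
xr.save_mfdataset(datasets, paths)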

dcherian added the topic-documentation label and removed the needs triage label Aug 16, 2022
lassiterdc (Author) commented Aug 16, 2022

Thanks, @andersy005. I think xr.save_mfdataset() could certainly be helpful in my workflow, but unfortunately I have to consolidate these data from one netCDF per 2-minute timestep into one netCDF per day, so it sounds like there's no way around that bottleneck. I've come across suggestions to save the dataset to a zarr group and then export it as a netCDF, so I'm going to give that a shot.
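Roughly what I have in mind (untested sketch; paths and names are placeholders):

import xarray as xr

# write to a zarr store first -- zarr supports parallel, chunk-wise writes
ds_comb_frm_open_mfdataset.to_zarr("ds_comb.zarr", mode="w")

# then read it back lazily and serialize to a single netCDF
ds = xr.open_zarr("ds_comb.zarr")
ds.to_netcdf("ds_comb_frm_zarr.nc", encoding={"rainrate":{"zlib":True}})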

andersy005 (Member) commented Aug 16, 2022

Great... keep us posted once you have a working solution.

I'm going to convert this issue into a discussion instead.

pydata locked and limited conversation to collaborators Aug 16, 2022
andersy005 converted this issue into discussion #6921 Aug 16, 2022
