
Writing a netCDF file is slow #6920

Closed
lassiterdc opened this issue Aug 16, 2022 · 3 comments


lassiterdc commented Aug 16, 2022

What is your issue?

This has been discussed in another thread, but the proposed solution there (first calling .load() to bring the dataset into memory before running to_netcdf) does not work for me since my dataset is too large to fit in memory. The following code takes around 8 hours to run. You'll notice that I tried both xr.open_mfdataset and xr.concat in case it made a difference, but it doesn't. I also tried profiling the code following this example. The results are in this html (dropbox link), but I'm not really sure what I'm looking at.
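For context, the suggestion from that thread boils down to something like the following sketch (my paraphrase, with illustrative names); it fails here because .load() requires the entire combined dataset to fit in RAM:

import xarray as xr

ds = xr.open_mfdataset("data/*.nc", combine="nested", concat_dim="time")
ds.load()                    # pulls every dask chunk into memory at once
ds.to_netcdf("combined.nc")  # the write is fast once the data is in memory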

Data: dropbox link to 717 netCDF files containing radar rainfall data for 6/28/2014 over the United States, around 1 GB in total.

Code:

#%% Import libraries
import xarray as xr
from glob import glob
import pandas as pd
import time
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False})

files = glob("data/*.nc")
#%% functions
def extract_file_timestep(fname):
    """Parse the timestamp embedded in a filename like *_YYYYmmddHHMM.nc."""
    fname = fname.split('/')[-1]      # strip the directory
    fname = fname.split(".")
    ftype = fname.pop(-1)             # file extension: "nc" or "grib2"
    fname = ''.join(fname)
    str_tstep = fname.split("_")[-1]  # timestamp is the last underscore-separated field
    if ftype == "nc":
        date_format = '%Y%m%d%H%M'
    elif ftype == "grib2":
        date_format = '%Y%m%d-%H%M%S'

    tstep = pd.to_datetime(str_tstep, format=date_format)

    return tstep

def ds_preprocessing(ds):
    # attach the file's timestamp as a length-1 "time" dimension,
    # standardize variable names, and align the dask chunks
    tstamp = extract_file_timestep(ds.encoding['source'])
    ds.coords["time"] = tstamp
    ds = ds.expand_dims({"time":1})
    ds = ds.rename({"lon":"longitude", "lat":"latitude", "mrms_a2m":"rainrate"})
    ds = ds.chunk(chunks={"latitude":3500, "longitude":7000, "time":1})
    return ds

#%% Loading and formatting data
lst_ds = []
start_time = time.time()
for f in files:
    ds = xr.open_dataset(f, chunks={"latitude":3500, "longitude":7000})
    ds = ds_preprocessing(ds)
    lst_ds.append(ds)

ds_comb_frm_lst = xr.concat(lst_ds, dim="time")
print("Time to load dataset using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset = xr.open_mfdataset(files, chunks={"latitude":3500, "longitude":7000},
                                               concat_dim="time", preprocess=ds_preprocessing, combine="nested")
print("Time to load dataset using open_mfdataset: {}".format(time.time() - start_time))
#%% exporting to netcdf
start_time = time.time()
ds_comb_frm_lst.to_netcdf("ds_comb_frm_lst.nc", encoding={"rainrate":{"zlib":True}})
print("Time to export dataset created using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset.to_netcdf("ds_comb_frm_open_mfdataset.nc", encoding={"rainrate":{"zlib":True}})
print("Time to export dataset created using open_mfdataset: {}".format(time.time() - start_time))
lassiterdc added the needs triage label Aug 16, 2022
andersy005 (Member) commented Aug 16, 2022

@lassiterdc, writing a large, chunked xarray dataset to a netCDF file is always a challenge and quite slow, since the write is serial. However, you could take advantage of the xr.save_mfdataset() function to write to multiple netCDF files. Here's a good example that showcases how to achieve this: https://ncar.github.io/esds/posts/2020/writing-multiple-netcdf-files-in-parallel-with-xarray-and-dask
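A rough sketch of what that can look like (untested; the per-timestep split and output paths are just for illustration, reusing the names from your snippet):

import xarray as xr

ds = ds_comb_frm_open_mfdataset  # the lazily combined dataset from above

# one output file per timestep; any other split along "time" works too
datasets = [ds.isel(time=slice(i, i + 1)) for i in range(ds.sizes["time"])]
paths = ["out/rainrate_{:04d}.nc".format(i) for i in range(ds.sizes["time"])]

# each file becomes an independent write task that dask can run in parallel
xr.save_mfdataset(datasets, paths)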

dcherian added the topic-documentation label and removed the needs triage label Aug 16, 2022
lassiterdc (Author) commented Aug 16, 2022

Thanks, @andersy005. I think xr.save_mfdataset() could certainly be helpful in my workflow, but unfortunately I have to consolidate these data from one netCDF per 2-minute timestep into one netCDF per day, so it sounds like there's no way around that bottleneck. I've come across suggestions to save the dataset to a zarr group and then export it as a netCDF, so I'm going to give that a shot.
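Roughly what I have in mind (untested sketch; paths and names are placeholders):

import xarray as xr

# write to a zarr store first -- zarr supports parallel, chunk-wise writes
ds_comb_frm_open_mfdataset.to_zarr("ds_comb.zarr", mode="w")

# then read it back lazily and serialize to a single netCDF
ds = xr.open_zarr("ds_comb.zarr")
ds.to_netcdf("ds_comb_frm_zarr.nc", encoding={"rainrate":{"zlib":True}})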

andersy005 (Member) commented Aug 16, 2022

Great... keep us posted once you have a working solution.

I'm going to convert this issue into a discussion instead.

pydata locked and limited conversation to collaborators Aug 16, 2022
andersy005 converted this issue into discussion #6921 Aug 16, 2022
