This has been discussed in another thread, but the proposed solution there (first .load() the dataset into memory before running to_netcdf) does not work for me since my dataset is too large to fit into memory. The following code takes around 8 hours to run. You'll notice that I tried both xr.open_mfdataset and xr.concat in case it would make a difference, but it didn't. I also tried profiling the code according to this example. The results are in this HTML file (dropbox link), but I'm not really sure what I'm looking at.
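For context, the workaround suggested in that other thread amounts to something like the sketch below (file paths are placeholders, not my actual code); it isn't an option here because the combined dataset doesn't fit in RAM.

import xarray as xr

# Workaround from the other thread: materialize everything in memory first,
# then write a single netCDF. Not feasible when the data exceed available RAM.
ds = xr.open_mfdataset("data/*.nc", combine="nested", concat_dim="time")
ds.load()                     # pull all dask chunks into memory
ds.to_netcdf("combined.nc")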
Data: dropbox link to 717 netCDF files (around 1 GB in total) containing radar rainfall data for 6/28/2014 over the United States.
Code:
#%% Import libraries
import xarray as xr
from glob import glob
import pandas as pd
import time
import dask

dask.config.set(**{'array.slicing.split_large_chunks': False})

files = glob("data/*.nc")

#%% functions
def extract_file_timestep(fname):
    fname = fname.split('/')[-1]
    fname = fname.split(".")
    ftype = fname.pop(-1)
    fname = ''.join(fname)
    str_tstep = fname.split("_")[-1]
    if ftype == "nc":
        date_format = '%Y%m%d%H%M'
    if ftype == "grib2":
        date_format = '%Y%m%d-%H%M%S'
    tstep = pd.to_datetime(str_tstep, format=date_format)
    return tstep

def ds_preprocessing(ds):
    tstamp = extract_file_timestep(ds.encoding['source'])
    ds.coords["time"] = tstamp
    ds = ds.expand_dims({"time": 1})
    ds = ds.rename({"lon": "longitude", "lat": "latitude", "mrms_a2m": "rainrate"})
    ds = ds.chunk(chunks={"latitude": 3500, "longitude": 7000, "time": 1})
    return ds

#%% Loading and formatting data
lst_ds = []
start_time = time.time()
for f in files:
    ds = xr.open_dataset(f, chunks={"latitude": 3500, "longitude": 7000})
    ds = ds_preprocessing(ds)
    lst_ds.append(ds)
ds_comb_frm_lst = xr.concat(lst_ds, dim="time")
print("Time to load dataset using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset = xr.open_mfdataset(files, chunks={"latitude": 3500, "longitude": 7000},
                                               concat_dim="time", preprocess=ds_preprocessing, combine="nested")
print("Time to load dataset using open_mfdataset: {}".format(time.time() - start_time))

#%% exporting to netcdf
start_time = time.time()
ds_comb_frm_lst.to_netcdf("ds_comb_frm_lst.nc", encoding={"rainrate": {"zlib": True}})
print("Time to export dataset created using concat on list of datasets: {}".format(time.time() - start_time))

start_time = time.time()
ds_comb_frm_open_mfdataset.to_netcdf("ds_comb_frm_open_mfdataset.nc", encoding={"rainrate": {"zlib": True}})
print("Time to export dataset created using open_mfdataset: {}".format(time.time() - start_time))
Thanks, @andersy005. I think that xr.save_mfdataset() could certainly be helpful in my workflow, but unfortunately I have to consolidate these data from one netCDF per 2-minute timestep into one netCDF per day, and it sounds like there's no way around that bottleneck. I've come across suggestions to save the dataset to a zarr group and then export it as a netCDF, so I'm going to give that a shot.
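Roughly what I have in mind (just a sketch; the store and file names are placeholders, and this assumes the zarr write completes where the direct netCDF write stalls):

import xarray as xr

# Write the lazily-combined dataset to a zarr store first; zarr writes
# chunk-by-chunk, which tends to play more nicely with dask-backed data ...
ds_comb_frm_open_mfdataset.to_zarr("ds_comb.zarr", mode="w")

# ... then reopen the zarr store and export a single daily netCDF.
ds_zarr = xr.open_zarr("ds_comb.zarr")
ds_zarr.to_netcdf("ds_comb_from_zarr.nc", encoding={"rainrate": {"zlib": True}})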