open_mfdataset() significantly slower on 0.9.1 vs. 0.8.2 #1301
I've also noticed that we have a bottleneck here. @shoyer, any idea what we changed that could impact this? Could this be coming from a change upstream in dask?
Wow, that is pretty bad. Try setting [...]. If that doesn't help, try downgrading dask to see if it's responsible. Profiling results from [...] would also be useful.
This is what I'm seeing for my [...]
Hmm. It might be interesting to try [...]
My two cents: I've found that with big files any [...]
Indeed, this is highly recommended; see http://dask.pydata.org/en/latest/faq.html
I just tried this on a few different datasets. Comparing python 2.7, xarray 0.7.2, dask 0.7.1 (an old environment I had on hand) with python 2.7, xarray 0.9.1-28-g1cad803, dask 0.13.0 (my current "production" environment), I could not reproduce the slowdown. The up-to-date stack was faster, by a factor of less than 2.
Data: five files, approximately 450 MB each. I ran the same code as in the OP on two conda envs (venv1 and venv2) with the same version of dask but two different versions of xarray, and there was a significant difference in load time between them. I've posted the data on my work site if anyone wants to double-check: https://marine.rutgers.edu/~michaesm/netcdf/data/
There is definitely something funky with these datasets that is causing xarray to go very slow. This is fast:

```
>>> %time dsets = [xr.open_dataset(fname) for fname in glob('*.nc')]
CPU times: user 1.1 s, sys: 664 ms, total: 1.76 s
Wall time: 1.78 s
```

But even just trying to print the repr is slow:

```
>>> %time print(dsets[0])
CPU times: user 3.66 s, sys: 3.49 s, total: 7.15 s
Wall time: 7.28 s
```

Maybe some of this has to do with the change in 0.9.0 that allows index-less dimensions (i.e. coordinates are optional). All of these datasets have such a dimension, e.g. [...]
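Index-less dimensions are easy to reproduce with a tiny synthetic in-memory dataset (a sketch, not from the thread; the variable name is made up):

```python
import numpy as np
import xarray as xr

# Since xarray 0.9, a dimension ("obs" here) may exist without a
# coordinate variable backing it. Data is synthetic.
ds = xr.Dataset({"temperature": ("obs", np.random.rand(5))})

print(ds.sizes["obs"])   # 5
print(len(ds.indexes))   # 0: no index is attached to "obs"
```

`ds.indexes` is empty because nothing provides labels for `obs`, which is exactly the situation in these files.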
And the length of `obs` differs in each file:

```
>>> for myds in dsets:
...     print(myds.dims)
Frozen(SortedKeysDict({u'obs': 7537613}))
Frozen(SortedKeysDict({u'obs': 7247697}))
Frozen(SortedKeysDict({u'obs': 7497680}))
Frozen(SortedKeysDict({u'obs': 7661468}))
Frozen(SortedKeysDict({u'obs': 5750197}))
```
Looks like the issue might be that xarray 0.9.1 is decoding all timestamps on load. Timings (the version of dask is held constant in each test):

- xarray==0.9.1, dask==0.13.0: [...]
- xarray==0.8.2, dask==0.13.0: [...]
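The decoding step can be isolated with `xr.decode_cf` on synthetic data (a sketch, not the thread's actual files; the units string and sizes are made up):

```python
import numpy as np
import xarray as xr

# Raw time values as they would sit in an undecoded netCDF file:
# plain integers plus CF "units" metadata.
raw = xr.Dataset({"time": ("obs", np.arange(100_000, dtype="int64"))})
raw["time"].attrs["units"] = "seconds since 1900-01-01"

# decode_cf turns the integers into datetime64 values; with millions
# of observations per file, this conversion is where load time goes.
decoded = xr.decode_cf(raw)
print(decoded["time"].dtype)  # datetime64[ns]
```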
@rabernat This data is computed on demand from the OOI (http://oceanobservatories.org/cyberinfrastructure-technology/). Datasets can be massive, so they seem to be split into ~500 MB files when the data gets too big. That is why `obs` changes for each file. Would having `obs` be consistent across all files potentially make open_mfdataset faster?
My understanding is that you are concatenating across the variable [...]. My tests showed that it's not necessarily the concat step that is slowing this down. Your profiling suggests that it's a netCDF datetime decoding issue. I wonder if @shoyer or @jhamman have any ideas about how to improve performance here.
@friedrichknuth Did you try tests with the most recent version?
`decode_times=False` significantly reduces read time, but the proportional performance discrepancy between xarray 0.8.2 and 0.9.1 remains the same.
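A miniature version of that workaround, as a sketch (the file, sizes, and units are made up; the scipy backend is used here only so the example stays self-contained):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Build a tiny file with a CF-encoded time variable.
ds = xr.Dataset({"time": ("obs", np.arange(10, dtype="float64"))})
ds["time"].attrs["units"] = "seconds since 2010-01-01"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "sample.nc")
    ds.to_netcdf(path, engine="scipy")

    # decode_times=False returns the raw numbers and skips the
    # (expensive) datetime conversion at open time.
    with xr.open_dataset(path, decode_times=False, engine="scipy") as fast:
        raw_dtype = str(fast["time"].dtype)

print(raw_dtype)  # float64
```

The values can still be decoded later, once, with `xr.decode_cf`, which is why this is a workaround rather than a loss of functionality.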
@friedrichknuth, any chance you can take a look at this with the latest v0.10 release candidate?
Looks like it has been resolved! Tested with the latest pre-release v0.10.0rc2 on the dataset linked by najascutellatus above: https://marine.rutgers.edu/~michaesm/netcdf/data/

- xarray==0.10.0rc2-1-g8267fdb: [...]
- xarray==0.9.1: [...]
I noticed a big speed discrepancy between xarray versions 0.8.2 and 0.9.1 when using open_mfdataset() on a dataset ~1.2 GB in size, consisting of 3 files, with netcdf4 as the engine. 0.8.2 was run first, so this is probably not a disk-caching issue.

Results by environment:

- xarray==0.8.2, dask==0.11.1, netcdf4==1.2.4: [...]
- xarray==0.9.1, dask==0.13.0, netcdf4==1.2.4: [...]
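The comparison above boils down to wall-clock timing of a single call. A minimal stdlib helper for reproducing it (a sketch; IPython's `%time` reports the same idea):

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# With the issue's workload this would look like (paths hypothetical):
#   ds, secs = time_call(xr.open_mfdataset, "*.nc")
# A trivial stand-in so the snippet runs anywhere:
total, secs = time_call(sum, range(1_000_000))
print(total, round(secs, 4))
```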