Parallel open_mfdataset #1983
Conversation
elif engine == 'pynio':
    pytest.importorskip('Nio')
else:
    pytest.importorskip(engine)
@maxim-lian - do you know a better way to skip parameterized tests based on some condition?
You might include the engine as a fixture. This is how we do it with Parquet in dask.dataframe:

@pytest.fixture(params=[pytest.mark.skipif(not fastparquet, 'fastparquet',
                                           reason='fastparquet not found'),
                        pytest.mark.skipif(not pq, 'pyarrow',
                                           reason='pyarrow not found')])
def engine(request):
    return request.param
xarray/tests/test_backends.py
Outdated
@requires_scipy
def test_2_autoclose_scipy(self):
    self.validate_open_mfdataset_autoclose(engine=['scipy'])

@pytest.fixture(params=[1, 10, 1000])
It would also be nice to mark the 1000 entry here as slow.
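For instance, something along these lines should work (a rough sketch: the fixture name is a guess, and it assumes a ``slow`` marker is registered for the test suite):

```python
import pytest

# Hypothetical sketch: wrap the largest parameter in pytest.param so it can be
# deselected with ``-m "not slow"`` (assumes a registered ``slow`` marker).
@pytest.fixture(params=[1, 10, pytest.param(1000, marks=pytest.mark.slow)])
def nfiles(request):
    return request.param
```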
xarray/backends/api.py
Outdated
autoclose=autoclose, **kwargs)

if parallel:
    import dask.bag as db
This is a minor point, but I find dask.delayed() slightly more general and easier to understand than dask.bag.
I admit I don't fully understand the tradeoffs between these two approaches.
They are basically equivalent. The only meaningful difference I can think of, aside from the API, is that dask.delayed has fewer required dependencies (a subset of those required for dask.array):
https://github.com/dask/dask/blob/47d025476312c6a8155192196b8da3b5af0c13e4/setup.py#L10-L17
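For comparison, a rough sketch of the two spellings side by side (the paths and keyword arguments below are placeholders, not the PR's actual code):

```python
import dask
import dask.bag as db
from xarray import open_dataset

paths = ['file0.nc', 'file1.nc']   # placeholder file names
open_kwargs = dict(chunks={})      # placeholder keyword arguments

# dask.bag spelling: build a bag of paths and map open_dataset over it
datasets = db.from_sequence(paths).map(open_dataset, **open_kwargs).compute()

# dask.delayed spelling: wrap open_dataset and compute the list of delayed calls
open_ = dask.delayed(open_dataset)
datasets, = dask.compute([open_(p, **open_kwargs) for p in paths])
```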
xarray/backends/api.py
Outdated
import dask.bag as db
paths_bag = db.from_sequence(paths)
datasets = paths_bag.map(open_dataset,
                         **open_kwargs).compute()
What exactly does compute() do when called on a dask bag of xarray datasets of dask arrays? I'm assuming that only the bags get computed, but not the array values? This would be good to clarify in some comments on the code.
That is my understanding. In other words, delayed(open_dataset)(...).compute() is going to return whatever open_dataset would have. I have not tested this thoroughly and I would be interested to hear from @mrocklin on how dask is going to treat this situation.
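A small sketch of the behaviour being described ('example.nc' is a placeholder): computing the delayed wrapper only runs ``open_dataset``, and with ``chunks`` passed the variables in the returned dataset remain lazy dask arrays.

```python
import dask
import xarray as xr

delayed_ds = dask.delayed(xr.open_dataset)('example.nc', chunks={})  # placeholder path
ds = delayed_ds.compute()            # the same Dataset that open_dataset would return
for name, var in ds.data_vars.items():
    print(name, type(var.data))      # dask.array.core.Array -- values still not loaded
```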
xarray/backends/api.py
Outdated
open_kwargs = dict(engine=engine, chunks=chunks or {}, lock=lock,
                   autoclose=autoclose, **kwargs)

if parallel:
Consider making the default parallel=None and using auto-detection logic for setting parallel=True if using dask-distributed.
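A hedged sketch of what that auto-detection might look like (an illustration of the suggestion only, not code from this PR):

```python
def _resolve_parallel(parallel):
    """Resolve parallel=None to True only when a dask.distributed client is active."""
    if parallel is not None:
        return parallel
    try:
        from distributed import default_client
        default_client()  # raises ValueError when no client has been created
        return True
    except (ImportError, ValueError):
        return False
```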
@shoyer - I updated this to use dask.delayed. I actually like it more because I only have to call compute once. Thanks for the suggestion.
xarray/backends/api.py
Outdated
# calling compute here will return the datasets/file_objs lists;
# the underlying datasets will still be stored as dask arrays
dask.compute([datasets, file_objs])
Don't you need to assign these computed results, e.g., datasets, file_objs = dask.compute([datasets, file_objs])?
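A minimal illustration of the point, with toy delayed objects standing in for the real datasets and file objects:

```python
import dask

datasets = [dask.delayed(lambda i=i: i)() for i in range(3)]
file_objs = [dask.delayed(lambda i=i: -i)() for i in range(3)]

# dask.compute returns new, computed objects rather than mutating its inputs,
# so the results do need to be assigned back.
datasets, file_objs = dask.compute(datasets, file_objs)
print(datasets, file_objs)  # [0, 1, 2] [0, -1, -2]
```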
If anyone understands Windows file handling with Python, I'm all ears as to why this is failing on AppVeyor. I'm tempted to just skip this test there but thought I should ask for help first...
I've skipped the offending test on AppVeyor for now. Objectors speak up please. I don't have a Windows machine to test on and iterating via AppVeyor is not something a sane person does 😉.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems good to me!
It's interesting to see that opening metadata is sufficiently painful to engage dask here. Presumably this cost will only grow in time.
xarray/backends/api.py
Outdated
parallel : bool, optional
    If True, the open and preprocess steps of this function will be performed
    in parallel using ``dask.delayed``. Default is False unless dask's
    distributed scheduler is being used, in which case the default is True.
Have you tested this with both a local system and an HPC cluster?
I'm curious about the logic of defaulting to parallel when using distributed.
Maybe a better choice would be to default to True when using distributed or multiprocessing? That would probably be a simpler choice.
Actually, this is probably fine even when using multi-threading -- it just won't give you any noticeable speedup. So I guess we could default to parallel=True.
I have. See below for a simple example using this feature on Cheyenne.

In [1]: import xarray as xr
   ...: import glob

In [2]: pattern = '/glade/u/home/jhamman/workdir/LOCA_daily/met_data/CESM1-BGC/16th/rcp45/r1i1p1/*/*nc'

In [3]: len(glob.glob(pattern))
Out[3]: 285

In [4]: %time ds = xr.open_mfdataset(pattern)
CPU times: user 15.5 s, sys: 2.62 s, total: 18.1 s
Wall time: 42.4 s

In [5]: ds.close()

In [6]: %time ds = xr.open_mfdataset(pattern, parallel=True)
CPU times: user 18.4 s, sys: 5.28 s, total: 23.6 s
Wall time: 30.7 s

In [7]: ds.close()

In [8]: from dask.distributed import Client

In [9]: client = Client()

In [10]: client
Out[10]: <Client: scheduler='tcp://127.0.0.1:39853' processes=72 cores=72>

In [11]: %time ds = xr.open_mfdataset(pattern, parallel=True, autoclose=True)
CPU times: user 10.8 s, sys: 808 ms, total: 11.6 s
Wall time: 12.4 s
This looks nice and clean. It's just hard for me to tell whether your test is actually covering the parallel case explicitly.
xarray/tests/test_backends.py
Outdated
# check that calculation on opened datasets works properly
actual = open_mfdataset(tmpfiles, engine=readengine,
                        autoclose=autoclose,
                        chunks=chunks)
Don't you want to have a test that explicitly calls open_mfdataset(parallel=parallel)?
nice catch :)
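For reference, a sketch of the kind of test being asked for (the names, fixtures, and keyword arguments here are illustrative and assume a recent xarray with a netCDF backend installed, not the test that was ultimately added):

```python
import numpy as np
import pytest
import xarray as xr

@pytest.mark.parametrize('parallel', [False, True])
def test_open_mfdataset_parallel(tmp_path, parallel):
    # write a few small files, then reopen them with parallel set explicitly
    paths = []
    for i in range(3):
        path = tmp_path / f'file{i}.nc'
        xr.Dataset({'x': ('t', np.arange(5) + 5 * i)}).to_netcdf(path)
        paths.append(str(path))
    with xr.open_mfdataset(paths, concat_dim='t', combine='nested',
                           parallel=parallel) as ds:
        assert ds['x'].shape == (15,)
```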
…ray into feature/parallel_open_netcdf
My reason for suggesting default
Can we imagine cases where it might actually degrade performance?
I imagine there will be a small performance cost when the number of files is small. That cost is probably lost in the noise of most I/O operations.
All the tests are passing here? Any final objectors?
doc/whats-new.rst
Outdated
opening many files, particularly when used in conjunction with
``dask.distributed`` (:issue:`1981`).
By `Joe Hamman <https://github.com/jhamman>`_.
- Some speed improvement to construct :py:class:`~xarray.DataArrayRolling`
This item ended up duplicated when you merged in master.
xarray/backends/api.py
Outdated
# wrap the open_dataset, getattr, and preprocess with delayed
import dask
parallel = True
_open = dask.delayed(open_dataset)
nit: per PEP8, we usually post-fix with an underscore for variables named to avoid conflicts with builtins, e.g., open_ rather than _open.
Thanks, wasn't aware of this detail.
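A one-line illustration of that convention (imports assumed only for context):

```python
import dask
import xarray as xr

# a trailing underscore avoids shadowing the builtin ``open`` without implying
# a private/internal name the way a leading underscore does
open_ = dask.delayed(xr.open_dataset)
```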
I recently tried this branch with my data server and got an error. I opened a dataset this way:

# works fine with parallel=False
ds = xr.open_mfdataset(os.path.join(ddir, '*V1_1.204*.nc'), decode_cf=False, parallel=True)

and got the following error. Without the distributed scheduler (but with ``parallel=True``) it works. Any idea what could be going on? (Sorry for the non-reproducible bug report... I figured some trials "in the field" might be useful.)
@rabernat - my last commit(s) seem to have broken the CI so I'll need to revisit this.
@rabernat - I just pushed a few more commits here. Can I ask two questions:
1. When using the distributed scheduler, what configuration are you using?
2. Can you try:
If this turns out to be a corner case with the distributed scheduler, I can add an integration test for that specific use case.
xarray/tests/test_backends.py
Outdated
                           reason='requires h5netcdf'),
    pytest.mark.skipif(not has_pynio, 'pynio', reason='requires pynio')])
def readengine(request):
    return request.param
Can anyone guess as to why the pynio tests are not being properly skipped using this fixture configuration?
@mrocklin - I think you suggested this approach, which I like, but something weird is happening.
I should note, this is working on python 2.7 and 3.4. Just not for 3.5 or 3.6.
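One guess, offered as an assumption rather than a confirmed diagnosis: newer pytest releases no longer honour marks applied directly as fixture parameter values, and the supported spelling wraps each value in ``pytest.param``:

```python
import pytest

try:
    import Nio  # noqa: F401
    has_pynio = True
except ImportError:
    has_pynio = False

@pytest.fixture(params=[
    pytest.param('pynio',
                 marks=pytest.mark.skipif(not has_pynio, reason='requires pynio')),
])
def readengine(request):
    return request.param
```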
@rabernat - I got the tests passing here again. If you can make the time to try your example/test again, it would be great to figure out what wasn't working before.
Thanks @jhamman for working on this! I did a test on my real-world data (1202 ~3 MB files) on my local computer and am not getting the results I expected. Am I missing something?

nc_files = glob.glob(E.obs['NSIDC_0081']['sipn_nc']+'/*.nc')
print(len(nc_files))
1202

# Parallel False
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=False, autoclose=True)
CPU times: user 57.8 s, sys: 3.2 s, total: 1min 1s
Wall time: 1min

# Parallel True with default scheduler
%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 1min 16s, sys: 9.82 s, total: 1min 26s
Wall time: 1min 16s

# Parallel True with distributed
from dask.distributed import Client
client = Client()
print(client)
<Client: scheduler='tcp://127.0.0.1:43291' processes=16 cores=16>

%time ds = xr.open_mfdataset(nc_files, concat_dim='time', parallel=True, autoclose=True)
CPU times: user 2min 17s, sys: 12.3 s, total: 2min 29s
Wall time: 3min 48s

On feature/parallel_open_netcdf commit 280a46f
@NicWayand - Thanks for giving this a go. Some thoughts on your problem... I have been using this feature for the past few days and have been seeing a speedup on datasets with many files along the lines of what I showed above. I am applying my tests on perhaps the perfect test architecture (parallel shared fs, fast interconnect, etc.). I think there are many reasons/cases where this won't work as well.
It sounds like the right resolution for now would be to leave the default as ``parallel=False``.
I think that makes sense for now. We need to experiment with this a bit more but I don't see a problem merging the basic workflow we have now (with a minor change to the default behavior).
With my last commits here, this feature is completely optional and defaults to the current behavior. I cleaned up the tests a bit further and am now ready to merge this. Barring any objections, I'll merge this on Friday.
- whats-new.rst for all changes and api.rst for new API

I'm sharing this in the hopes of getting comments from @mrocklin and @pydata/xarray.
What this does:

- dask.bag map/apply on the xarray open_dataset and preprocess steps in open_mfdataset
- a parallel option to open_mfdataset

What it does not do (yet):

- autoclose=True when multiple processes are being used (multiprocessing/distributed scheduler)

Benchmark Example
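For orientation only, a minimal usage sketch of the new option (the glob pattern is a placeholder; this is not the benchmark referred to above):

```python
import xarray as xr

# parallel=True routes the per-file open_dataset/preprocess calls through
# dask.delayed instead of a serial Python loop
ds = xr.open_mfdataset('path/to/files/*.nc', parallel=True)
```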