
Document using a spawning multiprocessing pool for multiprocessing with dask #1189


Closed
shoyer opened this issue Dec 29, 2016 · 3 comments
Labels: plan to close (May be closeable, needs more eyeballs) · topic-dask

Comments

@shoyer (Member) commented Dec 29, 2016

This is a nice option for working with in-file HDF5/netCDF4 compression:
#1128 (comment)
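For reference, modern dask lets you select the worker start method through its configuration system (a sketch against today's `dask.config` API; at the time of this issue the equivalent would have involved passing an explicitly constructed pool to the old `dask.set_options`):

```python
import multiprocessing

import dask

# Ask dask's multiprocessing ("processes") scheduler to start workers with
# the "spawn" method, so each worker begins from a fresh interpreter with no
# inherited HDF5/netCDF4 library state.
dask.config.set({
    "scheduler": "processes",
    "multiprocessing.context": "spawn",
})

# A spawn-context pool can also be built by hand from the stdlib:
spawn_ctx = multiprocessing.get_context("spawn")
```

Spawned workers re-import the main module, so any script using them needs the usual `if __name__ == "__main__":` guard around the code that dispatches work.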

Mixed multi-threading/multi-processing could also be interesting, if anyone wants to revive that: dask/dask#457 (I think it would work now that xarray data stores are picklable)

CC @mrocklin

@mrocklin (Contributor)

Can you remind me of the motivation for using a spawning multiprocessing pool instead of a fork or forkserver solution?

For mixed multi-threading/multi-processing would a local "distributed" scheduler suffice? This would be several single-threaded processes on a single machine. The scheduler would be aware of data locality and avoid inter-node communication when possible.
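The local "distributed" setup described above can be requested directly from `Client` (a sketch; `n_workers` and `threads_per_worker` are the relevant parameters, and `processes=False` is used here only so the example runs in-process without needing a `__main__` guard — with `processes=True` each worker would be its own single-threaded process, as suggested):

```python
import dask.array as da
from dask.distributed import Client

# Several single-threaded workers on one machine; the scheduler tracks which
# worker holds which intermediate result and schedules tasks accordingly.
client = Client(processes=False, n_workers=4, threads_per_worker=1)

x = da.ones((8, 8), chunks=(4, 4))
total = x.sum().compute()  # 64.0
client.close()
```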

@shoyer (Member, Author) commented Dec 29, 2016

Actually, I just tested it and it appears that forking also works, as long as you create the pool before opening any files. Otherwise, the netCDF library crashes (#1128 (comment)).
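The ordering constraint can be illustrated with the stdlib pool alone (a sketch; `square` is a placeholder task, and in the real workflow the files opened after step 1 would be netCDF/HDF5 datasets):

```python
import multiprocessing

def square(x):
    return x * x

# 1. Fork the worker pool first, while no HDF5/netCDF file handles exist,
#    so the children inherit no open-file state from the parent process.
ctx = multiprocessing.get_context("fork")
with ctx.Pool(2) as pool:
    # 2. Only now would files be opened in the parent; work dispatched to
    #    the already-forked workers then runs without crashing the library.
    results = pool.map(square, [1, 2, 3])

print(results)  # [1, 4, 9]
```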

A local "distributed" scheduler might indeed also work, but at least when operating on a single machine, it makes sense to bring all the data into a single process once it has been loaded, for multi-threaded data analysis.

@mrocklin (Contributor)

Dask.distributed now creates a forkserver at startup. This seems to be working well so far; it nicely balances a well-defined environment with fast startup time.

How much inter-worker data transfer would you expect? It might be worth running through a few classic algorithms with it instead of the threaded scheduler and looking at performance changes. The diagnostic pages would be a nice bonus here and might help to highlight some performance issues.

If anyone is interested in this, the thing to do is:

$ conda install -c conda-forge dask distributed

>>> from dask.distributed import Client
>>> c = Client()  # sets global scheduler by default

And then operate as normal.
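"Operate as normal" means that once the client exists, subsequent dask collection computations route through it automatically (a sketch; `processes=False` is an assumption made here only to keep the example self-contained in one process):

```python
import dask.array as da
from dask.distributed import Client

client = Client(processes=False)  # registers itself as the default scheduler

x = da.arange(10, chunks=5)
total = x.sum().compute()  # executed by the distributed scheduler: 45
client.close()
```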

@max-sixty added the "plan to close" label on Dec 2, 2023

4 participants