
Document using a spawning multiprocessing pool for multiprocessing with dask #1189


Closed
shoyer opened this issue Dec 29, 2016 · 3 comments
Labels: plan to close (May be closeable, needs more eyeballs) · topic-dask

Comments

@shoyer (Member) commented Dec 29, 2016

This is a nice option for working with in-file HDF5/netCDF4 compression:
#1128 (comment)
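For reference, modern dask lets you select the worker start method through its configuration system (a sketch against today's `dask.config` API; at the time of this issue the equivalent would have involved passing an explicitly constructed pool to the old `dask.set_options`):

```python
import multiprocessing

import dask

# Ask dask's multiprocessing ("processes") scheduler to start workers with
# the "spawn" method, so each worker begins from a fresh interpreter with no
# inherited HDF5/netCDF4 library state.
dask.config.set({
    "scheduler": "processes",
    "multiprocessing.context": "spawn",
})

# A spawn-context pool can also be built by hand from the stdlib:
spawn_ctx = multiprocessing.get_context("spawn")
```

Spawned workers re-import the main module, so any script using them needs the usual `if __name__ == "__main__":` guard around the code that dispatches work.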

Mixed multi-threading/multi-processing could also be interesting, if anyone wants to revive that: dask/dask#457 (I think it would work now that xarray data stores are picklable)

CC @mrocklin

@mrocklin (Contributor)

Can you remind me of the motivation for using a spawning multiprocessing pool instead of a fork or forkserver solution?

For mixed multi-threading/multi-processing would a local "distributed" scheduler suffice? This would be several single-threaded processes on a single machine. The scheduler would be aware of data locality and avoid inter-node communication when possible.
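The local "distributed" setup described above can be requested directly from `Client` (a sketch; `n_workers` and `threads_per_worker` are the relevant parameters, and `processes=False` is used here only so the example runs in-process without needing a `__main__` guard — with `processes=True` each worker would be its own single-threaded process, as suggested):

```python
import dask.array as da
from dask.distributed import Client

# Several single-threaded workers on one machine; the scheduler tracks which
# worker holds which intermediate result and schedules tasks accordingly.
client = Client(processes=False, n_workers=4, threads_per_worker=1)

x = da.ones((8, 8), chunks=(4, 4))
total = x.sum().compute()  # 64.0
client.close()
```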

@shoyer (Member, Author) commented Dec 29, 2016

Actually, I just tested it and it appears that forking also works, as long as you create the pool before opening any files. Otherwise, the netCDF library crashes (#1128 (comment)).
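The ordering constraint can be illustrated with the stdlib pool alone (a sketch; `square` is a placeholder task, and in the real workflow the files opened after step 1 would be netCDF/HDF5 datasets):

```python
import multiprocessing

def square(x):
    return x * x

# 1. Fork the worker pool first, while no HDF5/netCDF file handles exist,
#    so the children inherit no open-file state from the parent process.
ctx = multiprocessing.get_context("fork")
with ctx.Pool(2) as pool:
    # 2. Only now would files be opened in the parent; work dispatched to
    #    the already-forked workers then runs without crashing the library.
    results = pool.map(square, [1, 2, 3])

print(results)  # [1, 4, 9]
```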

A local "distributed" scheduler might indeed also work, but at least when operating on a single machine, it makes sense to bring all the data into a single process once it has been loaded, for multi-threaded data analysis.

@mrocklin (Contributor)

Dask.distributed now creates a forkserver at startup. This seems to be working well so far; it nicely balances a well-defined environment with fast startup time.

How much inter-worker data transfer would you expect? It might be worth running through a few classic algorithms with it instead of the threaded scheduler and looking at performance changes. The diagnostic pages would be a nice bonus here and might help to highlight some performance issues.

If anyone is interested in this, the thing to do is:

$ conda install -c conda-forge dask distributed

>>> from dask.distributed import Client
>>> c = Client()  # sets global scheduler by default

And then operate as normal.
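"Operate as normal" means that once the client exists, subsequent dask collection computations route through it automatically (a sketch; `processes=False` is an assumption made here only to keep the example self-contained in one process):

```python
import dask.array as da
from dask.distributed import Client

client = Client(processes=False)  # registers itself as the default scheduler

x = da.arange(10, chunks=5)
total = x.sum().compute()  # executed by the distributed scheduler: 45
client.close()
```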

@max-sixty added the "plan to close" label on Dec 2, 2023

4 participants