|
1 | 1 | (algorithms)= |
2 | 2 | # Parallel Algorithms |
3 | 3 |
|
4 | | -`flox` outsources the core GroupBy operation to the vectorized implementations in |
5 | | -[numpy_groupies](https://github.com/ml31415/numpy-groupies). |
6 | | - |
7 | | -Running an efficient groupby reduction in parallel is hard, and strongly depends on how the |
8 | | -groups are distributed amongst the blocks of an array. |
| 4 | +`flox` outsources the core GroupBy operation to the vectorized implementations controlled by the |
| 5 | +[`engine` kwarg](engines.md). Applying these implementations to a parallel array type like dask
| 6 | +can be hard. Performance strongly depends on how the groups are distributed amongst the blocks of an array. |
9 | 7 |
|
10 | 8 | `flox` implements 4 strategies for grouped reductions, each appropriate for a particular distribution of groups
11 | 9 | among the blocks of a dask array. Switch between the various strategies by passing `method` |
12 | | -and/or `reindex` to either {py:func}`flox.core.groupby_reduce` or `xarray_reduce`. |
| 10 | +and/or `reindex` to either {py:func}`flox.groupby_reduce` or {py:func}`flox.xarray.xarray_reduce`. |
13 | 11 |
|
14 | 12 | Your options are: |
15 | 13 | 1. `method="map-reduce"` with `reindex=False` |
16 | 14 | 1. `method="map-reduce"` with `reindex=True` |
17 | | -1. `method="blockwise"` |
18 | | -1. `method="cohorts"` |
| 15 | +1. [`method="blockwise"`](method-blockwise) |
| 16 | +1. [`method="cohorts"`](method-cohorts) |
19 | 17 |
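For example, here is a minimal sketch of switching strategies with {py:func}`flox.groupby_reduce`. The array, group labels, and chunking below are made up purely for illustration; the `method` and `reindex` kwargs are the point.

```python
import dask.array as da
import numpy as np

import flox

labels = np.repeat(np.arange(6), 8)                # 6 groups along the last axis
array = da.random.random((4, 48), chunks=(4, 12))  # groups straddle block boundaries

# the default map-reduce strategy works for any distribution of groups
result, groups = flox.groupby_reduce(array, labels, func="sum", method="map-reduce")

# skip reindexing each blockwise result to the full set of groups
result, groups = flox.groupby_reduce(
    array, labels, func="sum", method="map-reduce", reindex=False
)
```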
|
20 | 18 | The most appropriate strategy for your problem will depend on the chunking of your dataset, |
21 | 19 | and the distribution of group labels across those chunks. |
22 | 20 |
|
| 21 | +```{tip} |
| 22 | +Currently these strategies are implemented for dask. We would like to generalize to other parallel array types
| 23 | +as appropriate (e.g. Ramba, cubed, arkouda). Please open an issue to discuss if you are interested. |
| 24 | +``` |
| 25 | + |
23 | 26 | (xarray-split)= |
24 | 27 | ## Background: Xarray's current GroupBy strategy |
25 | 28 |
|
@@ -82,9 +85,10 @@ A bigger advantage is that this approach allows grouping by a dask array so gro
82 | 85 | For example, consider `groupby("time.month")` with monthly frequency data and a chunksize of 4 along `time`.
83 | 86 |  |
84 | 87 | With `reindex=True`, each block will become 3x its original size at the blockwise step: input blocks have 4 timesteps while the output block
85 | | -has a value for all 12 months. One could use `reindex=False` to control memory usage but also see [`method="cohorts"`](cohorts) below. |
| 88 | +has a value for all 12 months. One could use `reindex=False` to control memory usage but also see [`method="cohorts"`](method-cohorts) below. |
86 | 89 |
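The snippet below mimics that setup with made-up data, as a sketch: with `reindex=True` every blockwise intermediate holds all 12 months, while with `reindex=False` each intermediate only holds the (at most 4) months present in its block.

```python
import dask.array as da
import numpy as np

import flox

months = np.tile(np.arange(1, 13), 4)   # 4 years of monthly labels
array = da.random.random(48, chunks=4)  # chunksize of 4 along time

# blockwise intermediates are expanded to all 12 months (3x the block size)
expanded, _ = flox.groupby_reduce(
    array, months, func="mean", method="map-reduce", reindex=True
)

# blockwise intermediates stay small; more work shifts to the tree-reduction
compact, _ = flox.groupby_reduce(
    array, months, func="mean", method="map-reduce", reindex=False
)
```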
|
87 | 90 |
|
| 91 | +(method-blockwise)= |
88 | 92 | ## `method="blockwise"` |
89 | 93 |
|
90 | 94 | One case where `method="map-reduce"` doesn't work well is that of "resampling" reductions. An
@@ -113,6 +117,7 @@ so that all members of a group are in a single block. Then, the groupby operatio |
113 | 117 | 1. Works better when multiple groups are already in a single block, so that the initial
114 | 118 |    rechunking only involves a small amount of communication.
115 | 119 |
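A hedged sketch of the happy path: the labels and chunking below are contrived so that every member of each group already lives in exactly one block, which is what `method="blockwise"` requires.

```python
import dask.array as da
import numpy as np

import flox

labels = np.repeat([0, 1, 2], 10)        # 3 groups, 10 elements each
array = da.random.random(30, chunks=10)  # chunks aligned with group boundaries

# each block holds one full group, so blocks reduce independently
# with no inter-block communication
result, groups = flox.groupby_reduce(array, labels, func="mean", method="blockwise")
```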
|
| 120 | +(method-cohorts)= |
116 | 121 | ## `method="cohorts"` |
117 | 122 |
|
118 | 123 | The `map-reduce` strategy is quite effective but can involve some unnecessary communication. It is sometimes possible to exploit
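As a hedged sketch of invoking this strategy, reusing the monthly example from above (whether `cohorts` actually helps depends entirely on your chunking):

```python
import dask.array as da
import numpy as np

import flox

months = np.tile(np.arange(1, 13), 4)   # monthly labels
array = da.random.random(48, chunks=4)  # chunksize of 4 along time

# months (1-4), (5-8), (9-12) always occur together in the same blocks, so
# flox can reduce each such "cohort" of groups with its own smaller map-reduce
result, groups = flox.groupby_reduce(array, months, func="mean", method="cohorts")
```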
|