Skip to content

Commit 8f2a50e

Browse files
committed
More updates
1 parent b225c4e commit 8f2a50e

File tree

9 files changed

+199
-150
lines changed

9 files changed

+199
-150
lines changed

docs/diagrams/new-map-reduce-reindex-False.svg

Lines changed: 125 additions & 125 deletions
Loading

docs/source/custom.md renamed to docs/source/aggregations.md

Lines changed: 23 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,27 @@
1-
# Custom reductions
1+
# Aggregations
22

3-
`flox` implements all common reductions provided by `numpy_groupies` in `aggregations.py`.
4-
It also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
3+
`flox` implements all common reductions provided by `numpy_groupies` in `aggregations.py`. Control this by passing
4+
the `func` kwarg:
5+
6+
- `"sum"`, `"nansum"`
7+
- `"prod"`, `"nanprod"`
8+
- `"count"` - number of non-NaN elements by group
9+
- `"mean"`, `"nanmean"`
10+
- `"var"`, `"nanvar"`
11+
- `"std"`, `"nanstd"`
12+
- `"argmin"`
13+
- `"argmax"`
14+
- `"first"`
15+
- `"last"`
16+
17+
18+
```{tip}
19+
We would like to add support for `cumsum`, `cumprod` ([issue](https://github.com/xarray-contrib/flox/issues/91)). Contributions are welcome!
20+
```
21+
22+
## Custom Aggregations
23+
24+
`flox` also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
525
though this might not be fully functional at the moment. See `aggregations.py` for examples.
626

727
See the ["Custom Aggregations"](user-stories/custom-aggregations.ipynb) user story for a more user-friendly example.

docs/source/api.rst

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -30,7 +30,7 @@ Visualization
3030
:toctree: generated/
3131

3232
visualize.draw_mesh
33-
visualize.visualize_groups
33+
visualize.visualize_groups_1d
3434
visualize.visualize_cohorts_2d
3535

3636
Aggregation Objects

docs/source/arrays.md

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
# Duck Array Support
2+
3+
Aggregating over other array types will work if the array types supports the following methods:

docs/source/conf.py

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -43,8 +43,8 @@
4343
]
4444

4545
extlinks = {
46-
"issue": ("https://github.com/xarray-contrib/flox/issues/%s", "GH#"),
47-
"pr": ("https://github.com/xarray-contrib/flox/pull/%s", "GH#"),
46+
"issue": ("https://github.com/xarray-contrib/flox/issues/%s", "GH#%s"),
47+
"pr": ("https://github.com/xarray-contrib/flox/pull/%s", "PR#%s"),
4848
}
4949

5050
templates_path = ["_templates"]
@@ -174,7 +174,7 @@
174174
"numpy": ("https://numpy.org/doc/stable", None),
175175
# "numba": ("https://numba.pydata.org/numba-doc/latest", None),
176176
"dask": ("https://docs.dask.org/en/latest", None),
177-
"xarray": ("http://xarray.pydata.org/en/stable/", None),
177+
"xarray": ("https://docs.xarray.dev/en/stable/", None),
178178
}
179179

180180
autosummary_generate = True

docs/source/engines.md

Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,20 @@
1+
(engines)=
12
# Engines
3+
4+
`flox` provides multiple options, using the `engine` kwarg, for computing the core GroupBy reduction on numpy or other array types other than dask.
5+
6+
1. `engine="numpy"` wraps `numpy_groupies.aggregate_numpy`. This uses indexing tricks and functions like `np.bincount`, or the ufunc `.at` methods
7+
(.e.g `np.maximum.at`) to provided reasonably performant aggregations.
8+
1. `engine="numba"` wraps `numpy_groupies.aggregate_numba`. This uses `numba` kernels for the core aggregation.
9+
1. `engine="flox"` uses the `ufunc.reduceat` method after first argsorting the array so that all group members occur sequentially. This was copied from
10+
a [gist by Stephan Hoyer](https://gist.github.com/shoyer/f538ac78ae904c936844)
11+
12+
There are some tradeoffs here. For the common case of reducing a nD array by a 1D array of group labels (e.g. `groupby("time.month")`), `engine="flox"` *can* be faster.
13+
The reason is that `numpy_groupies` converts all groupby problems to a 1D problem, this can involve [some overhead](https://github.com/ml31415/numpy-groupies/pull/46).
14+
It is possible to optimize this a bit in `flox` or `numpy_groupies` (though the latter is harder).
15+
The advantage of `engine="numpy"` is that it tends to work for more array types, since it appears to be more common to implement `np.bincount`, and not `np.add.reduceat`.
16+
17+
```{tip}
18+
Other potential engines we could add are [`numbagg`](https://github.com/numbagg/numbagg) ([stalled PR here](https://github.com/xarray-contrib/flox/pull/72)) and [`datashader`](https://github.com/xarray-contrib/flox/issues/142).
19+
Both use numba for high-performance aggregations. Contributions or discussion is very welcome!
20+
```

docs/source/implementation.md

Lines changed: 14 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -1,25 +1,28 @@
11
(algorithms)=
22
# Parallel Algorithms
33

4-
`flox` outsources the core GroupBy operation to the vectorized implementations in
5-
[numpy_groupies](https://github.com/ml31415/numpy-groupies).
6-
7-
Running an efficient groupby reduction in parallel is hard, and strongly depends on how the
8-
groups are distributed amongst the blocks of an array.
4+
`flox` outsources the core GroupBy operation to the vectorized implementations controlled by the
5+
[`engine` kwarg](engines.md). Applying these implementations on a parallel array type like dask
6+
can be hard. Performance strongly depends on how the groups are distributed amongst the blocks of an array.
97

108
`flox` implements 4 strategies for grouped reductions, each is appropriate for a particular distribution of groups
119
among the blocks of a dask array. Switch between the various strategies by passing `method`
12-
and/or `reindex` to either {py:func}`flox.core.groupby_reduce` or `xarray_reduce`.
10+
and/or `reindex` to either {py:func}`flox.groupby_reduce` or {py:func}`flox.xarray.xarray_reduce`.
1311

1412
Your options are:
1513
1. `method="map-reduce"` with `reindex=False`
1614
1. `method="map-reduce"` with `reindex=True`
17-
1. `method="blockwise"`
18-
1. `method="cohorts"`
15+
1. [`method="blockwise"`](method-blockwise)
16+
1. [`method="cohorts"`](method-cohorts)
1917

2018
The most appropriate strategy for your problem will depend on the chunking of your dataset,
2119
and the distribution of group labels across those chunks.
2220

21+
```{tip}
22+
Currently these strategieis are implemented for dask. We would like to generalize to other parallel array types
23+
as appropriate (e.g. Ramba, cubed, arkouda). Please open an issue to discuss if you are interested.
24+
```
25+
2326
(xarray-split)=
2427
## Background: Xarray's current GroupBy strategy
2528

@@ -82,9 +85,10 @@ A bigger advantagee is that this approach allows grouping by a dask array so gro
8285
For example, consider `groupby("time.month")` with monthly frequency data and chunksize of 4 along `time`.
8386
![cohorts-schematic](/../diagrams/cohorts-month-chunk4.png)
8487
With `reindex=True`, each block will become 3x its original size at the blockwise step: input blocks have 4 timesteps while output block
85-
has a value for all 12 months. One could use `reindex=False` to control memory usage but also see [`method="cohorts"`](cohorts) below.
88+
has a value for all 12 months. One could use `reindex=False` to control memory usage but also see [`method="cohorts"`](method-cohorts) below.
8689

8790

91+
(method-blockwise)=
8892
## `method="blockwise"`
8993

9094
One case where `method="map-reduce"` doesn't work well is the case of "resampling" reductions. An
@@ -113,6 +117,7 @@ so that all members of a group are in a single block. Then, the groupby operatio
113117
1. Works better when multiple groups are already in a single block; so that the intial
114118
rechunking only involves a small amount of communication.
115119

120+
(method-cohorts)=
116121
## `method="cohorts"`
117122

118123
The `map-reduce` strategy is quite effective but can involve some unnecessary communication. It can be possible to exploit

docs/source/index.md

Lines changed: 10 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -24,14 +24,14 @@ See a presentation ([video](https://discourse.pangeo.io/t/november-17-2021-flox-
2424

2525
## Why flox?
2626

27-
1. {py:func}`flox.groupby_reduce` wraps the `numpy-groupies` package for performant Groupby reductions on nD arrays.
28-
1. {py:func}`flox.groupby_reduce` provides [parallel-friendly strategies](algorithms) for GroupBy reductions by wrapping `numpy-groupies` for dask arrays.
29-
1. `flox` integrates with xarray to provide more performant Groupby and Resampling operations.
30-
1. {py:func}`flox.xarray.xarray_reduce` extends Xarray's GroupBy operations allowing lazy grouping by dask arrays, grouping by multiple arrays,
31-
as well as combining categorical grouping and histrogram-style binning operations using multiple variables.
27+
1. {py:func}`flox.groupby_reduce` [wraps](engines.md) the `numpy-groupies` package for performant Groupby reductions on nD arrays.
28+
1. {py:func}`flox.groupby_reduce` provides [parallel-friendly strategies](implementation.md) for GroupBy reductions by wrapping `numpy-groupies` for dask arrays.
29+
1. `flox` [integrates with xarray](xarray.md) to provide more performant Groupby and Resampling operations.
30+
1. {py:func}`flox.xarray.xarray_reduce` [extends](xarray.md) Xarray's GroupBy operations allowing lazy grouping by dask arrays, grouping by multiple arrays,
31+
as well as combining categorical grouping and histogram-style binning operations using multiple variables.
3232
1. `flox` also provides utility functions for rechunking both dask arrays and Xarray objects along a single dimension using the group labels as a guide:
33-
1. To rechunk for blockwise operations: {py:func}`flox.rechunk_for_blockwise`, {py:func}`flox.xarray.rechunk_for_blockwise`.
34-
1. To rechunk so that "cohorts", or groups of labels, tend to occur in the same chunks: {py:func}`flox.rechunk_for_cohorts`, {py:func}`flox.xarray.rechunk_for_cohorts`.
33+
1. To rechunk for blockwise operations: {py:func}`flox.rechunk_for_blockwise`, {py:func}`flox.xarray.rechunk_for_blockwise`.
34+
1. To rechunk so that "cohorts", or groups of labels, tend to occur in the same chunks: {py:func}`flox.rechunk_for_cohorts`, {py:func}`flox.xarray.rechunk_for_cohorts`.
3535

3636
## Installing
3737

@@ -59,9 +59,10 @@ It was motivated by many discussions in the [Pangeo](https://pangeo.io) communit
5959
.. toctree::
6060
:maxdepth: 1
6161
62-
implementation.md
62+
aggregations.md
6363
engines.md
64-
custom.md
64+
implementation.md
65+
arrays.md
6566
xarray.md
6667
api.rst
6768
user-stories.md

docs/source/xarray.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
(xarray)=
12
# Xarray
23

34
Xarray will use flox by default (if installed) for DataArrays containing numpy and dask arrays. The default choice is `method="cohorts"` which generalizes

0 commit comments

Comments
 (0)