Commit 09cd725

Update markdown, notebook linting (#204)
* Update markdown, notebook linting
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Update markdown, notebook linting
* Fix climatology notebook

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 8d7ccbc commit 09cd725

13 files changed: +214 −1422 lines

.pre-commit-config.yaml

Lines changed: 20 additions & 10 deletions
@@ -25,14 +25,24 @@ repos:
     hooks:
       - id: isort

-  - repo: https://github.com/deathbeds/prenotebook
-    rev: f5bdb72a400f1a56fe88109936c83aa12cc349fa
+  - repo: https://github.com/executablebooks/mdformat
+    rev: 0.7.16
     hooks:
-      - id: prenotebook
-        args:
-          [
-            '--keep-output',
-            '--keep-metadata',
-            '--keep-execution-count',
-            '--keep-empty',
-          ]
+      - id: mdformat
+        additional_dependencies:
+          - mdformat-black
+          - mdformat-myst
+
+  - repo: https://github.com/nbQA-dev/nbQA
+    rev: 1.6.1
+    hooks:
+      - id: nbqa-black
+      - id: nbqa-pyupgrade
+        args: [--py37-plus]
+      - id: nbqa-isort
+
+  - repo: https://github.com/kynan/nbstripout
+    rev: 0.6.1
+    hooks:
+      - id: nbstripout
+        args: [--extra-keys=metadata.kernelspec metadata.language_info.version]

README.md

Lines changed: 30 additions & 29 deletions
@@ -14,10 +14,10 @@
 This project explores strategies for fast GroupBy reductions with dask.array. It used to be called `dask_groupby`
 It was motivated by

-1. Dask Dataframe GroupBy
-    [blogpost](https://blog.dask.org/2019/10/08/df-groupby)
-2. [numpy_groupies](https://github.com/ml31415/numpy-groupies) in Xarray
-    [issue](https://github.com/pydata/xarray/issues/4473)
+1. Dask Dataframe GroupBy
+   [blogpost](https://blog.dask.org/2019/10/08/df-groupby)
+1. [numpy_groupies](https://github.com/ml31415/numpy-groupies) in Xarray
+   [issue](https://github.com/pydata/xarray/issues/4473)

 (See a
 [presentation](https://docs.google.com/presentation/d/1YubKrwu9zPHC_CzVBhvORuQBW-z148BvX3Ne8XcvWsQ/edit?usp=sharing)
@@ -26,22 +26,23 @@ about this package, from the Pangeo Showcase).
 ## Acknowledgements

 This work was funded in part by
+
 1. NASA-ACCESS 80NSSC18M0156 "Community tools for analysis of NASA Earth Observing System
-    Data in the Cloud" (PI J. Hamman, NCAR),
-2. NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
-3. [NCAR's Earth System Data Science Initiative](https://ncar.github.io/esds/).
+   Data in the Cloud" (PI J. Hamman, NCAR),
+1. NASA-OSTFL 80NSSC22K0345 "Enhancing analysis of NASA data with the open-source Python Xarray Library" (PIs Scott Henderson, University of Washington; Deepak Cherian, NCAR; Jessica Scheick, University of New Hampshire), and
+1. [NCAR's Earth System Data Science Initiative](https://ncar.github.io/esds/).

 It was motivated by [very](https://github.com/pangeo-data/pangeo/issues/266) [very](https://github.com/pangeo-data/pangeo/issues/271) [many](https://github.com/dask/distributed/issues/2602) [discussions](https://github.com/pydata/xarray/issues/2237) in the [Pangeo](https://pangeo.io) community.

 ## API

 There are two main functions
-1. `flox.groupby_reduce(dask_array, by_dask_array, "mean")`
-    "pure" dask array interface
-1. `flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean")`
-    "pure" xarray interface; though [work is ongoing](https://github.com/pydata/xarray/pull/5734) to integrate this
-    package in xarray.

+1. `flox.groupby_reduce(dask_array, by_dask_array, "mean")`
+   "pure" dask array interface
+1. `flox.xarray.xarray_reduce(xarray_object, by_dataarray, "mean")`
+   "pure" xarray interface; though [work is ongoing](https://github.com/pydata/xarray/pull/5734) to integrate this
+   package in xarray.

 ## Implementation

@@ -53,21 +54,21 @@ See [the documentation](https://flox.readthedocs.io/en/latest/implementation.htm
 It also allows you to specify a custom Aggregation (again inspired by dask.dataframe),
 though this might not be fully functional at the moment. See `aggregations.py` for examples.

-``` python
-mean = Aggregation(
-    # name used for dask tasks
-    name="mean",
-    # operation to use for pure-numpy inputs
-    numpy="mean",
-    # blockwise reduction
-    chunk=("sum", "count"),
-    # combine intermediate results: sum the sums, sum the counts
-    combine=("sum", "sum"),
-    # generate final result as sum / count
-    finalize=lambda sum_, count: sum_ / count,
-    # Used when "reindexing" at combine-time
-    fill_value=0,
-    # Used when any member of `expected_groups` is not found
-    final_fill_value=np.nan,
-)
+```python
+mean = Aggregation(
+    # name used for dask tasks
+    name="mean",
+    # operation to use for pure-numpy inputs
+    numpy="mean",
+    # blockwise reduction
+    chunk=("sum", "count"),
+    # combine intermediate results: sum the sums, sum the counts
+    combine=("sum", "sum"),
+    # generate final result as sum / count
+    finalize=lambda sum_, count: sum_ / count,
+    # Used when "reindexing" at combine-time
+    fill_value=0,
+    # Used when any member of `expected_groups` is not found
+    final_fill_value=np.nan,
+)
 ```
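For quick reference, here is a minimal sketch of how the two entry points listed in this README are typically called. The sample arrays, group labels, and variable names below are illustrative only and are not part of this commit:

```python
import numpy as np
import xarray as xr

import flox
import flox.xarray

# "pure" array interface: reduce 120 values by their month label
values = np.random.default_rng(0).random(120)
months = np.tile(np.arange(1, 13), 10)  # integer group labels, same length as values
result, groups = flox.groupby_reduce(values, months, func="mean")

# xarray interface: the same reduction expressed with DataArrays
da = xr.DataArray(values, dims="time", name="data")
by = xr.DataArray(months, dims="time", name="month")
monthly_mean = flox.xarray.xarray_reduce(da, by, func="mean")
```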

asv_bench/benchmarks/README_CI.md

Lines changed: 20 additions & 16 deletions
@@ -1,7 +1,9 @@
 # Benchmark CI

 <!-- Author: @jaimergp -->
+
 <!-- Last updated: 2021.07.06 -->
+
 <!-- Describes the work done as part of https://github.com/scikit-image/scikit-image/pull/5424 -->

 ## How it works
@@ -10,39 +12,39 @@ The `asv` suite can be run for any PR on GitHub Actions (check workflow `.github

 We use `asv continuous` to run the job, which runs a relative performance measurement. This means that there's no state to be saved and that regressions are only caught in terms of performance ratio (absolute numbers are available but they are not useful since we do not use stable hardware over time). `asv continuous` will:

-* Compile `scikit-image` for _both_ commits. We use `ccache` to speed up the process, and `mamba` is used to create the build environments.
-* Run the benchmark suite for both commits, _twice_ (since `processes=2` by default).
-* Generate a report table with performance ratios:
-    * `ratio=1.0` -> performance didn't change.
-    * `ratio<1.0` -> PR made it slower.
-    * `ratio>1.0` -> PR made it faster.
+- Compile `scikit-image` for _both_ commits. We use `ccache` to speed up the process, and `mamba` is used to create the build environments.
+- Run the benchmark suite for both commits, _twice_ (since `processes=2` by default).
+- Generate a report table with performance ratios:
+  - `ratio=1.0` -> performance didn't change.
+  - `ratio<1.0` -> PR made it slower.
+  - `ratio>1.0` -> PR made it faster.

 Due to the sensitivity of the test, we cannot guarantee that false positives are not produced. In practice, values between `(0.7, 1.5)` are to be considered part of the measurement noise. When in doubt, running the benchmark suite one more time will provide more information about the test being a false positive or not.

 ## Running the benchmarks on GitHub Actions

 1. On a PR, add the label `run-benchmark`.
-2. The CI job will be started. Checks will appear in the usual dashboard panel above the comment box.
-3. If more commits are added, the label checks will be grouped with the last commit checks _before_ you added the label.
-4. Alternatively, you can always go to the `Actions` tab in the repo and [filter for `workflow:Benchmark`](https://github.com/scikit-image/scikit-image/actions?query=workflow%3ABenchmark). Your username will be assigned to the `actor` field, so you can also filter the results with that if you need it.
+1. The CI job will be started. Checks will appear in the usual dashboard panel above the comment box.
+1. If more commits are added, the label checks will be grouped with the last commit checks _before_ you added the label.
+1. Alternatively, you can always go to the `Actions` tab in the repo and [filter for `workflow:Benchmark`](https://github.com/scikit-image/scikit-image/actions?query=workflow%3ABenchmark). Your username will be assigned to the `actor` field, so you can also filter the results with that if you need it.

 ## The artifacts

 The CI job will also generate an artifact. This is the `.asv/results` directory compressed in a zip file. Its contents include:

-* `fv-xxxxx-xx/`. A directory for the machine that ran the suite. It contains three files:
-    * `<baseline>.json`, `<contender>.json`: the benchmark results for each commit, with stats.
-    * `machine.json`: details about the hardware.
-* `benchmarks.json`: metadata about the current benchmark suite.
-* `benchmarks.log`: the CI logs for this run.
-* This README.
+- `fv-xxxxx-xx/`. A directory for the machine that ran the suite. It contains three files:
+  - `<baseline>.json`, `<contender>.json`: the benchmark results for each commit, with stats.
+  - `machine.json`: details about the hardware.
+- `benchmarks.json`: metadata about the current benchmark suite.
+- `benchmarks.log`: the CI logs for this run.
+- This README.

 ## Re-running the analysis

 Although the CI logs should be enough to get an idea of what happened (check the table at the end), one can use `asv` to run the analysis routines again.

 1. Uncompress the artifact contents in the repo, under `.asv/results`. This is, you should see `.asv/results/benchmarks.log`, not `.asv/results/something_else/benchmarks.log`. Write down the machine directory name for later.
-2. Run `asv show` to see your available results. You will see something like this:
+1. Run `asv show` to see your available results. You will see something like this:

 ```
 $> asv show
@@ -115,8 +117,10 @@ To minimize the time required to run the full suite, we trimmed the parameter ma
 ```python
 from . import _skip_slow  # this function is defined in benchmarks.__init__

+
 def time_something_slow():
     pass

+
 time_something.setup = _skip_slow
 ```
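The `_skip_slow` helper referenced above is defined elsewhere in `benchmarks/__init__.py` and is not part of this diff. As a rough sketch of the pattern (the environment-variable name here is hypothetical), it relies on asv's convention that raising `NotImplementedError` inside `setup` skips the benchmark:

```python
import os


def _skip_slow():
    # asv skips a benchmark when its setup() raises NotImplementedError,
    # so assigning this function to `<benchmark>.setup` opts it out of quick runs.
    if os.environ.get("ASV_SKIP_SLOW", "1") == "1":  # hypothetical switch
        raise NotImplementedError("Skipping slow benchmark")
```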

docs/source/aggregations.md

Lines changed: 16 additions & 17 deletions
@@ -14,7 +14,6 @@ the `func` kwarg:
 - `"first"`
 - `"last"`

-
 ```{tip}
 We would like to add support for `cumsum`, `cumprod` ([issue](https://github.com/xarray-contrib/flox/issues/91)). Contributions are welcome!
 ```
@@ -27,20 +26,20 @@ though this might not be fully functional at the moment. See `aggregations.py` f
 See the ["Custom Aggregations"](user-stories/custom-aggregations.ipynb) user story for a more user-friendly example.

 ```python
-    mean = Aggregation(
-        # name used for dask tasks
-        name="mean",
-        # operation to use for pure-numpy inputs
-        numpy="mean",
-        # blockwise reduction
-        chunk=("sum", "count"),
-        # combine intermediate results: sum the sums, sum the counts
-        combine=("sum", "sum"),
-        # generate final result as sum / count
-        finalize=lambda sum_, count: sum_ / count,
-        # Used when "reindexing" at combine-time
-        fill_value=0,
-        # Used when any member of `expected_groups` is not found
-        final_fill_value=np.nan,
-    )
+mean = Aggregation(
+    # name used for dask tasks
+    name="mean",
+    # operation to use for pure-numpy inputs
+    numpy="mean",
+    # blockwise reduction
+    chunk=("sum", "count"),
+    # combine intermediate results: sum the sums, sum the counts
+    combine=("sum", "sum"),
+    # generate final result as sum / count
+    finalize=lambda sum_, count: sum_ / count,
+    # Used when "reindexing" at combine-time
+    fill_value=0,
+    # Used when any member of `expected_groups` is not found
+    final_fill_value=np.nan,
+)
 ```
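Once defined, such an `Aggregation` can be passed in place of a string reduction name. A small, self-contained sketch assuming in-memory numpy inputs and that `Aggregation` and `groupby_reduce` are importable from the top-level `flox` namespace (the sample data is illustrative):

```python
import numpy as np

from flox import Aggregation, groupby_reduce

mean = Aggregation(
    name="mean",
    numpy="mean",
    chunk=("sum", "count"),
    combine=("sum", "sum"),
    finalize=lambda sum_, count: sum_ / count,
    fill_value=0,
    final_fill_value=np.nan,
)

values = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])

# pass the Aggregation instance instead of a string like "mean"
result, groups = groupby_reduce(values, labels, func=mean)
```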

docs/source/arrays.md

Lines changed: 1 addition & 2 deletions
@@ -2,9 +2,8 @@

 Aggregating over other array types will work if the array types supports the following methods, [ufunc.reduceat](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.reduceat.html) or [ufunc.at](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.at.html)

-
 | Reduction                      | `method="numpy"` | `method="flox"`   |
-|--------------------------------|------------------|-------------------|
+| ------------------------------ | ---------------- | ----------------- |
 | sum, nansum                    | bincount         | add.reduceat      |
 | mean, nanmean                  | bincount         | add.reduceat      |
 | var, nanvar                    | bincount         | add.reduceat      |
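As context for the table, the `method="numpy"` column essentially corresponds to weighted `np.bincount` calls. A rough illustration of a grouped sum and mean done that way (this is a sketch, not flox's actual code path):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
labels = np.array([0, 1, 0, 2, 1])  # integer group codes

# grouped sum: bincount with the values as weights
group_sums = np.bincount(labels, weights=values)  # [4., 7., 4.]

# grouped mean: divide by per-group counts
group_counts = np.bincount(labels)                # [2, 2, 1]
group_means = group_sums / group_counts           # [2. , 3.5, 4. ]
```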

docs/source/engines.md

Lines changed: 4 additions & 2 deletions
@@ -1,4 +1,5 @@
 (engines)=
+
 # Engines

 `flox` provides multiple options, using the `engine` kwarg, for computing the core GroupBy reduction on numpy or other array types other than dask.
@@ -7,13 +8,14 @@
 (.e.g `np.maximum.at`) to provided reasonably performant aggregations.
 1. `engine="numba"` wraps `numpy_groupies.aggregate_numba`. This uses `numba` kernels for the core aggregation.
 1. `engine="flox"` uses the `ufunc.reduceat` method after first argsorting the array so that all group members occur sequentially. This was copied from
-    a [gist by Stephan Hoyer](https://gist.github.com/shoyer/f538ac78ae904c936844)
+   a [gist by Stephan Hoyer](https://gist.github.com/shoyer/f538ac78ae904c936844)

 See [](arrays) for more details.

 ## Tradeoffs

-For the common case of reducing a nD array by a 1D array of group labels (e.g. `groupby("time.month")`), `engine="flox"` *can* be faster.
+For the common case of reducing a nD array by a 1D array of group labels (e.g. `groupby("time.month")`), `engine="flox"` *can* be faster.
+
 The reason is that `numpy_groupies` converts all groupby problems to a 1D problem, this can involve [some overhead](https://github.com/ml31415/numpy-groupies/pull/46).
 It is possible to optimize this a bit in `flox` or `numpy_groupies`, but the work has not been done yet.
 The advantage of `engine="numpy"` is that it tends to work for more array types, since it appears to be more common to implement `np.bincount`, and not `np.add.reduceat`.
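To make the `engine="flox"` strategy above concrete — argsort so that group members become contiguous, then a single `ufunc.reduceat` call — here is a rough standalone sketch (not flox's actual implementation):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
labels = np.array([2, 0, 1, 0, 2])

# sort so that members of each group are adjacent
order = np.argsort(labels, kind="stable")
sorted_labels = labels[order]
sorted_values = values[order]

# offsets of the first element of each group in the sorted array
offsets = np.concatenate(([0], np.flatnonzero(np.diff(sorted_labels)) + 1))

# one reduceat call computes all per-group sums: [60., 30., 60.]
group_sums = np.add.reduceat(sorted_values, offsets)
```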
