More efficient cohorts. #165

Merged: 16 commits into main on Oct 11, 2022

Conversation

@dcherian (Collaborator) commented on Oct 7, 2022

Closes #140

We apply the cohort "split" step after the blockwise reduction, then use
the tree reduction on each cohort.

We also use the `.blocks` accessor to index out blocks. This is still a
bit inefficient, since we split by indexing out regular arrays, so we may
index out blocks that don't contain any members of a cohort. However,
because we are splitting _after_ the blockwise reduction, the amount of
duplicated work can be a lot less than when splitting the bare array.

One side-effect is that "split-reduce" is now a synonym for "cohorts".
The reason is that `find_group_cohorts` returns a dict mapping blocks to
cohorts. We could invert that mapping, but I don't see any benefit in
trying to figure that out.
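
For illustration, here is a toy sketch of the kind of mapping `find_group_cohorts` produces; the block indices and labels below are made up for the example, not taken from a real dataset.

```python
# hypothetical illustration of the blocks -> cohort mapping:
# chunk (block) indices along the grouped axis -> labels reduced together
blocks_to_cohort = {
    (0, 1): [1, 2],   # blocks 0 and 1 only ever see labels 1 and 2
    (2, 3): [3, 4],   # blocks 2 and 3 only ever see labels 3 and 4
    (4,): [5],        # block 4 contains label 5 alone
}

# after the blockwise reduction, each entry drives one tree reduction
# over just those blocks, instead of one global tree reduction
```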

@dcherian (Collaborator, Author) commented on Oct 7, 2022

For this dataset:

[image of the dataset]

I get the following task counts (with `sort=False`):

|                       | unoptimized | optimized |
|-----------------------|-------------|-----------|
| `main` cohorts        | 9563        | 1307      |
| this PR cohorts       | 6866        | 1351      |
| this PR after c48041c | 6425        | 1221      |
| map-reduce            | 5638        | 1018      |
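
One way to reproduce this kind of count is sketched below; the `count_tasks` helper is hypothetical, not part of flox.

```python
import dask


def count_tasks(arr):
    """Return (unoptimized, optimized) task counts for a dask collection."""
    # number of tasks in the raw graph
    unoptimized = len(arr.__dask_graph__())
    # number of tasks after dask's low-level graph optimizations,
    # i.e. roughly what the scheduler actually executes
    (optimized_arr,) = dask.optimize(arr)
    optimized = len(optimized_arr.__dask_graph__())
    return unoptimized, optimized
```

Something like this, called on the dask array returned by `groupby_reduce` for each strategy, would produce the two columns above.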

Looking at graphs for the first two blocks:

- This branch should show lower memory usage, but I need a benchmark for this.
- This branch has a lot of `blocks` tasks because I extract blocks in a loop for advanced indexing (see the sketch after this list). I think that could be optimized a bit to reduce how much I loop, but I'm not too hopeful.
- I could do a blockwise reduction instead of a tree-reduce where that makes sense, but that seems minor.
- Note that by a simple num-tasks metric, map-reduce looks great, but it involves a lot of unnecessary communication. We are trading a larger number of tasks for fewer transfers and lower communication load.
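
To make that looping concrete, here is a minimal sketch of the pattern using plain dask; the array, chunking, and cohort block indices are invented for the example, and this is not flox's actual code.

```python
import dask.array as da
import numpy as np

# stand-in for the intermediate blockwise results, chunked along the grouped axis
x = da.from_array(np.arange(40).reshape(4, 10), chunks=(4, 2))

# block indices along the reduction axis belonging to one (made-up) cohort
cohort_blocks = [0, 2, 4]

# index out each block with the .blocks accessor and stitch them together;
# every .blocks[...] indexing op adds its own layer, hence the extra tasks
subset = da.concatenate([x.blocks[:, i] for i in cohort_blocks], axis=-1)

# the cohort is then reduced with dask's usual tree reduction over this subset
result = subset.sum(axis=-1)
```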

This branch's graph:

[image: task graph for this branch]

`main` branch's graph:

[image: task graph on main]

@dcherian (Collaborator, Author) commented on Oct 8, 2022

For this array, reducing over the counties in the previous image:

[image of the array]

|                  | map-reduce | cohorts   |
|------------------|------------|-----------|
| number of tasks  | 1856       | 6420      |
| compute time     | 2372.83 s  | 2075.00 s |
| deserialize time | 4.68 s     | 2.87 s    |
| transfer time    | 15.18 s    | 5.17 s    |
| total transfer   | 469.56 MB  | 164.44 MB |

So: we transfer less data and spend less time on those transfers, exactly as hoped. This was on my laptop, so in theory this might make a difference on cloud/HPC systems.

This is with `reindex=True`. I did find map-reduce with `reindex=False` to be faster, but with similar network transfers.

The compute time is massively dominated by the blockwise reduction with `numpy_groupies`, so optimizing that would be very impactful!
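
For completeness, here is a rough sketch of how a comparison like this can be run; the array, chunking, and labels below are placeholders, and `performance_report` is just one convenient way to collect a transfer/compute breakdown like the table above.

```python
import numpy as np
import dask.array as da
from dask.distributed import Client, performance_report
from flox.core import groupby_reduce

client = Client()  # local cluster, like the laptop run above

# placeholder stand-ins for the real data: a (time, space) array chunked in
# space, with integer "county" labels along the space axis
array = da.random.random((10, 5000), chunks=(10, 500))
by = np.repeat(np.arange(50), 100)

with performance_report(filename="cohorts-report.html"):
    result, groups = groupby_reduce(
        array, by, func="mean", method="cohorts", reindex=True
    )
    result.compute()
```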

We sort the members in each cohort (as earlier), and also sort cohorts by the
first label in each cohort. This means we preserve order as much as possible,
which should help when sorting the final result, especially for
resampling-type operations.
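
A toy sketch of what that ordering looks like (the blocks-to-cohort mapping here is made up):

```python
# made-up blocks -> cohort mapping
cohorts = {(3, 4): [8, 7], (0, 1): [2, 1], (2,): [5]}

# sort members within each cohort, then order cohorts by their first label
cohorts = {blocks: sorted(labels) for blocks, labels in cohorts.items()}
cohorts = dict(sorted(cohorts.items(), key=lambda item: item[1][0]))
# -> {(0, 1): [1, 2], (2,): [5], (3, 4): [7, 8]}
```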
This reverts commit 3754cff.

Again, I don't see any benefits to this.
@dcherian enabled auto-merge (squash) on October 11, 2022, 21:35
@dcherian merged commit 72dfc87 into main on Oct 11, 2022
@dcherian deleted the better-cohorts branch on October 11, 2022, 21:42
dcherian added a commit that referenced this pull request Oct 17, 2022
* main: (29 commits)
  Major fix to subset_to_blocks (#173)
  Performance improvements for cohorts detection (#172)
  Remove split_out (#170)
  Deprecate resample_reduce (#169)
  More efficient cohorts. (#165)
  Allow specifying output dtype (#131)
  Add a dtype check for numpy arrays in assert_equal (#158)
  Update ci-additional.yaml (#167)
  Refactor before redoing cohorts (#164)
  Fix mypy errors in core.py (#150)
  Add link to numpy_groupies (#160)
  Bump codecov/codecov-action from 3.1.0 to 3.1.1 (#159)
  Use math.prod instead of np.prod (#157)
  Remove None output from _get_expected_groups (#152)
  Fix mypy errors in xarray.py, xrutils.py, cache.py (#144)
  Raise error if multiple by's are used with Ellipsis (#149)
  pre-commit autoupdate (#148)
  Add mypy ignores (#146)
  Get pre commit bot to update (#145)
  Remove duplicate examples headers (#147)
  ...
Successfully merging this pull request may close these issues: Improved cohorts.