More efficient cohorts. #165

Merged: 16 commits into main on Oct 11, 2022

Conversation

@dcherian (Collaborator) commented on Oct 7, 2022

Closes #140

We apply the cohort "split" step after the blockwise reduction, then use
the tree reduction on each cohort.

We also use the `.blocks` accessor to index out blocks. This is still a
bit inefficient, since we split by indexing out regular arrays, so we may
index out blocks that don't contain any members of a cohort. However,
because we are splitting _after_ the blockwise reduction, the amount of
duplicated work can be a lot less than when splitting the bare array.

One side-effect is that "split-reduce" is now a synonym for "cohorts".
The reason is that `find_group_cohorts` returns a dict mapping blocks to
cohorts. We could invert that mapping, but I don't see any benefit in
trying to figure that out.
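
For illustration, here is a toy sketch of the kind of mapping `find_group_cohorts` produces; the block indices and labels below are made up for the example, not taken from a real dataset.

```python
# hypothetical illustration of the blocks -> cohort mapping:
# chunk (block) indices along the grouped axis -> labels reduced together
blocks_to_cohort = {
    (0, 1): [1, 2],   # blocks 0 and 1 only ever see labels 1 and 2
    (2, 3): [3, 4],   # blocks 2 and 3 only ever see labels 3 and 4
    (4,): [5],        # block 4 contains label 5 alone
}

# after the blockwise reduction, each entry drives one tree reduction
# over just those blocks, instead of one global tree reduction
```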

@dcherian (Collaborator, Author) commented on Oct 7, 2022

For this dataset:

[image of the dataset]

I get the following task counts (with `sort=False`):

|                       | unoptimized | optimized |
|-----------------------|-------------|-----------|
| `main` cohorts        | 9563        | 1307      |
| this PR cohorts       | 6866        | 1351      |
| this PR after c48041c | 6425        | 1221      |
| map-reduce            | 5638        | 1018      |
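
One way to reproduce this kind of count is sketched below; the `count_tasks` helper is hypothetical, not part of flox.

```python
import dask


def count_tasks(arr):
    """Return (unoptimized, optimized) task counts for a dask collection."""
    # number of tasks in the raw graph
    unoptimized = len(arr.__dask_graph__())
    # number of tasks after dask's low-level graph optimizations,
    # i.e. roughly what the scheduler actually executes
    (optimized_arr,) = dask.optimize(arr)
    optimized = len(optimized_arr.__dask_graph__())
    return unoptimized, optimized
```

Something like this, called on the dask array returned by `groupby_reduce` for each strategy, would produce the two columns above.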

Looking at graphs for the first two blocks:

- This branch should show lower memory usage, but I need a benchmark for this.
- This branch has a lot of `blocks` tasks because I extract blocks in a loop for advanced indexing (see the sketch after this list). I think that could be optimized a bit to reduce how much I loop, but I'm not too hopeful.
- I could do a blockwise reduction instead of a tree-reduce where that makes sense, but that seems minor.
- Note that by a simple num-tasks metric, map-reduce looks great, but it involves a lot of unnecessary communication. We are trading a larger number of tasks for fewer transfers and lower communication load.
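
To make that looping concrete, here is a minimal sketch of the pattern using plain dask; the array, chunking, and cohort block indices are invented for the example, and this is not flox's actual code.

```python
import dask.array as da
import numpy as np

# stand-in for the intermediate blockwise results, chunked along the grouped axis
x = da.from_array(np.arange(40).reshape(4, 10), chunks=(4, 2))

# block indices along the reduction axis belonging to one (made-up) cohort
cohort_blocks = [0, 2, 4]

# index out each block with the .blocks accessor and stitch them together;
# every .blocks[...] indexing op adds its own layer, hence the extra tasks
subset = da.concatenate([x.blocks[:, i] for i in cohort_blocks], axis=-1)

# the cohort is then reduced with dask's usual tree reduction over this subset
result = subset.sum(axis=-1)
```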

This branch's graph:

[image: task graph for this branch]

`main` branch's graph:

[image: task graph on main]

@dcherian (Collaborator, Author) commented on Oct 8, 2022

For this array, reducing over the counties in the previous image:

[image of the array]

|                  | map-reduce | cohorts   |
|------------------|------------|-----------|
| number of tasks  | 1856       | 6420      |
| compute time     | 2372.83 s  | 2075.00 s |
| deserialize time | 4.68 s     | 2.87 s    |
| transfer time    | 15.18 s    | 5.17 s    |
| total transfer   | 469.56 MB  | 164.44 MB |

So: we transfer less data and spend less time on those transfers, exactly as hoped. This was on my laptop, so in theory this might make a difference on cloud/HPC systems.

This is with `reindex=True`. I did find map-reduce with `reindex=False` to be faster, but with similar network transfers.

The compute time is massively dominated by the blockwise reduction with `numpy_groupies`, so optimizing that would be very impactful!
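
For completeness, here is a rough sketch of how a comparison like this can be run; the array, chunking, and labels below are placeholders, and `performance_report` is just one convenient way to collect a transfer/compute breakdown like the table above.

```python
import numpy as np
import dask.array as da
from dask.distributed import Client, performance_report
from flox.core import groupby_reduce

client = Client()  # local cluster, like the laptop run above

# placeholder stand-ins for the real data: a (time, space) array chunked in
# space, with integer "county" labels along the space axis
array = da.random.random((10, 5000), chunks=(10, 500))
by = np.repeat(np.arange(50), 100)

with performance_report(filename="cohorts-report.html"):
    result, groups = groupby_reduce(
        array, by, func="mean", method="cohorts", reindex=True
    )
    result.compute()
```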

We sort the members in each cohort (as earlier), and also sort cohorts by the
first label in each cohort. This means we preserve order as much as possible,
which should help when sorting the final result, especially for
resampling-type operations.
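
A toy sketch of what that ordering looks like (the blocks-to-cohort mapping here is made up):

```python
# made-up blocks -> cohort mapping
cohorts = {(3, 4): [8, 7], (0, 1): [2, 1], (2,): [5]}

# sort members within each cohort, then order cohorts by their first label
cohorts = {blocks: sorted(labels) for blocks, labels in cohorts.items()}
cohorts = dict(sorted(cohorts.items(), key=lambda item: item[1][0]))
# -> {(0, 1): [1, 2], (2,): [5], (3, 4): [7, 8]}
```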
This reverts commit 3754cff.

Again, I don't see any benefits to this.
@dcherian enabled auto-merge (squash) on October 11, 2022, 21:35
@dcherian merged commit 72dfc87 into main on Oct 11, 2022
@dcherian deleted the better-cohorts branch on October 11, 2022, 21:42
dcherian added a commit that referenced this pull request Oct 17, 2022
* main: (29 commits)
  Major fix to subset_to_blocks (#173)
  Performance improvements for cohorts detection (#172)
  Remove split_out (#170)
  Deprecate resample_reduce (#169)
  More efficient cohorts. (#165)
  Allow specifying output dtype (#131)
  Add a dtype check for numpy arrays in assert_equal (#158)
  Update ci-additional.yaml (#167)
  Refactor before redoing cohorts (#164)
  Fix mypy errors in core.py (#150)
  Add link to numpy_groupies (#160)
  Bump codecov/codecov-action from 3.1.0 to 3.1.1 (#159)
  Use math.prod instead of np.prod (#157)
  Remove None output from _get_expected_groups (#152)
  Fix mypy errors in xarray.py, xrutils.py, cache.py (#144)
  Raise error if multiple by's are used with Ellipsis (#149)
  pre-commit autoupdate (#148)
  Add mypy ignores (#146)
  Get pre commit bot to update (#145)
  Remove duplicate examples headers (#147)
  ...
Successfully merging this pull request may close these issues: Improved cohorts.