More efficient cohorts. #165
Merged
Conversation
Closes #140

We apply the cohort "split" step after the blockwise reduction, then use the tree reduction on each cohort. We also use the `.blocks` accessor to index out blocks.

This is still a bit inefficient since we split by indexing out regular arrays, so we could index out blocks that don't contain any cohort members. However, because we are splitting _after_ the blockwise reduction, the amount of work duplication can be a lot less than splitting the bare array.

One side-effect is that "split-reduce" is now a synonym for "cohorts". The reason is that `find_group_cohorts` returns a dict mapping blocks to cohorts. We could invert that behaviour but I don't see any benefit to trying to figure that out.
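The shape of the idea can be sketched with plain numpy (hypothetical data and names; the real implementation lives in flox and indexes dask blocks via the `.blocks` accessor):

```python
import numpy as np

# Hypothetical per-block partial results produced by the blockwise reduction,
# one entry per block along the reduced axis.
blockwise_results = [
    np.array([3, 1]),
    np.array([2, 4]),
    np.array([5, 0]),
    np.array([1, 6]),
]

# A find_group_cohorts-style mapping: block indices -> cohort of group labels.
cohorts = {(0, 1): ("a", "b"), (2, 3): ("c", "d")}

# Split *after* the blockwise reduction: pick out only the blocks that
# belong to each cohort, then combine them (a stand-in for the tree reduction).
combined = {
    labels: np.sum([blockwise_results[i] for i in blocks], axis=0)
    for blocks, labels in cohorts.items()
}
print(combined)  # {('a', 'b'): array([5, 5]), ('c', 'd'): array([6, 6])}
```

Because each cohort only touches the blockwise partial results for its own blocks, the duplicated work is limited to the small combine step rather than re-reducing the bare input array.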
Looking at graphs for the first two blocks:

[graph comparison: this branch vs. main branch]
We sort the members in each cohort (as before), and also sort cohorts by their first label. This preserves order as much as possible, which should help when sorting the final result, especially for resampling-type operations.
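A toy sketch of that ordering (hypothetical cohorts of integer group labels):

```python
# Hypothetical cohorts of group labels, in arbitrary discovery order.
cohorts = [(3, 1), (5, 2), (0, 4)]

# Sort the members within each cohort, then sort the cohorts themselves
# by their first label, preserving the original label order as much as possible.
sorted_cohorts = sorted((tuple(sorted(c)) for c in cohorts), key=lambda c: c[0])
print(sorted_cohorts)  # [(0, 4), (1, 3), (2, 5)]
```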
This reverts commit 3754cff. Again, I don't see any benefit to this.
dcherian added a commit that referenced this pull request on Oct 17, 2022:
* main: (29 commits)
  - Major fix to subset_to_blocks (#173)
  - Performance improvements for cohorts detection (#172)
  - Remove split_out (#170)
  - Deprecate resample_reduce (#169)
  - More efficient cohorts. (#165)
  - Allow specifying output dtype (#131)
  - Add a dtype check for numpy arrays in assert_equal (#158)
  - Update ci-additional.yaml (#167)
  - Refactor before redoing cohorts (#164)
  - Fix mypy errors in core.py (#150)
  - Add link to numpy_groupies (#160)
  - Bump codecov/codecov-action from 3.1.0 to 3.1.1 (#159)
  - Use math.prod instead of np.prod (#157)
  - Remove None output from _get_expected_groups (#152)
  - Fix mypy errors in xarray.py, xrutils.py, cache.py (#144)
  - Raise error if multiple by's are used with Ellipsis (#149)
  - pre-commit autoupdate (#148)
  - Add mypy ignores (#146)
  - Get pre commit bot to update (#145)
  - Remove duplicate examples headers (#147)
  - ...