Fix median miscalculation for even-sized item list#2224
Fix median miscalculation for even-sized item list#2224texodus merged 4 commits intoperspective-dev:masterfrom
Conversation
|
Thank you for your contribution and Welcome to our Open Source Community! To make sure your pull request is accepted successfully, we ask all our open source contributors to sign a Contributor License Agreement; having reviewed our contributor list, we require a CLA for the following people : (@adetsi). In order to sign a CLA with FINOS, just submit a Pull Request with a simple change to this file (adding an empty line, or some random text at the bottom); this will trigger the EasyCLA bot, which will post a comment to the Pull Request stating whether all PR contributors are covered by FINOS CLA; if not covered, the bot will post instructions on how to sign the CLA. Thanks once again for your contribution. Let us work with you to make the CLA process quick, easy and efficient so we can move forward with reviewing and accepting your pull request. Feel free to email help@finos.org for any questions. |
|
@cla-bot[bot] check |
|
The cla-bot has been summoned, and re-checked this pull request! |
|
@adetsi There is already an open PR for this issue #2197. That said, neither this PR nor #2197 actually address the issue I described in my response to #1928, which is that the |
|
Thanks for the clarification, I have added implementations to accommodate for non-numeric types as well (preserves the old behavior). |
|
@texodus |
There was a problem hiding this comment.
Thanks for the update @adetsi. This implementation is better but the failing tests indicate another issue, that median can't be calculated for integer columns without changing the aggregate column's output type to float; trying to calculate the median without this will generate all 0's. In order to do this, you'd need to set t_tscalar median_average explicitly, as well as setting the associated metadata for the schema transition in a few other places around the codebase. (this is done already for e.g. count which goes from any type to integer, if you need an example to grep for).
In lieu of this, it would be easier to just implement this behavior for float columns instead of numeric ones, by changing the feature-check from is_numeric to is_floating_point. integer columns will then stay integer when applied median() and exhibit the old behavior, while columns which are float will use the new behavior. This would be an acceptable alternative if the former implementation for integer-column-type-upgrading is too complex (it may involve updating python and the UI as well).
See inline review notes below, there are also a few small style and efficiency improvements, and we should preserve the current test inputs and their integer columns.
EDIT
I've implement these changes here for you to cherry-pick if you choose.
|
|
||
| median_average.set((*first_middle + *second_middle) / static_cast<t_tscalar>(2)); | ||
| return median_average; | ||
| }else{ |
There was a problem hiding this comment.
Formatting - please run clang-format.
| int size = values.size(); | ||
| bool is_even_size = size % 2 == 0; | ||
|
|
||
| if (is_even_size && values[0].is_numeric()){ |
There was a problem hiding this comment.
See review summary - is_numeric() case here should be exclusively float column types via is_floating_point().
| std::vector<t_tscalar>::iterator first_middle = values.begin() + ((size - 1) / 2); | ||
| std::vector<t_tscalar>::iterator second_middle = values.begin() + (size / 2); | ||
|
|
||
| nth_element(values.begin(), first_middle, values.end()); |
There was a problem hiding this comment.
nth_element() does not need to be called twice here, the column is guaranteed to be even and min 2, so this is equivalent to *(second_middle - 1).
| values.begin(), middle, values.end()); | ||
|
|
||
| return *middle; | ||
| return get_aggregate_median(values); |
There was a problem hiding this comment.
Just a suggestion - if we're going to factor out this logic into a function called get_aggregate_median(values), maybe we should also move the 0 and 1 cases into this definition to 1) make it complete and 2) move the entire closure body to a single method call.
|
@texodus thank you for the review and comments, I hope everything looks good now. |
|
Thanks for the PR @adetsi! Looks good! |
Fix for issue (#1928) where median value for even-sized item list is miscalculated, see below.
Before
After