-
Notifications
You must be signed in to change notification settings - Fork 367
Implement nested renaming for groupby agg #904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@charlesdong1991 Could you put the result of the example in the description? |
|
sorry, i actually didn't know how to add the result dataframe there, so didn't put it in description, but it is the same one in the test. I added a screenshot for this instead. @ueshin |
|
@charlesdong1991 thanks, the screenshot is fine. |
|
@charlesdong1991 |
|
ahh, yeah, kinda first step to implement named aggregation. @harupy |
Codecov Report
@@ Coverage Diff @@
## master #904 +/- ##
==========================================
- Coverage 94.3% 94.28% -0.02%
==========================================
Files 34 34
Lines 6213 6244 +31
==========================================
+ Hits 5859 5887 +28
- Misses 354 357 +3
Continue to review full report at Codecov.
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update doc string? Also doctests would be great?
databricks/koalas/groupby.py
Outdated
| order = [] | ||
| columns, pairs = list(zip(*kwargs.items())) | ||
|
|
||
| for name, (column, aggfunc) in zip(columns, pairs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like name is not used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh, yeah, wired that my editor didn't complain 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@charlesdong1991
I think your code can be simplified as follows:
for column, aggfunc in pairs:
if column in aggspec:
aggspec[column].append(aggfunc)
else:
aggspec[column] = [aggfunc]
return aggspec, columns, pairsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, seems not needed to zip columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, this is my bad, i shouldn't be lazy at this. @harupy
databricks/koalas/groupby.py
Outdated
| 1 1 2 0.227 0.362 | ||
| 2 3 4 -0.562 1.267 | ||
| To control the output names with different aggregations per column, koalas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
koalas -> Koalas
|
|
||
| # this is only applied in version after 0.25.0 | ||
| if pd.__version__ < "0.25.0": | ||
| return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you use @unittest.skipIf( ... ) instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
|
|
||
| @staticmethod | ||
| def test_is_multi_agg_with_relabel(): | ||
| from databricks.koalas.groupby import _is_multi_agg_with_relabel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we move this to the file header?
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise, LGTM.
Softagram Impact Report for pull/904 (head commit: 224dc30)⭐ Change Overview
📄 Full report
Impact Report explained. Give feedback on this report to [email protected] |
|
Hmm, I found a problem with multi-index columns: >>> kdf = ks.DataFrame({"group": ['a', 'a', 'b', 'b'], "A": [0, 1, 2, 3], "B": [5, 6, 7, 8]})
>>> kdf.columns = pd.MultiIndex.from_tuples([('x', 'group'), ('y', 'A'), ('y', 'B')])
>>> kdf
x y
group A B
0 a 0 5
1 a 1 6
2 b 2 7
3 b 3 8
>>> kdf.groupby(('x', 'group')).agg(a_max=(('y', 'A'), "max"))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/groupby.py", line 179, in aggregate
kdf = kdf[order]
File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7272, in __getitem__
return self._pd_getitem(key)
File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7208, in _pd_getitem
return self.loc[:, key]
File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/indexing.py", line 447, in __getitem__
raise ValueError('All the key level should be the same as column index level.')
ValueError: All the key level should be the same as column index level.@charlesdong1991 Is it possible to support this case? |
|
i don't think pandas supports such case. I haven't checked, will check it once i am back home. Or we could have it in a separate PR if you agree. @ueshin If pandas doesn't support it, i will create PR to solve it there first and then implement it here. |
|
I'm fine with having it in a separate PR. I'd merge this now. |
|
Thanks! merging. |

This allows koalas to have nested renaming/selection:
WIP: still need to add some tests and docstrings