Implement nested renaming for groupby agg #904

charlesdong1991 · 2019-10-07T21:00:12Z

This allows Koalas to support nested renaming/selection in groupby agg:

kdf = ks.DataFrame({'A': [1, 1, 2, 2],
                   'B': [1, 2, 3, 4],
                   'C': [0.362, 0.227, 1.267, -0.562]},
                  columns=['A', 'B', 'C'])
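
To make the semantics concrete, here is a minimal plain-Python sketch of what nested renaming computes for the data above. It is not Koalas code: `named_agg` and the `AGG_FUNCS` table are hypothetical stand-ins, and the keyword form `output_name=(column, aggfunc)` is assumed from the `agg(a_max=(('y', 'A'), "max"))` call quoted in the multi-index report in this thread.

```python
# Plain-Python illustration only; Koalas and pandas do this on DataFrames.
data = {
    "A": [1, 1, 2, 2],
    "B": [1, 2, 3, 4],
    "C": [0.362, 0.227, 1.267, -0.562],
}

# Hypothetical lookup from aggregation names to Python callables.
AGG_FUNCS = {"min": min, "max": max, "sum": sum}


def named_agg(columns, by, **kwargs):
    """Group `columns` by the `by` column and compute one output per keyword.

    Each keyword is an output name mapped to an (input column, aggfunc) pair.
    """
    # Collect the row positions that belong to each group key.
    groups = {}
    for position, key in enumerate(columns[by]):
        groups.setdefault(key, []).append(position)

    result = {}
    for key in sorted(groups):
        result[key] = {
            name: AGG_FUNCS[aggfunc]([columns[column][i] for i in groups[key]])
            for name, (column, aggfunc) in kwargs.items()
        }
    return result


result = named_agg(data, "A", b_max=("B", "max"), c_min=("C", "min"))
print(result)
```

With Koalas, the corresponding call would presumably be `kdf.groupby('A').agg(b_max=('B', 'max'), c_min=('C', 'min'))`; that exact call is an assumption, since the description does not show it.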

WIP: still need to add some tests and docstrings

ueshin · 2019-10-07T21:13:52Z

@charlesdong1991 Could you put the result of the example in the description?

charlesdong1991 · 2019-10-07T21:18:05Z

Sorry, I didn't know how to add the result DataFrame there, so I left it out of the description, but it is the same one as in the test. I added a screenshot instead. @ueshin

ueshin · 2019-10-07T21:41:37Z

@charlesdong1991 thanks, the screenshot is fine.

harupy · 2019-10-08T01:42:58Z

@charlesdong1991
Is this related to #823 ?

charlesdong1991 · 2019-10-08T06:42:08Z

Ah, yes, it's a first step toward implementing named aggregation. @harupy

codecov-io · 2019-10-08T09:32:09Z

Codecov Report

Merging #904 into master will decrease coverage by 0.01%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master     #904      +/-   ##
==========================================
- Coverage    94.3%   94.28%   -0.02%     
==========================================
  Files          34       34              
  Lines        6213     6244      +31     
==========================================
+ Hits         5859     5887      +28     
- Misses        354      357       +3

Impacted Files	Coverage Δ
databricks/koalas/groupby.py	`91.13% <96.29%> (+0.29%)`	⬆️
databricks/koalas/series.py	`95.29% <0%> (-0.28%)`	⬇️
databricks/koalas/missing/series.py	`100% <0%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9e38e54...224dc30.

ueshin

Could you update the docstring? Doctests would also be great.

ueshin · 2019-10-08T12:42:06Z

databricks/koalas/groupby.py

+    order = []
+    columns, pairs = list(zip(*kwargs.items()))
+
+    for name, (column, aggfunc) in zip(columns, pairs):


Seems like name is not used?

Ah, yes, weird that my editor didn't complain 😅

@charlesdong1991
I think your code can be simplified as follows:

    for column, aggfunc in pairs:
        if column in aggspec:
            aggspec[column].append(aggfunc)
        else:
            aggspec[column] = [aggfunc]
    return aggspec, columns, pairs

Yeah, zipping columns doesn't seem to be needed.

Thanks, that's my bad; I shouldn't have been lazy about this. @harupy
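
For readers following the thread, the simplified loop can be wrapped into a runnable sketch. The function name `normalize_keyword_aggregation` and the exact return shape are assumptions pieced together from the fragments quoted in this review; the merged Koalas helper may differ.

```python
def normalize_keyword_aggregation(kwargs):
    """Split named-aggregation kwargs into an aggspec plus names and pairs.

    Hypothetical reconstruction. Assumes `kwargs` is non-empty, because
    unpacking zip() of an empty mapping would raise ValueError.
    """
    aggspec = {}
    # Keys are the output names, values are (column, aggfunc) pairs.
    columns, pairs = zip(*kwargs.items())

    for column, aggfunc in pairs:
        if column in aggspec:
            aggspec[column].append(aggfunc)
        else:
            aggspec[column] = [aggfunc]
    return aggspec, columns, pairs


aggspec, columns, pairs = normalize_keyword_aggregation(
    {"b_max": ("B", "max"), "b_min": ("B", "min"), "c_mean": ("C", "mean")}
)
print(aggspec)
print(columns)
```

Here `aggspec` groups the requested functions per input column, which is the shape a plain dict-based `agg` call accepts, while `columns` keeps the user-chosen output names for renaming afterwards.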

ueshin · 2019-10-08T15:25:46Z

databricks/koalas/groupby.py

        1    1    2  0.227  0.362
        2    3    4 -0.562  1.267
+
+        To control the output names with different aggregations per column, koalas


koalas -> Koalas

ueshin · 2019-10-08T15:33:39Z

databricks/koalas/tests/test_groupby.py

+
+        # this is only applied in version after 0.25.0
+        if pd.__version__ < "0.25.0":
+            return


Could you use @unittest.skipIf( ... ) instead?
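
A minimal sketch of the `skipIf` form, under some assumptions: `version_tuple` is a hypothetical helper, `PANDAS_VERSION` is a stand-in for `pd.__version__` so the snippet runs without pandas, and the test body is a placeholder.

```python
import re
import unittest


def version_tuple(version):
    """Parse '0.25.0' into (0, 25, 0). Hypothetical helper; pre-release tags are ignored."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


# Stand-in for pd.__version__ so the sketch runs without pandas installed.
PANDAS_VERSION = "0.24.2"


class RelabelTest(unittest.TestCase):
    @unittest.skipIf(
        version_tuple(PANDAS_VERSION) < (0, 25, 0),
        "named aggregation requires pandas >= 0.25.0",
    )
    def test_aggregate_relabel(self):
        # The real assertions comparing Koalas and pandas output would go here.
        self.assertTrue(True)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RelabelTest)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped))
```

Comparing parsed tuples rather than raw strings is a deliberate choice in this sketch: as strings, `"0.9.0" < "0.25.0"` is false because the comparison is character by character, so a plain string check can misorder versions.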

ueshin · 2019-10-08T15:39:38Z

databricks/koalas/tests/test_groupby.py

+
+    @staticmethod
+    def test_is_multi_agg_with_relabel():
+        from databricks.koalas.groupby import _is_multi_agg_with_relabel


Shall we move this to the file header?
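
For context, a predicate like the one under test might look as follows. This is a guess based only on the helper's name and on the `name=(column, aggfunc)` keyword form used in this PR; the body of the real `_is_multi_agg_with_relabel` in `databricks/koalas/groupby.py` is not shown in this thread and may differ.

```python
def is_multi_agg_with_relabel(**kwargs):
    """Return True when every keyword maps to a (column, aggfunc) 2-tuple.

    Hypothetical sketch, not the Koalas implementation.
    """
    if not kwargs:
        # No keywords means there is nothing to relabel.
        return False
    return all(
        isinstance(value, tuple) and len(value) == 2 for value in kwargs.values()
    )


print(is_multi_agg_with_relabel(a_max=("A", "max")))
print(is_multi_agg_with_relabel(a="max"))
```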

ueshin

Otherwise, LGTM.

softagram-bot · 2019-10-08T17:57:22Z

Softagram Impact Report for pull/904 (head commit: `224dc30`)


ueshin · 2019-10-09T11:36:46Z

Hmm, I found a problem with multi-index columns:

>>> kdf = ks.DataFrame({"group": ['a', 'a', 'b', 'b'], "A": [0, 1, 2, 3], "B": [5, 6, 7, 8]})
>>> kdf.columns = pd.MultiIndex.from_tuples([('x', 'group'), ('y', 'A'), ('y', 'B')])
>>> kdf
      x  y
  group  A  B
0     a  0  5
1     a  1  6
2     b  2  7
3     b  3  8
>>> kdf.groupby(('x', 'group')).agg(a_max=(('y', 'A'), "max"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/groupby.py", line 179, in aggregate
    kdf = kdf[order]
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7272, in __getitem__
    return self._pd_getitem(key)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7208, in _pd_getitem
    return self.loc[:, key]
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/indexing.py", line 447, in __getitem__
    raise ValueError('All the key level should be the same as column index level.')
ValueError: All the key level should be the same as column index level.

@charlesdong1991 Is it possible to support this case?

charlesdong1991 · 2019-10-09T11:43:01Z

I don't think pandas supports this case, but I haven't checked; I will once I'm back home. Alternatively, we could handle it in a separate PR if you agree. @ueshin

If pandas doesn't support it, I will create a PR to solve it there first and then implement it here.

ueshin · 2019-10-09T12:04:04Z

I'm fine with having it in a separate PR. I'd merge this now.

ueshin · 2019-10-09T12:04:12Z

Thanks! merging.

charlesdong1991 added 2 commits October 7, 2019 22:55

add nested renaming for agg

9ce0829

add docstring and comments

fe89b93

charlesdong1991 added 2 commits October 8, 2019 08:36

fix bug

65637dc

add tests

d39be56

charlesdong1991 added 2 commits October 8, 2019 09:02

fix test

35dbdef

add pd version to skip test

d7acb67

charlesdong1991 marked this pull request as ready for review October 8, 2019 09:39

ueshin reviewed Oct 8, 2019


charlesdong1991 added 2 commits October 8, 2019 16:34

add docstring and doctest

a792440

simplify code

08a6174

ueshin reviewed Oct 8, 2019


charlesdong1991 added 2 commits October 8, 2019 19:46

code change on review

e8ddb16

fix test

224dc30

ueshin merged commit c3c196c into databricks:master Oct 9, 2019
