@charlesdong1991
Contributor

@charlesdong1991 charlesdong1991 commented Oct 7, 2019

This allows Koalas to support nested renaming/selection in groupby aggregations:

import databricks.koalas as ks

kdf = ks.DataFrame({'A': [1, 1, 2, 2],
                    'B': [1, 2, 3, 4],
                    'C': [0.362, 0.227, 1.267, -0.562]}, columns=['A', 'B', 'C'])

[Screenshot of the resulting DataFrame: Screen Shot 2019-10-07 at 11 15 22 PM]

WIP: still need to add some tests and docstrings

@ueshin
Collaborator

ueshin commented Oct 7, 2019

@charlesdong1991 Could you put the result of the example in the description?

@charlesdong1991
Contributor Author

Sorry, I didn't know how to add the resulting DataFrame to the description, so I left it out; it is the same as the one in the test. I added a screenshot instead. @ueshin

@ueshin
Collaborator

ueshin commented Oct 7, 2019

@charlesdong1991 thanks, the screenshot is fine.

@harupy
Contributor

harupy commented Oct 8, 2019

@charlesdong1991
Is this related to #823 ?

@charlesdong1991
Contributor Author

Ahh, yeah, it's kind of a first step toward implementing named aggregation. @harupy

@codecov-io

codecov-io commented Oct 8, 2019

Codecov Report

Merging #904 into master will decrease coverage by 0.01%.
The diff coverage is 96.29%.

@@            Coverage Diff             @@
##           master     #904      +/-   ##
==========================================
- Coverage    94.3%   94.28%   -0.02%     
==========================================
  Files          34       34              
  Lines        6213     6244      +31     
==========================================
+ Hits         5859     5887      +28     
- Misses        354      357       +3
Impacted Files                          Coverage Δ
databricks/koalas/groupby.py            91.13% <96.29%> (+0.29%) ⬆️
databricks/koalas/series.py             95.29% <0%> (-0.28%) ⬇️
databricks/koalas/missing/series.py     100% <0%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9e38e54...224dc30.

@charlesdong1991 charlesdong1991 marked this pull request as ready for review October 8, 2019 09:39
Collaborator

@ueshin ueshin left a comment

Could you update the docstring? Doctests would be great, too.

order = []
columns, pairs = list(zip(*kwargs.items()))

for name, (column, aggfunc) in zip(columns, pairs):

Collaborator

Seems like name is not used?

Contributor Author

@charlesdong1991 charlesdong1991 Oct 8, 2019

Ahh, yeah, weird that my editor didn't complain 😅

Contributor

@charlesdong1991
I think your code can be simplified as follows:

for column, aggfunc in pairs:
    if column in aggspec:
        aggspec[column].append(aggfunc)
    else:
        aggspec[column] = [aggfunc]

return aggspec, columns, pairs

Collaborator

Yeah, it seems there is no need to zip columns.

Contributor Author

Thanks, this is my bad; I shouldn't have been lazy about this. @harupy
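
For readers following the thread, the simplification suggested above can be sketched as a standalone function. This is a minimal sketch, not the code merged in this PR: the name `build_agg_spec`, the `OrderedDict` container, and the use of `setdefault` in place of the if/else are assumptions for illustration. The keyword shape `output_name=(column, aggfunc)` follows the named-aggregation call style used elsewhere in this conversation.

```python
from collections import OrderedDict


def build_agg_spec(**kwargs):
    """Group named-aggregation kwargs by source column.

    Hypothetical helper: each keyword maps an output name to a
    (column, aggfunc) pair, e.g. b_max=("B", "max").
    """
    if not kwargs:
        raise ValueError("at least one name=(column, aggfunc) pair is required")

    # Output names and their (column, aggfunc) pairs, in call order.
    columns, pairs = zip(*kwargs.items())

    aggspec = OrderedDict()
    for column, aggfunc in pairs:
        # Collect every aggregation requested for the same column.
        aggspec.setdefault(column, []).append(aggfunc)

    return aggspec, columns, pairs


aggspec, columns, pairs = build_agg_spec(
    b_max=("B", "max"), b_min=("B", "min"), c_sum=("C", "sum")
)
print(dict(aggspec))  # {'B': ['max', 'min'], 'C': ['sum']}
print(columns)        # ('b_max', 'b_min', 'c_sum')
```

Iterating over `pairs` alone is enough to build `aggspec`; the output names in `columns` are only needed later, when the aggregated result is relabeled.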

1 1 2 0.227 0.362
2 3 4 -0.562 1.267
To control the output names with different aggregations per column, koalas

Collaborator

koalas -> Koalas


# this is only applied in version after 0.25.0
if pd.__version__ < "0.25.0":
    return

Collaborator

Could you use @unittest.skipIf( ... ) instead?

Contributor Author

nice!
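
The `@unittest.skipIf` suggestion above might look roughly like the sketch below. It is an illustration under stated assumptions, not the test that was merged: `LIBRARY_VERSION` is a stand-in for `pd.__version__` so the snippet runs without pandas, and `version_tuple` and `NamedAggregationTest` are hypothetical names. Versions are parsed into integer tuples because comparing version strings directly is lexicographic and can give the wrong answer.

```python
import unittest

# Stand-in for pd.__version__ so the sketch runs without pandas installed.
LIBRARY_VERSION = "0.25.1"


def version_tuple(version):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


class NamedAggregationTest(unittest.TestCase):
    @unittest.skipIf(
        version_tuple(LIBRARY_VERSION) < (0, 25, 0),
        "named aggregation requires pandas >= 0.25.0",
    )
    def test_agg_with_relabel(self):
        # Placeholder body; a real test would compare Koalas with pandas.
        self.assertTrue(True)


# Plain string comparison orders versions lexicographically:
print("0.9.0" < "0.25.0")                                # False
print(version_tuple("0.9.0") < version_tuple("0.25.0"))  # True
```

With the decorator, a skipped test is reported as skipped by the test runner, whereas an early `return` makes it look like a pass.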


@staticmethod
def test_is_multi_agg_with_relabel():
    from databricks.koalas.groupby import _is_multi_agg_with_relabel

Collaborator

Shall we move this import to the top of the file?
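
For context, a check like the `_is_multi_agg_with_relabel` helper being tested here could look like the sketch below. This is an assumption about its behavior, inferred from the helper's name and from the `name=(column, aggfunc)` call style used in this PR; it is not a copy of the merged code, and the public-looking name `is_multi_agg_with_relabel` is hypothetical.

```python
def is_multi_agg_with_relabel(**kwargs):
    """Return True if every keyword is a (column, aggfunc) 2-tuple.

    Hypothetical stand-in for the helper tested in this PR.
    """
    if not kwargs:
        return False
    return all(isinstance(v, tuple) and len(v) == 2 for v in kwargs.values())


print(is_multi_agg_with_relabel(a="max"))                                 # False
print(is_multi_agg_with_relabel(a_max=("a", "max"), a_min=("a", "min")))  # True
print(is_multi_agg_with_relabel())                                        # False
```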

Collaborator

@ueshin ueshin left a comment

Otherwise, LGTM.

@softagram-bot

Softagram Impact Report for pull/904 (head commit: 224dc30)

@ueshin
Collaborator

ueshin commented Oct 9, 2019

Hmm, I found a problem with multi-index columns:

>>> kdf = ks.DataFrame({"group": ['a', 'a', 'b', 'b'], "A": [0, 1, 2, 3], "B": [5, 6, 7, 8]})
>>> kdf.columns = pd.MultiIndex.from_tuples([('x', 'group'), ('y', 'A'), ('y', 'B')])
>>> kdf
      x  y
  group  A  B
0     a  0  5
1     a  1  6
2     b  2  7
3     b  3  8
>>> kdf.groupby(('x', 'group')).agg(a_max=(('y', 'A'), "max"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/groupby.py", line 179, in aggregate
    kdf = kdf[order]
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7272, in __getitem__
    return self._pd_getitem(key)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/frame.py", line 7208, in _pd_getitem
    return self.loc[:, key]
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/indexing.py", line 447, in __getitem__
    raise ValueError('All the key level should be the same as column index level.')
ValueError: All the key level should be the same as column index level.

@charlesdong1991 Is it possible to support this case?

@charlesdong1991
Contributor Author

charlesdong1991 commented Oct 9, 2019

I don't think pandas supports this case. I haven't checked yet; I'll check once I'm back home. Or we could handle it in a separate PR if you agree. @ueshin

If pandas doesn't support it, I'll create a PR to fix it there first and then implement it here.

@ueshin
Collaborator

ueshin commented Oct 9, 2019

I'm fine with having it in a separate PR. I'd merge this now.

@ueshin
Collaborator

ueshin commented Oct 9, 2019

Thanks! merging.

@ueshin ueshin merged commit c3c196c into databricks:master Oct 9, 2019