Skip to content

Conversation

@ueshin
Copy link
Collaborator

@ueshin ueshin commented May 12, 2020

Fixing GroupBy.apply, filter, and head to ignore temp columns when ops from different DataFrames, otherwise the result includes the temp columns.

>>> kdf = ks.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]})
>>> kkey = ks.Series([1, 1, 2, 3, 5, 8])
>>> kdf.groupby(kkey).apply(lambda x: x + x.min())
    a   b   c  __tmp_groupkey_0__
0   2   2   2                   2
5  12  16  72                  16
1   3   2   5                   2
3   8   6  32                   6
2   6   4  18                   4
4  10  10  50                  10

@ueshin ueshin requested a review from HyukjinKwon May 12, 2020 18:30
Comment on lines +1708 to +1712
agg_columns = [
kdf._kser_for(label)
for label in kdf._internal.column_labels
if label not in self._ignore_column_labels
]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

head has another issue. I'll submit a following PR and add tests there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I submitted #1490.

self._groupkeys = by
self._groupkeys_scols = [s.spark_column for s in self._groupkeys]
self._as_index = as_index
self._ignore_column_labels = ignore_column_labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no big deal but what about renaming it column_labels_to_exlcude or tmp_column_labels?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename it in #1491.

@HyukjinKwon
Copy link
Member

I will merge this to unblock but I think we should have a separate class that contained grouping related columns .. it's getting very, very complicated to read ..

@HyukjinKwon HyukjinKwon merged commit 6c72cc4 into databricks:master May 13, 2020
@ueshin ueshin deleted the groupby_opts_on_diff_frames branch May 13, 2020 07:03
@HyukjinKwon
Copy link
Member

HyukjinKwon commented May 13, 2020

Sorry I wasn't clear. I agree with the structures and change here. I was talking about a different idea and the fact that we have many grouping related attributes that might have to group up separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants