-
Notifications
You must be signed in to change notification settings - Fork 367
Select the series correctly in SeriesGroupBy APIs #1224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| @property | ||
| def _kdf(self) -> DataFrame: | ||
| return self._kser._kdf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix is here.
|
|
||
| def __init__(self, kser: Series, by: List[Series], as_index: bool = True): | ||
| self._kser = kser | ||
| self._groupkeys = by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here.
13c0af0 to
721b171
Compare
Codecov Report
@@ Coverage Diff @@
## master #1224 +/- ##
=========================================
+ Coverage 95.19% 95.2% +<.01%
=========================================
Files 35 35
Lines 7263 7271 +8
=========================================
+ Hits 6914 6922 +8
Misses 349 349
Continue to review full report at Codecov.
|
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should refactor these classes since the base infrastructure is rather old. We should revisit and refine later.
| self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys] | ||
| self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
F.col(s._internal.data_columns[0]) instead of F.col(name_like_string(s.name))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched but realised that some tests fail with that change. Seems like there's some cases when internal column names and series names are different:
>>> (ks.range(10).id + 1)._internal.data_columns
['(id + 1)']
>>> (ks.range(10).id + 1).name
'id'We use:
@property
def _kdf(self) -> DataFrame:
series = [self._kser] + [s for s in self._groupkeys if not s._equals(self._kser)]
return DataFrame(self._kser._kdf._internal.with_new_columns(series))which always aliases via using column_index at with_new_columns. So, it seems guaranteed to have the internal Spark column names same as s.name. I think using name here for this workaround is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. hm, maybe renaming the columns in with_new_columns was not a good idea. I'll fix it later.
databricks/koalas/groupby.py
Outdated
| return self._kser._kdf | ||
| # TODO: if names from _kser and _groupkeys are name, grouping key is just ignored. | ||
| # it can be a problem when both series have the same name but different operations. | ||
| series = [self._kser] + [s for s in self._groupkeys if s.name != self._kser.name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about using s._equals(self._kser) which compares the ._scol._jcs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, seems it fails because the same names exist in the same DataFrame AnalysisException: Reference 'b' is ambiguous, could be: b, b.; ... but yeah I think it's better to fail fast for now rather than returning a wrong result.
| self.assert_eq(kdf.groupby(['b'])['a'].apply(lambda x: x).sort_index(), | ||
| pdf.groupby(['b'])['a'].apply(lambda x: x).sort_index()) | ||
| self.assert_eq(kdf.groupby(['b'])['b'].apply(lambda x: x).sort_index(), | ||
| pdf.groupby(['b'])['b'].apply(lambda x: x).sort_index()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we will not support this case for now, which I think it's fine (?).
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| self._groupkeys_scols = [F.col(name_like_string(s.name)) for s in self._groupkeys] | ||
| self._agg_columns_scols = [F.col(name_like_string(s.name)) for s in self._agg_columns] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I see. hm, maybe renaming the columns in with_new_columns was not a good idea. I'll fix it later.
|
Thanks! merging. |
|
Is this issue fixed? |
|
Yup, this will be available for the next week's release. |
…different values (#1233) A small followup of databricks/koalas#1224 and databricks/koalas#1229 Now, we can cover this case
Can be tested with the codes below:
Basic idea is that, it creates a DataFrame from Series to reuse existing implementations, and resolves the columns by name.