-
Notifications
You must be signed in to change notification settings - Fork 367
Add basic merge functionality to DataFrame #264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #264 +/- ##
==========================================
+ Coverage 92.35% 92.95% +0.59%
==========================================
Files 35 35
Lines 3205 3263 +58
==========================================
+ Hits 2960 3033 +73
+ Misses 245 230 -15
Continue to review full report at Codecov.
|
| # Assert full outer join also works with 'full' keyword | ||
| res = left_kdf.merge(right_kdf, how='full') | ||
| # FIXME Replace None with np.nan once #263 is solved | ||
| self.assert_eq(res, pd.DataFrame({'A': [1, 2, np.nan], 'B': [None, 'x', 'y']})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add tests to use suffixes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done with 1407922.
|
|
||
| # Assert inner join | ||
| res = left_kdf.merge(right_kdf) | ||
| self.assert_eq(res, pd.DataFrame({'A': [2], 'B': ['x']})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior following pandas'?
I might miss something, but I got an error from pandas:
>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> left_pdf = pd.DataFrame({'A': [1, 2]})
>>> right_pdf = pd.DataFrame({'B': ['x', 'y']}, index=[1, 2])
>>> left_pdf.merge(right_pdf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../pandas/core/frame.py", line 6868, in merge
copy=copy, indicator=indicator, validate=validate)
File "/.../pandas/core/reshape/merge.py", line 47, in merge
validate=validate)
File "/.../pandas/core/reshape/merge.py", line 524, in __init__
self._validate_specification()
File "/.../pandas/core/reshape/merge.py", line 1033, in _validate_specification
lidx=self.left_index, ridx=self.right_index))
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Btw, I think basically we should use exactly the same pandas code as the counterpart of the assert_eq().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. Without specifying any columns to join on, the current implementation defaults to the pandas equivalent of left_pdf.merge(right_pdf, left_index=True, right_index=True). Would you prefer the addition of those parameters and the respective MergeError to follow the behavior of pandas?
Regarding the tests, I totally agree with you. But because of #263, it currently fails as some data types mismatch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need to add the parameters here, but at least we should respect the default value of pandas, e.g., assuming left_index=False, and right_index=False.
As for the MergeError, hmm, interesting.
I remember I added SparkPandasIndexingError for pandas IndexingError before. Maybe we can add SparkPandasMergeError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now added the SparkPandasMergeError with abfe0c9.
databricks/koalas/frame.py
Outdated
| >>> left_kdf = ks.DataFrame({'A': [1, 2]}) | ||
| >>> right_kdf = ks.DataFrame({'B': ['x', 'y']}, index=[1, 2]) | ||
| >>> left_kdf.merge(right_kdf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case what is the join key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is outdated. Now it should be
left_kdf.merge(right_kdf, left_index=True, right_index=True)to join on the indices.
databricks/koalas/frame.py
Outdated
| raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set") | ||
|
|
||
| if how == 'full': | ||
| print("Warning: While Koalas will accept 'full', you should use 'outer' instead to", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use warnings package like warnings.warn("...", UserWarning)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've changed it to warnings.warn(...) with 4e7c429.
|
Looks good otherwise. Let me leave the final check to @ueshin . |
ueshin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a couple of nits, otherwise LGTM.
databricks/koalas/frame.py
Outdated
| raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set") | ||
|
|
||
| if how == 'full': | ||
| print("Warning: While Koalas will accept 'full', you should use 'outer' instead to", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need one more space at the end of the string: instead to "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When comma-separating strings in a print statement, Python will automatically add spaces in between. However, since 4e7c429 moves from print() to warnings.warn(), this has been taken care of.
databricks/koalas/frame.py
Outdated
| how = 'full' | ||
| if how not in ('inner', 'left', 'right', 'full'): | ||
| raise ValueError("The 'how' parameter has to be amongst the following values: ", | ||
| "['inner', 'left', 'right', 'full', 'outer']") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should include 'full' in the message since it is optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough. I've removed the full option with 0ad39b8.
|
Thanks! merging to master. |
| raise SparkPandasMergeError("At least 'on' or 'left_index' and 'right_index' have ", | ||
| "to be set") | ||
| if on is not None and (left_index or right_index): | ||
| raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@floscha should we just raise ValueError rather than introducing a new exception type just for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main reason for introducing a new SparkPandasMergeError was the fact that pandas also has a dedicated MergeError instead of using the generic ValueError. That being said, I personally wouldn't mind using a ValueError in this case either 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Can you submit a PR to just remove this? I find it odd to have a special Error type just for this check .... since it's a different error from pandas' anyway, it's not going to make a difference from the api compatibility point of view.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I've just opened #323 to remove it.
This PR adds the basic merge functionality from pandas to Koalas DataFrames as requested in #232. While the PR does not cover 100% of pandas' functionality, it covers the basic use cases to be useful enough to start with.
Resolves #232