Skip to content

Conversation

@floscha
Copy link
Contributor

@floscha floscha commented May 8, 2019

This PR adds the basic merge functionality from pandas to Koalas DataFrames as requested in #232. While the PR does not cover 100% of pandas' functionality, it covers the basic use cases to be useful enough to start with.

Resolves #232

@codecov-io
Copy link

codecov-io commented May 8, 2019

Codecov Report

Merging #264 into master will increase coverage by 0.59%.
The diff coverage is 97.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   92.35%   92.95%   +0.59%     
==========================================
  Files          35       35              
  Lines        3205     3263      +58     
==========================================
+ Hits         2960     3033      +73     
+ Misses        245      230      -15
Impacted Files Coverage Δ
databricks/koalas/missing/frame.py 100% <ø> (ø) ⬆️
databricks/koalas/exceptions.py 79.41% <100%> (+1.28%) ⬆️
databricks/koalas/tests/test_dataframe.py 100% <100%> (ø) ⬆️
databricks/koalas/frame.py 94.76% <92.3%> (+2.32%) ⬆️
databricks/koalas/dask/utils.py 66.66% <0%> (-6.67%) ⬇️
databricks/koalas/generic.py 94.93% <0%> (-0.07%) ⬇️
databricks/koalas/namespace.py 90.34% <0%> (ø) ⬆️
databricks/koalas/series.py 91.72% <0%> (+0.44%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a888dfd...0ad39b8. Read the comment docs.

# Assert full outer join also works with 'full' keyword
res = left_kdf.merge(right_kdf, how='full')
# FIXME Replace None with np.nan once #263 is solved
self.assert_eq(res, pd.DataFrame({'A': [1, 2, np.nan], 'B': [None, 'x', 'y']}))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests to use suffixes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with 1407922.


# Assert inner join
res = left_kdf.merge(right_kdf)
self.assert_eq(res, pd.DataFrame({'A': [2], 'B': ['x']}))
Copy link
Collaborator

@ueshin ueshin May 8, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this behavior following pandas'?
I might miss something, but I got an error from pandas:

>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> left_pdf = pd.DataFrame({'A': [1, 2]})
>>> right_pdf = pd.DataFrame({'B': ['x', 'y']}, index=[1, 2])
>>> left_pdf.merge(right_pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../pandas/core/frame.py", line 6868, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/.../pandas/core/reshape/merge.py", line 47, in merge
    validate=validate)
  File "/.../pandas/core/reshape/merge.py", line 524, in __init__
    self._validate_specification()
  File "/.../pandas/core/reshape/merge.py", line 1033, in _validate_specification
    lidx=self.left_index, ridx=self.right_index))
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Btw, I think basically we should use exactly the same pandas code as the counterpart of the assert_eq().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. Without specifying any columns to join on, the current implementation defaults to the pandas equivalent of left_pdf.merge(right_pdf, left_index=True, right_index=True). Would you prefer the addition of those parameters and the respective MergeError to follow the behavior of pandas?

Regarding the tests, I totally agree with you. But because of #263, it currently fails as some data types mismatch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to add the parameters here, but at least we should respect the default value of pandas, e.g., assuming left_index=False, and right_index=False.

As for the MergeError, hmm, interesting.
I remember I added SparkPandasIndexingError for pandas IndexingError before. Maybe we can add SparkPandasMergeError.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now added the SparkPandasMergeError with abfe0c9.

>>> left_kdf = ks.DataFrame({'A': [1, 2]})
>>> right_kdf = ks.DataFrame({'B': ['x', 'y']}, index=[1, 2])
>>> left_kdf.merge(right_kdf)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case what is the join key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is outdated. Now it should be

left_kdf.merge(right_kdf, left_index=True, right_index=True)

to join on the indices.

raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")

if how == 'full':
print("Warning: While Koalas will accept 'full', you should use 'outer' instead to",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use warnings package like warnings.warn("...", UserWarning)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've changed it to warnings.warn(...) with 4e7c429.

@HyukjinKwon
Copy link
Member

Looks good otherwise. Let me leave the final check to @ueshin .

Copy link
Collaborator

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a couple of nits, otherwise LGTM.

raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")

if how == 'full':
print("Warning: While Koalas will accept 'full', you should use 'outer' instead to",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need one more space at the end of the string: instead to "

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When comma-separating strings in a print statement, Python will automatically add spaces in between. However, since 4e7c429 moves from print() to warnings.warn(), this has been taken care of.

how = 'full'
if how not in ('inner', 'left', 'right', 'full'):
raise ValueError("The 'how' parameter has to be amongst the following values: ",
"['inner', 'left', 'right', 'full', 'outer']")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should include 'full' in the message since it is optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I've removed the full option with 0ad39b8.

@ueshin
Copy link
Collaborator

ueshin commented May 9, 2019

Thanks! merging to master.

@ueshin ueshin merged commit 6fbefd8 into databricks:master May 9, 2019
raise SparkPandasMergeError("At least 'on' or 'left_index' and 'right_index' have ",
"to be set")
if on is not None and (left_index or right_index):
raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@floscha should we just raise ValueError rather than introducing a new exception type just for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the main reason for introducing a new SparkPandasMergeError was the fact that pandas also has a dedicated MergeError instead of using the generic ValueError. That being said, I personally wouldn't mind using a ValueError in this case either 😉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Can you submit a PR to just remove this? I find it odd to have a special Error type just for this check .... since it's a different error from pandas' anyway, it's not going to make a difference from the api compatibility point of view.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've just opened #323 to remove it.

@floscha floscha deleted the dataframe-merge branch May 27, 2019 08:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

koalas.merge

5 participants