Add basic merge functionality to DataFrame #264

floscha · 2019-05-08T10:30:12Z

This PR adds basic merge functionality from pandas to Koalas DataFrames, as requested in #232. While the PR does not cover 100% of pandas' functionality, it covers enough of the basic use cases to be a useful starting point.

Resolves #232

databricks/koalas/frame.py

codecov-io · 2019-05-08T11:42:58Z

Codecov Report

Merging #264 into master will increase coverage by 0.59%.
The diff coverage is 97.95%.

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   92.35%   92.95%   +0.59%     
==========================================
  Files          35       35              
  Lines        3205     3263      +58     
==========================================
+ Hits         2960     3033      +73     
+ Misses        245      230      -15

Impacted Files	Coverage Δ
databricks/koalas/missing/frame.py	`100% <ø> (ø)`	⬆️
databricks/koalas/exceptions.py	`79.41% <100%> (+1.28%)`	⬆️
databricks/koalas/tests/test_dataframe.py	`100% <100%> (ø)`	⬆️
databricks/koalas/frame.py	`94.76% <92.3%> (+2.32%)`	⬆️
databricks/koalas/dask/utils.py	`66.66% <0%> (-6.67%)`	⬇️
databricks/koalas/generic.py	`94.93% <0%> (-0.07%)`	⬇️
databricks/koalas/namespace.py	`90.34% <0%> (ø)`	⬆️
databricks/koalas/series.py	`91.72% <0%> (+0.44%)`	⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a888dfd...0ad39b8.

ueshin · 2019-05-08T12:29:23Z

databricks/koalas/tests/test_dataframe.py

+        # Assert full outer join also works with 'full' keyword
+        res = left_kdf.merge(right_kdf, how='full')
+        # FIXME Replace None with np.nan once #263 is solved
+        self.assert_eq(res, pd.DataFrame({'A': [1, 2, np.nan], 'B': [None, 'x', 'y']}))


Could you add tests that use suffixes?

Done with 1407922.

ueshin · 2019-05-08T12:33:19Z

databricks/koalas/tests/test_dataframe.py

+
+        # Assert inner join
+        res = left_kdf.merge(right_kdf)
+        self.assert_eq(res, pd.DataFrame({'A': [2], 'B': ['x']}))


Does this behavior follow pandas'?
I might be missing something, but I got an error from pandas:

>>> import pandas as pd
>>> pd.__version__
'0.24.2'
>>> left_pdf = pd.DataFrame({'A': [1, 2]})
>>> right_pdf = pd.DataFrame({'B': ['x', 'y']}, index=[1, 2])
>>> left_pdf.merge(right_pdf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../pandas/core/frame.py", line 6868, in merge
    copy=copy, indicator=indicator, validate=validate)
  File "/.../pandas/core/reshape/merge.py", line 47, in merge
    validate=validate)
  File "/.../pandas/core/reshape/merge.py", line 524, in __init__
    self._validate_specification()
  File "/.../pandas/core/reshape/merge.py", line 1033, in _validate_specification
    lidx=self.left_index, ridx=self.right_index))
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

Btw, I think the expected value passed to assert_eq() should basically be produced by running exactly the same code in pandas.

You're right. Without specifying any columns to join on, the current implementation defaults to the pandas equivalent of left_pdf.merge(right_pdf, left_index=True, right_index=True). Would you prefer the addition of those parameters and the respective MergeError to follow the behavior of pandas?

Regarding the tests, I totally agree with you. But because of #263, that approach currently fails due to mismatched data types.

I think we don't need to add the parameters here, but we should at least respect pandas' default values, i.e., assume left_index=False and right_index=False.

As for the MergeError, hmm, interesting.
I remember I added SparkPandasIndexingError for pandas IndexingError before. Maybe we can add SparkPandasMergeError.

I've now added the SparkPandasMergeError with abfe0c9.


rxin · 2019-05-08T17:11:51Z

databricks/koalas/frame.py

+        >>> left_kdf = ks.DataFrame({'A': [1, 2]})
+        >>> right_kdf = ks.DataFrame({'B': ['x', 'y']}, index=[1, 2])
+
+        >>> left_kdf.merge(right_kdf)


in this case what is the join key?

This is outdated. Now it should be

left_kdf.merge(right_kdf, left_index=True, right_index=True)

to join on the indices.

HyukjinKwon · 2019-05-08T23:58:12Z

databricks/koalas/frame.py

+            raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")
+
+        if how == 'full':
+            print("Warning: While Koalas will accept 'full', you should use 'outer' instead to",


Can we use the warnings package, like warnings.warn("...", UserWarning)?

Good point. I've changed it to warnings.warn(...) with 4e7c429.
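
For reference, here is a minimal sketch of the pattern being suggested, not the code from 4e7c429. The helper name normalize_how and the wording of the message are assumptions made for illustration only.

```python
import warnings


def normalize_how(how):
    """Hypothetical helper: accept the 'full' spelling but point callers to 'outer'."""
    if how == 'full':
        # Illustrative message; the real wording in the PR is not reproduced here.
        warnings.warn("While 'full' is accepted, you should use 'outer' instead to "
                      "be compatible with the pandas merge API", UserWarning)
        return 'outer'
    return how


# Unlike print(), a warning can be recorded, filtered, or turned into an error.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = normalize_how('full')
```

The practical gain over print() is that callers and test suites can control the message through the standard warnings filters.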

HyukjinKwon · 2019-05-09T00:02:28Z

Looks good otherwise. Let me leave the final check to @ueshin .

ueshin

I left a couple of nits, otherwise LGTM.

ueshin · 2019-05-09T01:21:04Z

databricks/koalas/frame.py

+            raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")
+
+        if how == 'full':
+            print("Warning: While Koalas will accept 'full', you should use 'outer' instead to",


Need one more space at the end of the string: instead to "

When comma-separating strings in a print statement, Python will automatically add spaces in between. However, since 4e7c429 moves from print() to warnings.warn(), this has been taken care of.
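
The difference between the two behaviors can be checked with a short sketch. The strings below are shortened stand-ins, not the PR's actual message.

```python
import io
from contextlib import redirect_stdout

# print() joins its positional arguments with sep=' ' by default.
buf = io.StringIO()
with redirect_stdout(buf):
    print("use 'outer' instead to", "be compatible")
printed = buf.getvalue()

# Adjacent string literals are concatenated verbatim; nothing is inserted between them.
without_space = ("use 'outer' instead to"
                 "be compatible")
with_space = ("use 'outer' instead to "
              "be compatible")
```

Because the second positional argument of warnings.warn() is the warning category, the message there has to be a single string, so the trailing space inside the first literal becomes necessary once print() is replaced.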

ueshin · 2019-05-09T01:26:17Z

databricks/koalas/frame.py

+            how = 'full'
+        if how not in ('inner', 'left', 'right', 'full'):
+            raise ValueError("The 'how' parameter has to be amongst the following values: ",
+                             "['inner', 'left', 'right', 'full', 'outer']")


I don't think we should include 'full' in the message since it is optional.

Fair enough. I've removed the full option with 0ad39b8.
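
As a rough sketch of the outcome of this thread (not the code from 0ad39b8), a validator can keep accepting 'full' while advertising only the pandas options. The names VALID_HOW and check_how are hypothetical, and the sketch normalizes to the pandas spelling 'outer' purely for illustration.

```python
VALID_HOW = ('inner', 'left', 'right', 'outer')


def check_how(how):
    """Hypothetical validator: 'full' is tolerated as an alias but not advertised."""
    if how == 'full':
        how = 'outer'
    if how not in VALID_HOW:
        # One string built from adjacent literals, so str(exc) reads as one sentence.
        raise ValueError("The 'how' parameter has to be amongst the following values: "
                         "['inner', 'left', 'right', 'outer']")
    return how
```

Building the message as one string, rather than passing two comma-separated strings to ValueError, keeps the exception text from being rendered as a tuple of arguments.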

ueshin · 2019-05-09T08:51:55Z

Thanks! merging to master.

rxin · 2019-05-13T20:44:43Z

databricks/koalas/frame.py

+            raise SparkPandasMergeError("At least 'on' or 'left_index' and 'right_index' have ",
+                                        "to be set")
+        if on is not None and (left_index or right_index):
+            raise SparkPandasMergeError("Only 'on' or 'left_index' and 'right_index' can be set")


@floscha should we just raise ValueError rather than introducing a new exception type just for this?

I think the main reason for introducing a new SparkPandasMergeError was the fact that pandas also has a dedicated MergeError instead of using the generic ValueError. That being said, I personally wouldn't mind using a ValueError in this case either 😉

Got it. Can you submit a PR to just remove this? I find it odd to have a special Error type just for this check .... since it's a different error from pandas' anyway, it's not going to make a difference from the api compatibility point of view.

Sure, I've just opened #323 to remove it.
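
A minimal sketch of what the checks could look like with a plain ValueError follows. This is not the change made in #323: the function name validate_merge_keys is hypothetical, and the first condition is an assumption, since the quoted diff shows only the raise statement for that branch.

```python
def validate_merge_keys(on=None, left_index=False, right_index=False):
    """Hypothetical stand-in for the two checks quoted above, raising ValueError."""
    # Assumed condition: neither 'on' nor both index flags were given.
    if on is None and not (left_index and right_index):
        raise ValueError("At least 'on' or 'left_index' and 'right_index' have to be set")
    # Condition taken from the quoted diff: the two ways of choosing keys are exclusive.
    if on is not None and (left_index or right_index):
        raise ValueError("Only 'on' or 'left_index' and 'right_index' can be set")
```

Callers that previously caught the dedicated exception type would catch ValueError instead.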

Add basic merge functionality to DataFrame

2c05bc1

floscha mentioned this pull request May 8, 2019

Converting NaN objects falsely turns NaN into None #263

Closed

Remove merge from list of missing methods

a0b97d3

HyukjinKwon reviewed May 8, 2019

HyukjinKwon reviewed May 8, 2019

floscha added 4 commits May 8, 2019 12:52

Remove pointless 'pass' call

eca3989

Inline doc to pandas'

5472af1

Adjust first line of doc

dde5b9c

Use 'ks' instead of 'koalas' for doctest

1508281

ueshin reviewed May 8, 2019

floscha added 6 commits May 8, 2019 14:55

Add unit test for 'suffixes' parameter

1407922

Use OrderedDict to also assure column order with Python 3.5

65ce286

Add and apply SparkPandasMergeError

abfe0c9

Explicitly set columns to also assure their correct order with Python 3.5

df85c35

Fix linter issue by removing whitespace before ':'

e2e835d

Use indices for joining in doctests

104d1ac

rxin reviewed May 8, 2019

HyukjinKwon reviewed May 8, 2019

ueshin reviewed May 9, 2019

floscha added 2 commits May 9, 2019 08:25

Use warnings instead of print

4e7c429

Remove 'full' from printed merge options

0ad39b8

ueshin approved these changes May 9, 2019

ueshin merged commit 6fbefd8 into databricks:master May 9, 2019

rxin reviewed May 13, 2019

floscha mentioned this pull request May 14, 2019

Remove SparkPandasMergeError #323

Merged

floscha mentioned this pull request May 27, 2019

Add DataFrame.merge to frame.rst #387

Merged

floscha deleted the dataframe-merge branch May 27, 2019 08:24
