Manage broadcast hint for dataframes. #1360

LucasG0 · 2020-03-20T13:08:59Z

Manage broadcast hint for dataframes.
Provides a broadcast function which, given a Koalas DataFrame, returns a new one carrying a broadcast hint.
Broadcast joins can now be performed in DataFrame.join, DataFrame.merge and DataFrame.update.
A broadcast join may be more efficient than a sort-merge join (Spark's default) between a small DataFrame and a large DataFrame. It gives every node a copy of the small DataFrame, which reduces the number of shuffles between partitions. By default, Spark performs it if a DataFrame is smaller than ~10MB, but the user should be able to force it.
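
To make the mechanism in the description concrete, here is a minimal plain-Python sketch of the idea behind a broadcast hash join. It is a toy model, not Spark's or Koalas' implementation: the function name `broadcast_hash_join`, the list-of-lists "partitions" and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key_large, key_small):
    """Toy inner join: every partition of the large table sees the whole small table."""
    # Build the lookup table once. In a cluster, this is the copy that would be
    # shipped to every node (hypothetical model, not Spark's actual code path).
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key_small]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Rows of the large table stay in their partition: no shuffle is needed.
        joined = [
            {**left, **right}
            for left in partition
            for right in lookup.get(left[key_large], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Two "partitions" of a large table and one small table (made-up sample data).
large = [
    [{"lkey": "foo", "v": 1}, {"lkey": "bar", "v": 2}],
    [{"lkey": "baz", "v": 3}],
]
small = [{"rkey": "foo", "w": 5}, {"rkey": "baz", "w": 7}]

result = broadcast_hash_join(large, small, "lkey", "rkey")
```

The partition layout of the large table is preserved in `result`, which is the point of the technique: only the small side is copied around.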

codecov-io · 2020-03-20T13:32:34Z

Codecov Report

Merging #1360 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1360      +/-   ##
==========================================
+ Coverage   95.23%   95.26%   +0.02%     
==========================================
  Files          34       34              
  Lines        7742     7785      +43     
==========================================
+ Hits         7373     7416      +43     
  Misses        369      369

Impacted Files	Coverage Δ
databricks/koalas/frame.py	`96.82% <ø> (+0.05%)`	⬆️
databricks/koalas/namespace.py	`88.79% <100.00%> (+0.12%)`	⬆️
databricks/koalas/indexing.py	`94.56% <0.00%> (+0.01%)`	⬆️
databricks/koalas/internal.py	`96.05% <0.00%> (+0.01%)`	⬆️
databricks/koalas/generic.py	`97.54% <0.00%> (+0.40%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4012b6...789bd83. Read the comment docs.

ueshin · 2020-03-20T23:31:33Z

@LucasG0 Thanks for your first contribution!
I really like the idea to have a way to indicate broadcast join.

I guess we should discuss which way to provide it.

1) Adding an argument to indicate broadcast join (the current way here).
- Pros:
  - Users can easily know which function supports broadcast join by its signature.
- Cons:
  - We usually hesitate to add extra arguments without a special reason.
  - We might need another way to indicate that the left side should be broadcast.
    - (We could make broadcast accept both left and right, or something similar, though.)
2) Adding a broadcast method to DataFrame.
- Pros:
  - We don't need to manage each function one-by-one.
  - Users can easily apply it to both left and right sides.
- Cons:
  - Users might not know which function will be affected.
  - We usually hesitate to add extra functions without a special reason.
3) Adding a broadcast function separately, in the same way as PySpark's functions.py.
- Pros:
  - We don't need to manage each function one-by-one.
  - Users can easily apply it to both left and right sides.
- Cons:
  - Users might not know which function will be affected.

cc @HyukjinKwon @itholic

itholic · 2020-03-21T09:45:35Z

@ueshin , Thanks for arranging the discussions! Let me take a look at this tomorrow.

@LucasG0 , I really welcome your first contribution to Koalas. Thanks very much :D

LucasG0 · 2020-03-21T12:49:25Z

Thanks for your answers @ueshin @itholic ! :)

First, I would like to note that only one side needs to be broadcast during a broadcast join.
About the different ways to implement it:

1) Adding an argument to indicate broadcast join (the current way here).

It seems to be the only way to use the existing update method with a broadcast join.
However, I agree that managing functions one by one may be a significant issue.

2) Adding broadcast method to DataFrame.

I think it is a good way. We should choose between small_df.broadcast(large_df) and
large_df.broadcast(small_df). The first seems more intuitive, as the small DataFrame
is the one to be broadcast.
We could call it broadcast_join, to make explicit that the broadcast is used for joining. That would
avoid confusion with broadcast variables in Spark, which are not necessarily used for joins.

3) Adding broadcast function separately, in the same way as PySpark's functions.py.

As with the 2nd way, it could be broadcast_join(small_df, large_df).
However, I find the 2nd way fits better, as we want this functionality on DataFrame. Moreover,
a standalone function seems less explicit for the user about which DataFrame should be broadcast.

So I think the second way is the most interesting. :)

ueshin · 2020-03-21T21:18:40Z

Sorry, I should have described 2) and 3) in more detail.

I meant that the broadcast method or function should return a Koalas DataFrame containing a Spark DataFrame with a broadcast hint.

So the usage I had in mind for 2) and 3) was:

left_df.merge(right_df.broadcast(), ...)

or

left_df.merge(F.broadcast(right_df), ...)

instead of adding broadcast_join.

Then we can reuse the current implementation without any additional work in each join-like function and Spark will handle the broadcast hint properly.
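
The design point here (the hint travels with the frame, so join-like functions need no new argument) can be sketched with a small toy model. Everything below is hypothetical and only mimics the shape of the proposal: `ToyFrame`, its `broadcast_hint` field and the `merge` planner are illustration names, not Koalas or Spark APIs.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the Koalas class."""

    name: str
    broadcast_hint: bool = False


def broadcast(frame):
    """Return a copy of ``frame`` that carries a broadcast hint."""
    if not isinstance(frame, ToyFrame):
        raise ValueError("Invalid type : expected ToyFrame got {}".format(type(frame)))
    # The original frame is left untouched; only the returned copy is marked.
    return replace(frame, broadcast_hint=True)


def merge(left, right):
    """Toy join planner: it has no broadcast argument and only reads the hints."""
    if left.broadcast_hint or right.broadcast_hint:
        return "broadcast join"
    return "sort-merge join"


left_df = ToyFrame("left")
right_df = ToyFrame("right")

plain_plan = merge(left_df, right_df)
hinted_plan = merge(left_df, broadcast(right_df))
```

In this sketch `merge` never changes when hint support is added, which mirrors the "no additional work in each join-like function" argument; either side can be wrapped in `broadcast`.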

LucasG0 · 2020-03-22T12:38:03Z

Alright, these ways are indeed better!
We could use 2), as it specifically targets DataFrame, or 3), to avoid adding extra methods to DataFrame.

itholic · 2020-03-23T01:28:29Z

left_df.merge(right_df.broadcast(), ...)

I also think the way above seems good for now.

(But in the future, it would be better to use something like AQE to make it work automatically?)

HyukjinKwon · 2020-03-23T03:08:55Z

We could even do both 2) and 3) like DataFrame.merge, koalas.merge, DataFrame.melt and koalas.melt.

One nit on 2), though: DataFrame.broadcast is less friendly to users coming from PySpark than koalas.broadcast. But I don't really have a strong preference. Let me leave it to @ueshin.

ueshin · 2020-03-23T18:25:51Z

Then shall we take 3) ?
@LucasG0 Could you update this PR as we discussed? Thanks!

itholic · 2020-03-24T00:37:51Z

Then shall we take 3) ?

Yes I agree. I feel Hyukjin's latest comment makes sense.

LucasG0 · 2020-03-24T22:47:10Z

Let's go for 3) then!
It seems that there is no way to test that a PySpark DataFrame has a broadcast hint.
It is possible to test whether a broadcast join occurred, as PySpark's own test does:
https://github.com/apache/spark/pull/8801/files#diff-7c2fe8530271c0635fb99f7b49e0c4a4R1086.
However, I wonder if it is relevant to reproduce this test in the Koalas tests, so for now I just tested equality between a Koalas DataFrame and the one with the broadcast hint.

ueshin

Could you add a line for the doc in docs/source/reference/general_functions.rst? Around line 28 should be good.
Also, could you add a See Also link in the doc for each function?

ueshin · 2020-03-24T22:49:41Z

databricks/koalas/namespace.py

+
+    if not isinstance(obj, DataFrame):
+        raise ValueError("Invalid type : expected DataFrame got {}".format(type(obj)))
+    return ks.DataFrame(data=spark.functions.broadcast(obj._internal.to_external_spark_frame))


We can use with_new_sdf:

return DataFrame(obj._internal.with_new_sdf(F.broadcast(obj._sdf)))

ueshin · 2020-03-24T22:52:38Z

databricks/koalas/namespace.py

        return pd.to_numeric(arg)


+#


Do you want to add a comment here? Otherwise, please remove the unrelated line.

ueshin · 2020-03-24T22:52:52Z

databricks/koalas/namespace.py

+        ...                    columns=['rkey', 'value'])
+
+        >>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
+    """


Shall we show the result? Or how about explain?

>>> merged.explain()  # doctest: +ELLIPSIS
== Physical Plan ==
...
...Broadcast...
...

merged.explain() seems to behave differently depending on the runtime environment.
That is why I did not spell out the physical plan and removed # doctest: +ELLIPSIS.

Why don't we need ELLIPSIS?

Sorry, I was not familiar with this notation.
I added it back.

nvm, I was just curious why it works without # doctest: +ELLIPSIS.
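
The thread does not say why the example passes without the inline directive. One possible explanation, offered here as an assumption, is that the doctest runner enables ELLIPSIS globally through its option flags. The standard-library sketch below shows the effect of that flag on an `explain`-style output; `explain_stub`, the helper `run` and the plan text are invented for the illustration.

```python
import doctest


def explain_stub():
    """Stand-in for ``merged.explain()``; the plan text printed below is invented.

    >>> explain_stub()
    == Physical Plan ==
    ...Broadcast...
    """
    print("== Physical Plan ==")
    print("*(2) BroadcastHashJoin [lkey#0], [rkey#4], Inner, BuildRight")


def run(optionflags):
    """Run the docstring example above and return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        explain_stub.__doc__, {"explain_stub": explain_stub}, "explain_stub", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    # Swallow the failure report so that only the counts are observed.
    result = runner.run(test, out=lambda text: None)
    return result.failed, result.attempted


with_flag = run(doctest.ELLIPSIS)  # "..." matches any substring
without_flag = run(0)              # "..." is compared literally
```

With the flag set at runner level, the example passes even though the docstring has no `# doctest: +ELLIPSIS` comment; without it, the same example fails.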

ueshin · 2020-03-24T22:54:50Z

databricks/koalas/frame.py

            raise ValueError(
                "columns overlap but no suffix specified: " "{rename}".format(rename=common)
            )
+


Could you revert unrelated changes?

ueshin · 2020-03-24T23:02:59Z

databricks/koalas/namespace.py

+        >>> df1 = ks.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
+        ...                     'value': [1, 2, 3, 5]},
+        ...                    columns=['lkey', 'value'])
+


nit: remove an extra line.

ueshin · 2020-03-24T23:03:04Z

databricks/koalas/namespace.py

+        >>> df2 = ks.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
+        ...                     'value': [5, 6, 7, 8]},
+        ...                    columns=['rkey', 'value'])
+


itholic · 2020-03-25T02:09:42Z

databricks/koalas/namespace.py


+#
+def broadcast(obj):
+    """ Marks a DataFrame as small enough for use in broadcast joins.


nit: Could you move this docstring to next line like other methods?

like:

def broadcast(obj):
    """
    Marks a DataFrame as small enough for use in broadcast joins
    ...
    """

ueshin · 2020-03-26T18:52:12Z

databricks/koalas/namespace.py

+
+    See Also
+    --------
+    DataFrame.merge : Merge DataFrame objects with a database-style join.


Shall we add DataFrame.join and DataFrame.update as well?
Also, could you add a link to DataFrame.broadcast from the docs for each function?

I will add them.
Do you mean ks.broadcast in the "See Also" block of these methods?

Yes. Sorry, ks.broadcast is right, and maybe only broadcast will work in the docstring.

See Also
--------
broadcast : ...

Indeed, ks.broadcast does not work in the docstring.

ueshin · 2020-03-26T19:05:33Z

databricks/koalas/namespace.py

+        ...                    columns=['rkey', 'value'])
+
+        >>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
+    """


nvm, I was just curious why it works without # doctest: +ELLIPSIS.

ueshin

Otherwise, LGTM.

ueshin · 2020-03-27T00:18:53Z

LGTM.
@LucasG0 Could you update the PR description?

itholic · 2020-03-27T00:29:36Z

LGTM, too. 👍

ueshin · 2020-03-27T17:15:03Z

Thanks! merging.

LucasG0 · 2020-03-29T01:30:50Z

Nice, thanks !

ueshin reviewed Mar 24, 2020

itholic reviewed Mar 25, 2020

LucasG0 force-pushed the master branch 2 times, most recently from ba64e4d to e8318cb Compare March 26, 2020 12:39

ueshin reviewed Mar 26, 2020

ueshin approved these changes Mar 26, 2020

Add broadcast function in namespace.py

789bd83

LucasG0 force-pushed the master branch from e8318cb to 789bd83 Compare March 26, 2020 22:13

LucasG0 changed the title ~~Add broadcast join for dataframes.~~ Manage broadcast hint for dataframes. Mar 27, 2020

ueshin merged commit 0366308 into databricks:master Mar 27, 2020
