Manage broadcast hint for dataframes. #1360
Codecov Report
@@            Coverage Diff             @@
##           master    #1360      +/-   ##
==========================================
+ Coverage   95.23%   95.26%   +0.02%
==========================================
  Files          34       34
  Lines        7742     7785      +43
==========================================
+ Hits         7373     7416      +43
  Misses        369      369

@LucasG0 Thanks for your first contribution! I guess we should first discuss which way to go.

Thanks for your answers @ueshin @itholic ! :) First, I would like to note that only one side needs to be broadcast during a broadcast join.
So I think the second way is the more interesting one. :)

Sorry, I have to describe more about 2) and 3). I meant So the usage of 2) and 3) in my mind was like: left_df.merge(right_df.broadcast(), ...) or left_df.merge(F.broadcast(right_df), ...) instead of adding Then we can reuse the current implementation without any additional work in each join-like function, and Spark will handle the broadcast hint properly.

Alright, these ways are indeed better!

I also think the way above seems good for now. (But in the future, it would be better to use something like AQE to make it work automatically?)

We could even do both 2) and 3) like One nit on 2) is though,

Then shall we take 3)?

Yes, I agree. I feel Hyukjin's latest comment makes sense.

Let's go for 3) then!
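
For readers following the thread, the shape of option 3) can be sketched with toy classes. This is only an illustration: ToyFrame, its broadcast_hint field and its merge method are hypothetical stand-ins, not Koalas or Spark classes. The point is that the hint travels with the frame, so join-like functions need no extra parameter.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; not the real Koalas class."""

    rows: Tuple[Tuple[str, int], ...]
    broadcast_hint: bool = False

    def merge(self, right: "ToyFrame") -> str:
        # The join-like method takes no broadcast argument: it only looks at
        # the hint carried by the right-hand frame.
        return "broadcast join" if right.broadcast_hint else "default join"


def broadcast(obj: ToyFrame) -> ToyFrame:
    """Return a new frame carrying the hint; the input is left untouched."""
    if not isinstance(obj, ToyFrame):
        raise ValueError("Invalid type : expected DataFrame got {}".format(type(obj)))
    return replace(obj, broadcast_hint=True)


left_df = ToyFrame(rows=(("foo", 1), ("bar", 2)))
right_df = ToyFrame(rows=(("foo", 5), ("bar", 6)))

with_hint = left_df.merge(broadcast(right_df))
without_hint = left_df.merge(right_df)
```

In the real implementation the hint is attached to the underlying Spark DataFrame and Spark decides how to honor it; the toy merge above only mimics that decision.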
ueshin
left a comment
Could you add a line for a doc in docs/source/reference/general_functions.rst? Around line 28 should be good.
Also, could you add See Also link in the doc for each function?
databricks/koalas/namespace.py
    if not isinstance(obj, DataFrame):
        raise ValueError("Invalid type : expected DataFrame got {}".format(type(obj)))
    return ks.DataFrame(data=spark.functions.broadcast(obj._internal.to_external_spark_frame))
We can use with_new_sdf:
return DataFrame(obj._internal.with_new_sdf(F.broadcast(obj._sdf)))
databricks/koalas/namespace.py
    return pd.to_numeric(arg)


#
Do you want to add a comment here? Otherwise, please remove the unrelated line.
    ... columns=['rkey', 'value'])
    >>> merged = df1.merge(ks.broadcast(df2), left_on='lkey', right_on='rkey')
    """
Shall we show the result? Or how about explain?
>>> merged.explain() # doctest: +ELLIPSIS
== Physical Plan ==
...
...Broadcast...
...
+1
merged.explain() seems to behave differently depending on the runtime environment.
That is why I did not detail the physical plan and removed # doctest: +ELLIPSIS.
Why don't we need ELLIPSIS?
Sorry, I was not familiar with this notation.
I added it back.
nvm, I was just curious why it works without # doctest: +ELLIPSIS.
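
As a side note on the ELLIPSIS question, the behavior can be reproduced with the standard doctest module alone. The sketch below uses fake_explain, a hypothetical stand-in that prints plan-like text (not real Spark output), and runs the same docstring with and without the flag. One plausible explanation for the example passing without the inline directive, which is an assumption not confirmed in this thread, is that the test runner enables ELLIPSIS globally through optionflags.

```python
import doctest


def fake_explain():
    """Hypothetical stand-in for merged.explain(); prints plan-like text."""
    print("== Physical Plan ==")
    print("*(2) BroadcastHashJoin [lkey#1], [rkey#5], Inner, BuildRight")
    print("+- Scan ...")


# The expected output relies on "..." matching arbitrary text.
DOC = """
>>> fake_explain()
== Physical Plan ==
...Broadcast...
"""


def run_demo(optionflags):
    """Run DOC with the given doctest option flags and return the results."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(DOC, {"fake_explain": fake_explain}, "demo", None, 0)
    runner = doctest.DocTestRunner(verbose=False, optionflags=optionflags)
    # Discard the failure report; only the counts are of interest here.
    return runner.run(test, out=lambda text: None)


without_flag = run_demo(0)
with_flag = run_demo(doctest.ELLIPSIS)
```

Here without_flag reports one failure and with_flag reports none, which shows that the flag can come either from an inline directive or from the runner's global options.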
databricks/koalas/frame.py
    raise ValueError(
        "columns overlap but no suffix specified: " "{rename}".format(rename=common)
    )

Could you revert unrelated changes?
databricks/koalas/namespace.py
    >>> df1 = ks.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
    ...                     'value': [1, 2, 3, 5]},
    ...                    columns=['lkey', 'value'])

nit: remove an extra line.
databricks/koalas/namespace.py
    >>> df2 = ks.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
    ...                     'value': [5, 6, 7, 8]},
    ...                    columns=['rkey', 'value'])

ditto.
databricks/koalas/namespace.py
#
def broadcast(obj):
    """ Marks a DataFrame as small enough for use in broadcast joins.
nit: Could you move this docstring to next line like other methods?
like:
def broadcast(obj):
    """
    Marks a DataFrame as small enough for use in broadcast joins
    ...
    """
    See Also
    --------
    DataFrame.merge : Merge DataFrame objects with a database-style join.
Shall we add DataFrame.join and DataFrame.update as well?
Also, could you add a link to DataFrame.broadcast from the docs for each function?
I will add them.
Do you mean ks.broadcast in the "See Also" block of these methods?
Yes. sorry, ks.broadcast is right, and maybe only broadcast should work in the docstring.
See Also
--------
broadcast : ...
Indeed, ks.broadcast does not work in the docstring.
ueshin
left a comment
Otherwise, LGTM.

LGTM.

LGTM, too. 👍

Thanks! merging.

Nice, thanks!

Manage broadcast hint for dataframes.
Provides a broadcast function which, from a given Koalas DataFrame, returns a new one with a broadcast hint.
Broadcast joins can now be performed in DataFrame.join, DataFrame.merge and DataFrame.update.
A broadcast join may be more efficient than a sort-merge join (the Spark default) between a small dataframe and a big dataframe. It gives every node a copy of the small dataframe, which reduces the number of shuffles between partitions. By default, Spark performs it if a dataframe is smaller than ~10MB, but the user should be able to force it.
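
To make the mechanism concrete, here is a pure-Python sketch of the idea behind a broadcast hash join. It is a simplification under stated assumptions, not Spark's implementation: partitions are plain lists, the "broadcast" is a dictionary copy, and broadcast_hash_join is a hypothetical helper name. The data mirrors the df1/df2 example from the docstring.

```python
from collections import defaultdict


def broadcast_hash_join(big_partitions, small_rows):
    """Inner-join every partition of the big side against a copy of the small side."""
    # Build a hash table once from the small side.
    table = defaultdict(list)
    for key, value in small_rows:
        table[key].append(value)

    joined = []
    for partition in big_partitions:
        # Each partition gets its own copy of the table (the "broadcast"),
        # so rows of the big side never move between partitions.
        local_table = dict(table)
        for key, value in partition:
            for other in local_table.get(key, []):
                joined.append((key, value, other))
    return joined


# Big side split into two partitions; small side is a single list of rows.
big = [[("foo", 1), ("bar", 2)], [("baz", 3), ("foo", 5)]]
small = [("foo", 5), ("bar", 6), ("baz", 7), ("foo", 8)]

result = broadcast_hash_join(big, small)
```

A sort-merge join would instead repartition both sides by key before joining, which is the shuffle the broadcast avoids. In Spark, the size threshold mentioned above is controlled by the spark.sql.autoBroadcastJoinThreshold setting.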