Skip to content

Conversation

@ueshin
Copy link
Collaborator

@ueshin ueshin commented May 6, 2020

groupby.apply in pandas<0.25 runs the functions twice for the first group, so if the function has a side effect, it could cause an unexpected behavior.

>>> import pandas as pd
>>> pd.__version__
'0.24.2'

>>> kdf = ks.DataFrame({"d": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0], "v": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
>>> acc = ks.utils.default_session().sparkContext.accumulator(0)
>>> def sum_with_acc_frame(x) -> ks.DataFrame[np.float64, np.float64]:
...   global acc
...   acc += 1
...   return np.sum(x)
...
>>> kdf.groupby("d").apply(sum_with_acc_frame)
    c0    c1
0  3.0   6.0
1  6.0  15.0
>>> acc.value
4

whereas:

>>> import pandas as pd
>>> pd.__version__
'0.25.3'

...

>>> acc.value
2

The acc.value should be 2 since there are only 2 groups and the function should be called twice.

@ueshin ueshin requested a review from HyukjinKwon May 6, 2020 03:11
@HyukjinKwon HyukjinKwon changed the title Make groupby.apply in pandas<2.5 run the function only once. Make groupby.apply in pandas<0.25 run the function only once. May 6, 2020
Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-io
Copy link

codecov-io commented May 6, 2020

Codecov Report

Merging #1462 into master will decrease coverage by 0.13%.
The diff coverage is 6.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1462      +/-   ##
==========================================
- Coverage   93.88%   93.75%   -0.14%     
==========================================
  Files          36       36              
  Lines        8373     8386      +13     
==========================================
+ Hits         7861     7862       +1     
- Misses        512      524      +12     
Impacted Files Coverage Δ
databricks/koalas/groupby.py 89.74% <6.66%> (-2.16%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9b4147...1d8b706. Read the comment docs.

@ueshin ueshin changed the title Make groupby.apply in pandas<0.25 run the function only once. Make groupby.apply in pandas<0.25 run the function only once per group. May 6, 2020
@itholic
Copy link
Contributor

itholic commented May 6, 2020

LGTM. Nice fixing.

@HyukjinKwon HyukjinKwon merged commit c5977b9 into databricks:master May 6, 2020
@ueshin ueshin deleted the groupby_apply branch May 6, 2020 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants