Allow to omit type hint in GroupBy.transform, filter, apply #646

HyukjinKwon · 2019-08-14T09:01:53Z

This PR proposes allowing the type hint to be omitted.

1. If no type hint is given, Koalas collects the data as a pandas DataFrame (pdf).
1.1. If fewer than 1000 records are returned, it returns ks.DataFrame(pdf) directly.
1.2. If more than 1000 records are returned, it only infers the schema from pdf.
2. If a type hint is found, Koalas uses it as the return type directly.
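The decision described above can be sketched in plain pandas. This is only an illustration under stated assumptions, not the code in this PR: `plan_apply`, its `(strategy, payload)` return value, and the example functions are hypothetical names, a pandas DataFrame stands in for the distributed data, and the handling of exactly `limit` records is a guess. The example hint is a pandas type for simplicity.

```python
import inspect

import pandas as pd


def plan_apply(pdf, key, func, limit=1000):
    """Hypothetical helper mirroring the decision described in this PR.

    `pdf` is a pandas DataFrame standing in for the distributed data.
    Returns a (strategy, payload) tuple.
    """
    # A declared return annotation is used as the return type directly.
    return_hint = inspect.getfullargspec(func).annotations.get("return")
    if return_hint is not None:
        return "use_type_hint", return_hint

    # No hint: run func with plain pandas on at most `limit` records.
    sample = pdf.head(limit)
    result = sample.groupby(key).apply(func)

    if len(pdf) < limit:
        # Small input: the pandas result already covers all the data.
        return "shortcut", result

    # Larger input: keep only the schema (dtypes) observed on the sample.
    return "inferred_schema", result.dtypes


pdf = pd.DataFrame(
    {"timestamp": [0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ["A", "A", "A", "B", "B"]}
)


def latest(group) -> pd.DataFrame:  # return type declared by the user
    return pd.DataFrame({"column": [group["timestamp"].max()]})


def latest_no_hint(group):  # no annotation, so the schema must be inferred
    return pd.DataFrame({"column": [group["timestamp"].max()]})


print(plan_apply(pdf, "car_id", latest)[0])
print(plan_apply(pdf, "car_id", latest_no_hint)[0])
print(plan_apply(pdf, "car_id", latest_no_hint, limit=3)[0])
```

Passing a small `limit` in the last call forces the schema-only path on this five-row frame, which makes both untyped branches easy to compare.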

When the schema is inferred without a type hint, the index is kept intact.

When the schema is given via a type hint, the index is lost. This is because pandas sometimes keeps the index and sometimes does not, and Koalas cannot know which without executing the given func once.

Please see the example below:

pdf = pd.DataFrame({"timestamp":[0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ['A','A','A','B','B']})
print(pdf.groupby('car_id').apply(lambda _: pd.DataFrame({"column": [0.0]})))

          column
car_id
A      0     0.0
B      0     0.0

pdf = pd.DataFrame({"timestamp":[0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ['A','A','A','B','B']})
print(pdf.groupby('car_id').apply(lambda x: x))

   timestamp car_id
0        0.0      A
1        0.5      A
2        1.0      A
3        0.0      B
4        0.5      B

Therefore, the index information is lost in this case.
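The difference between the two results can also be checked programmatically by looking at the number of index levels. The sketch below reuses the data from the example; the variable names `reshaped` and `unchanged` are illustrative. Because the default handling of like-indexed results has differed between pandas versions, the second call passes `group_keys=False` explicitly (an assumption made here, not something the PR does) so that it reproduces the flat index shown above.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"timestamp": [0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ["A", "A", "A", "B", "B"]}
)

# Each group returns a brand-new one-row frame, so pandas prepends the
# group key and the result gets a two-level index.
reshaped = pdf.groupby("car_id").apply(lambda _: pd.DataFrame({"column": [0.0]}))

# Each group is returned unchanged; with group_keys=False the original
# flat index is kept and no group key level is added.
unchanged = pdf.groupby("car_id", group_keys=False).apply(lambda x: x)

print(reshaped.index.nlevels, reshaped.index.names[0])
print(unchanged.index.nlevels, list(unchanged.index))
```

The shape of the index depends on what func returns, which is only known after running it.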

Resolves #628
This PR also resolves 2. and 3. at #409 (comment)

HyukjinKwon · 2019-08-14T09:03:08Z

@patryk-oleniuk, with this PR and #633, you will be able to proceed further. I hope this unblocks you.

HyukjinKwon · 2019-08-14T09:04:01Z

databricks/koalas/groupby.py

            pdf = func(pdf)
-            # For now, just positionally map the column names to given schema's.
+
+            if retain_index:


This is mainly just deduplication.

softagram-bot · 2019-08-14T09:05:38Z

Softagram Impact Report for pull/646 (head commit: `78c5ccc`)

codecov-io · 2019-08-14T09:22:35Z

Codecov Report

Merging #646 into master will increase coverage by 0.29%.
The diff coverage is 62.5%.

@@            Coverage Diff             @@
##           master     #646      +/-   ##
==========================================
+ Coverage   92.97%   93.27%   +0.29%     
==========================================
  Files          31       31              
  Lines        5085     5082       -3     
==========================================
+ Hits         4728     4740      +12     
+ Misses        357      342      -15

| Impacted Files | Coverage Δ | |
|---|---|---|
| databricks/koalas/groupby.py | `85.75% <62.5%> (+4.19%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 51c0299...78c5ccc.

HyukjinKwon · 2019-08-14T09:52:19Z

Tests passed. The Codecov result is due to coverage hits being missed on the Python worker side.

HyukjinKwon · 2019-08-14T10:22:03Z

databricks/koalas/groupby.py

+        if should_infer_schema:
+            # Here we execute with the first 1000 to get the return type.
+            # If the records were less than 1000, it uses pandas API directly for a shortcut.
+            limit = 1000


I think this limit should be configurable, but I'll add that in a separate PR. I'll do it after this PR and yours are merged, @ueshin.

I merged my PR for the basic configuration.

The current `GroupBy.apply` uses a shortcut (proposed in #646) when schema inference is triggered and the number of records is less than or equal to `compute.shortcut_limit` (1000 by default), but this might cause issues such as #834. This PR proposes to remove the shortcut and make `GroupBy.apply` only infer the schema. Closes #834

HyukjinKwon requested a review from ueshin August 14, 2019 09:03

Allow to omit type hint in GroupBy.transform, filter, apply

78c5ccc

HyukjinKwon force-pushed the allow-no-type-hint branch from 15efc3c to 78c5ccc August 14, 2019 09:05

ueshin approved these changes Aug 14, 2019

HyukjinKwon merged commit 6569e77 into databricks:master Aug 15, 2019

harupy mentioned this pull request Oct 2, 2019

Remove shortcut from GroupBy.apply #862

Merged

HyukjinKwon deleted the allow-no-type-hint branch November 6, 2019 02:23
