-
Notifications
You must be signed in to change notification settings - Fork 367
Allow to omit type hint in GroupBy.transform, filter, apply #646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@patryk-oleniuk, with this PR and #633 PR, you will be able to proceed further. Hope this unblocks you. |
| pdf = func(pdf) | ||
| # For now, just positionally map the column names to given schema's. | ||
|
|
||
| if retain_index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly only deduplication here.
15efc3c to
78c5ccc
Compare
Softagram Impact Report for pull/646 (head commit: 78c5ccc)⭐ Change Overview
📄 Full report
Give feedback on this report to [email protected] |
Codecov Report
@@ Coverage Diff @@
## master #646 +/- ##
==========================================
+ Coverage 92.97% 93.27% +0.29%
==========================================
Files 31 31
Lines 5085 5082 -3
==========================================
+ Hits 4728 4740 +12
+ Misses 357 342 -15
Continue to review full report at Codecov.
|
|
tests passed. codecov is due to missing coverage hits in Python worker side |
| if should_infer_schema: | ||
| # Here we execute with the first 1000 to get the return type. | ||
| # If the records were less than 1000, it uses pandas API directly for a shortcut. | ||
| limit = 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I should have a configuration for such limit in a separate PR. I'll do it after this and your PR are merged @ueshin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I merged my PR for the basic configuration.
The current `GroupBy.apply` uses a shortcut (proposed in #646) when schema inference is triggered and the number of records is less than or equal to `compute.shortcut_limit` (1000 by default), but this might cause issues such as #834. This PR proposes to remove the shortcut and make `GroupBy.apply` only infer the schema. Closes #834

This PR proposes to allow to omit type hint.
1.1. If it returns less than 1000, it returns
ks.DataFrame(pdf).1.2. if it returns more than 1000, it just infers schema from
pdf.When the schema is inferred without type hint, index is cleanly kept.
When the schema is given via type hint, index is lost - this is because pandas sometimes keeps the index and sometimes not. Koalas does not know without executing the given
funconce.Please see the example below:
Therefore, index information is lost in this case
Resolves #628
This PR also resolves 2. and 3. at #409 (comment)