@HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Aug 14, 2019

This PR proposes allowing the return type hint to be omitted in `GroupBy.apply`.

  1. If no type hint is given, Koalas collects data as a pandas DataFrame (pdf).
    1.1. If fewer than 1000 records are returned, it returns ks.DataFrame(pdf) directly.
    1.2. If more than 1000 records are returned, it only infers the schema from pdf.
  2. If a type hint is found, Koalas uses it as the return type directly.
  • When the schema is inferred without a type hint, the index is kept cleanly.

  • When the schema is given via a type hint, the index is lost. This is because pandas sometimes keeps the index and sometimes does not, and Koalas cannot know which without executing the given func once.

    Please see the example below:

    import pandas as pd

    pdf = pd.DataFrame({"timestamp":[0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ['A','A','A','B','B']})
    print(pdf.groupby('car_id').apply(lambda _: pd.DataFrame({"column": [0.0]})))
              column
    car_id
    A      0     0.0
    B      0     0.0
    
    pdf = pd.DataFrame({"timestamp":[0.0, 0.5, 1.0, 0.0, 0.5], "car_id": ['A','A','A','B','B']})
    print(pdf.groupby('car_id').apply(lambda x: x))
       timestamp car_id
    0        0.0      A
    1        0.5      A
    2        1.0      A
    3        0.0      B
    4        0.5      B
    

    Therefore, index information is lost when a type hint is given.
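
The decision between the two paths described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: `has_return_hint`, `infer_schema`, `resolve_schema`, and the list-of-dicts `Row` representation are hypothetical names chosen for the sketch, and real schema inference in Koalas works on a pandas DataFrame rather than on dictionaries.

```python
import inspect
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]  # hypothetical stand-in for one record of a DataFrame


def has_return_hint(func: Callable) -> bool:
    """True if func declares a return annotation."""
    return inspect.signature(func).return_annotation is not inspect.Signature.empty


def infer_schema(rows: List[Row]) -> Dict[str, type]:
    """Infer a column -> type mapping from the first row of a sample result."""
    if not rows:
        return {}
    return {name: type(value) for name, value in rows[0].items()}


def resolve_schema(func: Callable, sample: List[Row]) -> Tuple[str, object]:
    """Use the declared return type if present; otherwise run func on a sample."""
    if has_return_hint(func):
        return "type_hint", inspect.signature(func).return_annotation
    # No hint: func must be executed once on sampled data to learn its output shape.
    return "inferred", infer_schema(func(sample))


def without_hint(rows):
    # Hypothetical user function with no return annotation.
    return [{"column": 0.0} for _ in rows]


print(resolve_schema(without_hint, [{"timestamp": 0.0, "car_id": "A"}]))
```

The sketch only covers how the return schema is chosen. It does not model the index handling discussed above, which depends on what pandas itself returns for the given func.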

Resolves #628
This PR also resolves items 2 and 3 in #409 (comment).

@HyukjinKwon
Member Author

@patryk-oleniuk, with this PR and #633, you will be able to proceed further. I hope this unblocks you.

@HyukjinKwon HyukjinKwon requested a review from ueshin August 14, 2019 09:03
pdf = func(pdf)
# For now, just positionally map the column names to given schema's.

if retain_index:
Member Author

This change is mainly deduplication.

@softagram-bot

Softagram Impact Report for pull/646 (head commit: 78c5ccc)

@codecov-io

codecov-io commented Aug 14, 2019

Codecov Report

Merging #646 into master will increase coverage by 0.29%.
The diff coverage is 62.5%.

@@            Coverage Diff             @@
##           master     #646      +/-   ##
==========================================
+ Coverage   92.97%   93.27%   +0.29%     
==========================================
  Files          31       31              
  Lines        5085     5082       -3     
==========================================
+ Hits         4728     4740      +12     
+ Misses        357      342      -15

Impacted Files                  Coverage Δ
databricks/koalas/groupby.py    85.75% <62.5%> (+4.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@HyukjinKwon
Member Author

Tests passed. The Codecov result is due to coverage hits not being recorded on the Python worker side.

if should_infer_schema:
    # Here we execute with the first 1000 to get the return type.
    # If the records were less than 1000, it uses pandas API directly for a shortcut.
    limit = 1000
Member Author

I think this limit should be configurable. I'll add a configuration for it in a separate PR after this PR and yours are merged, @ueshin.

Collaborator

I merged my PR for the basic configuration.

@HyukjinKwon HyukjinKwon merged commit 6569e77 into databricks:master Aug 15, 2019
HyukjinKwon pushed a commit that referenced this pull request Oct 3, 2019
The current `GroupBy.apply` uses a shortcut (proposed in #646) when schema inference is triggered and the number of records is less than or equal to `compute.shortcut_limit` (1000 by default), but this might cause issues such as #834.

This PR proposes to remove the shortcut and make `GroupBy.apply` only infer the schema.

Closes #834
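
The shortcut condition described in the commit message above ("less than or equal to `compute.shortcut_limit`") can be illustrated with a small sketch. It is written under assumptions and is not the Koalas implementation: `sample_within_limit` and `SHORTCUT_LIMIT` are hypothetical names, and fetching one record beyond the limit is just one possible way to tell "fits within the limit" apart from "has more data" without counting everything.

```python
from itertools import islice
from typing import Iterable, List, Tuple

# Mirrors the default of 1000 stated for compute.shortcut_limit (hypothetical constant).
SHORTCUT_LIMIT = 1000


def sample_within_limit(records: Iterable, limit: int = SHORTCUT_LIMIT) -> Tuple[List, bool]:
    """Return at most `limit` records and whether the whole input fits within `limit`."""
    # Read one extra record so that "exactly limit" and "more than limit" can be told apart.
    head = list(islice(records, limit + 1))
    fits = len(head) <= limit
    return head[:limit], fits


sample, fits = sample_within_limit(range(2500))
if fits:
    print("shortcut: the sample is the whole input, so it can be processed locally")
else:
    print("infer only: use the sample just to learn the output schema")
```

With the shortcut removed, as proposed in the follow-up commit, only the second branch would remain: the sample is used for schema inference regardless of how many records exist.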
@HyukjinKwon HyukjinKwon deleted the allow-no-type-hint branch November 6, 2019 02:23

Successfully merging this pull request may close these issues.

Koalas groupby.apply does not keep the index whereas pandas does.
