Understanding Groupbyapply #834

@devarshml

Hello there. Firstly, thank you for such an amazing package that bridges the gap between pandas and PySpark.
I started using Koalas about a week ago, and everything was intuitive until I stumbled upon Koalas GroupBy.apply.

Code:

import databricks.koalas as ks

def _koalas_train(frame):
    # Sum one column of the group; this returns a scalar per group.
    return frame['trans_type_value'].sum()

if __name__ == '__main__':
    # features_data is a pandas DataFrame (see the note below).
    ks_df = ks.DataFrame(features_data)
    ks_df_info_abt_train = ks_df.groupby(['div_nbr', 'store_nbr']).apply(_koalas_train)

Here features_data is a pandas DataFrame (pd.DataFrame).

Output from Koalas GroupBy.apply:

[Screenshot: Screen Shot 2019-09-27 at 2 27 26 PM]

Output from pandas GroupBy.apply:

[Screenshot: Screen Shot 2019-09-27 at 2 30 13 PM]

As you can see, the output from pandas GroupBy.apply is as expected, but the output from Koalas GroupBy.apply is not. Could you point me in the right direction by flagging any logical mistake I might have made, or anything else that could explain the difference?
Thank you once again.

Koalas version - 0.18.0
Pandas version - 0.23.4
PySpark - 2.4.3
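
For reference, the pandas behaviour described above can be sketched without Spark. The frame below is made-up stand-in data (the real features_data is not shown in this issue), and only the column names div_nbr, store_nbr and trans_type_value are taken from the code above. When the applied function returns a scalar, pandas GroupBy.apply produces one value per group, indexed by the grouping keys. The last line shows a direct aggregation that should give the same numbers and may be a simpler way to express the same computation.

```python
import pandas as pd

# Hypothetical stand-in for features_data; values are invented for illustration.
features_data = pd.DataFrame({
    "div_nbr": [1, 1, 1, 2],
    "store_nbr": [10, 10, 20, 30],
    "trans_type_value": [1.0, 2.0, 3.0, 4.0],
})

def _koalas_train(frame):
    # Returns a scalar for each (div_nbr, store_nbr) group.
    return frame["trans_type_value"].sum()

# pandas: one row per group, indexed by (div_nbr, store_nbr).
expected = features_data.groupby(["div_nbr", "store_nbr"]).apply(_koalas_train)

# Equivalent direct aggregation, with no user-defined function involved.
direct = features_data.groupby(["div_nbr", "store_nbr"])["trans_type_value"].sum()

print(expected)
print(direct)
```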
