Incorrect type of output of DataFrameGroupBy.apply #2234

Closed
@gshimansky

Description

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):

Ubuntu 19.10

  • Modin version (modin.__version__):

0.8.1.1+4.gbd8f73a

  • Python version:

Python 3.8.3

  • Code we can use to reproduce:
import modin.pandas as pd

x = pd.DataFrame(
    {"id2": ["id056", "id075", "id077", "id072", "id010"],
     "id4": [82, 30, 40, 92, 34],
     "v1": [4, 1, 5, 1, 5],
     "v2": [3, 2, 3, 3, 2]
    }
)
gb = x.groupby(['id2','id4'], observed=True)
print(gb)
print("groups = ", gb.groups)
print("len(groups) = ", len(gb.groups))
print("indices = ", gb.indices)

df = gb.apply(lambda x: pd.Series({'r2': 12345}))
print(type(df))
print(df)

Describe the problem

This is a simplified reproducer of code from https://github.com/h2oai/db-benchmark. The output of gb.apply is a modin.pandas.series.Series, while in pandas it is a pandas.core.frame.DataFrame. The H2O benchmark later calls reset_index on this variable (https://github.com/h2oai/db-benchmark/blob/master/modin/groupby-modin.py#L260), which raises TypeError: Cannot reset_index inplace on a Series to create a DataFrame. In addition, the resulting Series object cannot be printed: printing it causes infinite recursion.
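For comparison, here is a minimal sketch of the same reproducer run against plain pandas, which shows the result type the benchmark relies on. The variable names expected and flat are illustrative only and do not come from the benchmark.

```python
import pandas as pd

# Same data as the reproducer, built with plain pandas instead of Modin.
x = pd.DataFrame(
    {"id2": ["id056", "id075", "id077", "id072", "id010"],
     "id4": [82, 30, 40, 92, 34],
     "v1": [4, 1, 5, 1, 5],
     "v2": [3, 2, 3, 3, 2]
    }
)
gb = x.groupby(["id2", "id4"], observed=True)

# Each group returns a Series with the same index ("r2"), so pandas
# combines the per-group results into a DataFrame with one "r2" column.
expected = gb.apply(lambda group: pd.Series({"r2": 12345}))

# reset_index turns the group keys back into ordinary columns; this is
# the call that fails when the result is a Series.
flat = expected.reset_index()

print(type(expected))
print(flat)
```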

There is a workaround: if the function passed to apply builds its result as a pandas.Series object, this code works. It does not work with Modin's native Series object.
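A rough sketch of that workaround follows. The helper name apply_with_pandas_series and its frame_lib parameter are hypothetical, introduced only so the same code can run against either library. It is exercised here with plain pandas; passing modin.pandas instead is the case described in this report and has not been verified in this sketch.

```python
import pandas


def apply_with_pandas_series(frame_lib):
    """Run the reproducer, building each group's result as a pandas.Series.

    frame_lib is the module that provides DataFrame (pandas or, per the
    report, modin.pandas). This helper is hypothetical, not a Modin API.
    """
    x = frame_lib.DataFrame(
        {"id2": ["id056", "id075", "id077", "id072", "id010"],
         "id4": [82, 30, 40, 92, 34],
         "v1": [4, 1, 5, 1, 5],
         "v2": [3, 2, 3, 3, 2]
        }
    )
    gb = x.groupby(["id2", "id4"], observed=True)
    # The workaround: return pandas.Series, not the Modin Series type.
    return gb.apply(lambda group: pandas.Series({"r2": 12345}))


result = apply_with_pandas_series(pandas)
print(type(result))
```

With Modin installed, the call would be apply_with_pandas_series(modin.pandas) after importing modin.pandas.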

This bug is very likely related to bug #1682.

Source code / logs
