Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
Ubuntu 19.10
- Modin version (modin.__version__):
0.8.1.1+4.gbd8f73a
- Python version:
Python 3.8.3
- Code we can use to reproduce:
import modin.pandas as pd
x = pd.DataFrame(
    {
        "id2": ["id056", "id075", "id077", "id072", "id010"],
        "id4": [82, 30, 40, 92, 34],
        "v1": [4, 1, 5, 1, 5],
        "v2": [3, 2, 3, 3, 2],
    }
)
gb = x.groupby(['id2','id4'], observed=True)
print(gb)
print("groups = ", gb.groups)
print("len(groups) = ", len(gb.groups))
print("indices = ", gb.indices)
df = gb.apply(lambda x: pd.Series({'r2': 12345}))
print(type(df))
print(df)
Describe the problem
This is a simplified reproducer of code from https://github.com/h2oai/db-benchmark. On Modin, the type of the result is modin.pandas.series.Series, while on pandas it is pandas.core.frame.DataFrame. The H2O benchmark later calls reset_index on this variable (https://github.com/h2oai/db-benchmark/blob/master/modin/groupby-modin.py#L260), which raises the exception TypeError: Cannot reset_index inplace on a Series to create a DataFrame. The resulting Series object is also unprintable: printing it causes infinite recursion.
There is a workaround: if the function passed to apply returns a pandas.Series object, this code works. It does not work with the Modin native Series object.
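The workaround can be sketched roughly as follows. This is only an illustration, not a fix: the helper name apply_with_pandas_series and its frame_lib parameter are made up for this example. The block itself runs against plain pandas only, to show the expected result type. Passing modin.pandas as frame_lib is how the workaround would be tried on Modin, and that path was not verified here.

```python
import pandas  # plain pandas, used to build the per-group result


def apply_with_pandas_series(frame_lib):
    """Group the reproducer frame and return a pandas.Series per group.

    frame_lib is the module that provides DataFrame: either pandas
    itself or modin.pandas (hypothetical parameter, illustration only).
    """
    frame = frame_lib.DataFrame(
        {
            "id2": ["id056", "id075", "id077", "id072", "id010"],
            "id4": [82, 30, 40, 92, 34],
            "v1": [4, 1, 5, 1, 5],
            "v2": [3, 2, 3, 3, 2],
        }
    )
    grouped = frame.groupby(["id2", "id4"], observed=True)
    # Workaround: the callback builds a pandas.Series,
    # not a modin.pandas.Series.
    return grouped.apply(lambda group: pandas.Series({"r2": 12345}))


# Baseline on plain pandas: the result is a DataFrame, so reset_index works.
baseline = apply_with_pandas_series(pandas)
print(type(baseline))
print(baseline.reset_index())
```

To try it on Modin, call apply_with_pandas_series(modin.pandas) after importing modin.pandas. According to the description above, this variant produces a usable result on the affected version.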
This bug is very likely related to bug #1682.