Multi-column named groupby aggregation is broken #2406

Closed
@dchigarev

Reproducer:

import modin.pandas as pd
import numpy as np

nrows = 256
ncols = 128
data = {
    f"col{i}": np.random.randint(0, 100, nrows)
    for i in np.arange(ncols)
}

agg_fn = {"max": ("col1", np.max), "min": ("col127", np.min)}

df = pd.DataFrame(data)
res = df.groupby("col0").agg(**agg_fn) # KeyError: 'col127' does not exist
print(res)

Describe the problem

This happens because the current implementation assumes that every column referenced in the aggregation dict exists in every partition, which is not true. We should probably check which of the dict's columns are present in the partition being processed and drop the missing ones from the dict before applying it.
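A rough sketch of that idea, using plain pandas to mimic column partitions. This is not Modin's actual internals: the helper name `restrict_agg_to_partition`, the two hand-made partitions, and the assumption that each partition carries the groupby key are all hypothetical and only illustrate dropping missing columns from the dict.

```python
import numpy as np
import pandas as pd  # plain pandas, used only to imitate column partitions


def restrict_agg_to_partition(agg_fn, partition_columns):
    """Hypothetical helper: keep only the named aggregations whose
    source column is present in the given partition."""
    available = set(partition_columns)
    return {
        name: (col, func)
        for name, (col, func) in agg_fn.items()
        if col in available
    }


rng = np.random.default_rng(0)
df = pd.DataFrame({f"col{i}": rng.integers(0, 100, 16) for i in range(8)})
agg_fn = {"max": ("col1", "max"), "min": ("col7", "min")}

# Two imitation column partitions; each is assumed to carry the groupby key.
left = df[["col0", "col1", "col2", "col3"]]
right = df[["col0", "col4", "col5", "col6", "col7"]]

parts = []
for part in (left, right):
    # Drop aggregations that refer to columns this partition does not hold.
    local_agg = restrict_agg_to_partition(agg_fn, part.columns)
    if local_agg:
        parts.append(part.groupby("col0").agg(**local_agg))

# Stitch the per-partition results together in the requested column order.
result = pd.concat(parts, axis=1)[list(agg_fn)]
expected = df.groupby("col0").agg(**agg_fn)
```

Here `result` should match `expected`, the aggregation computed on the unpartitioned frame, since each partition only ever sees the aggregations it can actually compute.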

Labels: bug (Something isn't working)
