Description
Reproducer:
import modin.pandas as pd
import numpy as np
nrows = 256
ncols = 128
data = {
    f"col{i}": np.random.randint(0, 100, nrows)
    for i in np.arange(ncols)
}
agg_fn = {"max": ("col1", np.max), "min": ("col127", np.min)}
df = pd.DataFrame(data)
res = df.groupby("col0").agg(**agg_fn) # KeyError: 'col127' does not exist
print(res)
Describe the problem
This happens because the current implementation assumes that every column named in the aggregation dict exists in every partition, which is not true. We should probably check that the partition we are applying the function to contains all the columns from the dict, and drop the missing ones from the dict otherwise.
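A minimal sketch of the proposed check, not the actual Modin code: the helper name `filter_named_aggs` and the two column lists are hypothetical, and the builtins `max` and `min` stand in for `np.max` and `np.min` so the snippet runs without extra dependencies. It drops every named aggregation whose source column is absent from the partition's columns.

```python
def filter_named_aggs(named_aggs, partition_columns):
    """Keep only the aggregations whose source column is in this partition.

    `named_aggs` maps an output name to a (source_column, function) tuple,
    as in the reproducer. `partition_columns` is any iterable of the column
    labels held by one partition.
    """
    present = set(partition_columns)
    return {
        out_name: (col, func)
        for out_name, (col, func) in named_aggs.items()
        if col in present
    }


# Named aggregations from the reproducer (builtins used as stand-ins).
agg_fn = {"max": ("col1", max), "min": ("col127", min)}

# Hypothetical column partitions: the "by" column plus a slice of the data.
left_partition = ["col0", "col1", "col2"]
right_partition = ["col0", "col126", "col127"]

left_aggs = filter_named_aggs(agg_fn, left_partition)
right_aggs = filter_named_aggs(agg_fn, right_partition)

print(sorted(left_aggs))   # ['max']
print(sorted(right_aggs))  # ['min']
```

With this filtering, each partition only receives the aggregations it can compute, so the lookup of `col127` would no longer be attempted on a partition that does not hold that column. A partition left with an empty dict would presumably need to be skipped.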