Description
Is your feature request related to a problem?
DataFrameGroupBy.pipe() cannot be used to apply a UDF to the grouped DataFrames.
This is about an inconsistency in the higher-order methods API.
From the docs, I see that we have five higher-order methods: .pipe(), .apply(), .agg(), .transform(), and .applymap().
For DataFrame, .pipe() applies a function to the DataFrame, whereas DataFrameGroupBy.pipe(func) seems to be just syntactic sugar for DataFrameGroupBy.func(). We can use a UDF there only if it works on the DataFrameGroupBy object itself, through its proper methods (this is what the docs say, and a little experimenting suggests the same).
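A minimal sketch of what each .pipe() call hands to the function. The toy frame and the helper names total and spread are made up for illustration. DataFrame.pipe passes the DataFrame, while DataFrameGroupBy.pipe passes the GroupBy object itself, so the function is called once rather than once per group.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})


def total(frame):
    # Called by DataFrame.pipe with the whole DataFrame.
    return frame["val"].sum()


def spread(gb):
    # Called by DataFrameGroupBy.pipe with the GroupBy object,
    # so only GroupBy methods (max, min, ...) are available here.
    return gb["val"].max() - gb["val"].min()


df_result = df.pipe(total)                  # same as total(df)
gb_result = df.groupby("key").pipe(spread)  # same as spread(df.groupby("key"))
```

A function written for a plain DataFrame, such as total, will generally not work when it is handed the GroupBy object instead.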
For DataFrameGroupBy, .apply() applies a function to each group's DataFrame, whereas DataFrame.apply(func) applies func to the DataFrame's columns.
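To make the difference in function input concrete, here is a small sketch that only records the type each .apply() passes to the function. The toy frame is an assumption for illustration, and the columns are selected explicitly so that the grouping column is not part of each group.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [10, 20, 30]})

# DataFrame.apply: the function is called once per column with a Series.
col_types = df[["x", "y"]].apply(lambda s: type(s).__name__)

# DataFrameGroupBy.apply: the function is called once per group with a DataFrame.
group_types = df.groupby("key")[["x", "y"]].apply(lambda g: type(g).__name__)
```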
Describe the solution you'd like
For consistency, I propose using .pipe() on both DataFrame and DataFrameGroupBy to apply a function to a DataFrame, grouped or not.
One other thought: do we really need .apply() when it essentially does what .applymap() does? For consistency, I think .apply() had better be reserved for applying a function to columns (axis=0) or rows (axis=1). And if we think of .apply() as a rather free-form method (in comparison to .agg(), whose function should be a reducer, and .transform(), whose function should be a transformer), we might better distinguish what the function's input is, for example by naming the methods .apply_df() and .apply_ser().
API breaking implications
.apply() should no longer be used for applying functions to a DataFrame; it should specialize in applying functions to columns.
Describe alternatives you've considered
Leave .apply() as it is and adopt more specific methods like .apply_df() and .apply_ser().
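A rough sketch of how the two names could behave, written as plain functions on top of the existing .apply() methods. apply_df and apply_ser do not exist in pandas; the signatures below are only an assumption about what the proposal might look like.

```python
import pandas as pd


def apply_df(grouped, func):
    """Hypothetical: call func once per group with that group's DataFrame."""
    return grouped.apply(func)


def apply_ser(frame, func, axis=0):
    """Hypothetical: call func once per column (axis=0) or row (axis=1) Series."""
    return frame.apply(func, axis=axis)


# Toy data, assumed for illustration only.
df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [10, 20, 30]})

# The function sees a DataFrame per group.
by_group = apply_df(df.groupby("key")[["x", "y"]], lambda g: g["x"].sum() + g["y"].sum())

# The function sees a Series per column.
by_column = apply_ser(df[["x", "y"]], lambda s: s.max() - s.min())
```

With names like these, the expected input of the function would be visible at the call site.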
Additional context
Here is an example illustrating my point.
import numpy as np
import pandas as pd
from scipy.stats import trim_mean
import functools
n = 1000
df = pd.DataFrame(
    {
        "Store": np.random.choice(["Store_1", "Store_2"], n),
        "Product": np.random.choice(["Product_1", "Product_2"], n),
        "Revenue": (np.random.random(n) * 50 + 10).round(2),
        "Quantity": np.random.randint(1, 10, size=n),
    }
)
f1 = functools.partial(trim_mean, proportiontocut=0.2)
df[['Revenue', 'Quantity']].pipe(f1)
## array([34.6658 , 5.09666667])
df.groupby(['Store', 'Product']).pipe(f1)
## ValueError: Can only compare identically-labeled DataFrame objects
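For comparison, a sketch of how the same UDF can be applied per group today. It goes through .apply(), with a small wrapper (the hypothetical helper trimmed) that puts the array returned by trim_mean back into a labelled Series. The seeded generator and the smaller n are assumptions made only to keep the sketch reproducible.

```python
import functools

import numpy as np
import pandas as pd
from scipy.stats import trim_mean

rng = np.random.default_rng(0)  # seeded for reproducibility (assumption)
n = 200
df = pd.DataFrame(
    {
        "Store": rng.choice(["Store_1", "Store_2"], n),
        "Product": rng.choice(["Product_1", "Product_2"], n),
        "Revenue": (rng.random(n) * 50 + 10).round(2),
        "Quantity": rng.integers(1, 10, size=n),
    }
)
f1 = functools.partial(trim_mean, proportiontocut=0.2)


def trimmed(group):
    # Hypothetical helper: f1 returns a bare array, so re-attach the column labels.
    return pd.Series(f1(group.to_numpy()), index=group.columns)


grouped = df.groupby(["Store", "Product"])[["Revenue", "Quantity"]]

# .apply() calls trimmed once per group with that group's DataFrame.
per_group = grouped.apply(trimmed)

# .pipe() passes the GroupBy object, so the per-group dispatch has to
# happen inside the piped function.
per_group_pipe = grouped.pipe(lambda gb: gb.apply(trimmed))
```

Both calls give one row per (Store, Product) pair, but only because the piped function calls .apply() itself.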