ENH: .pipe() on DataFrameGroupBy #46655

@kwhkim

Is your feature request related to a problem?

DataFrameGroupBy.pipe() cannot be used with a user-defined function (UDF).

This is related to the higher-order methods API and an inconsistency in it.

From the doc I see

we have five methods: .pipe(), .apply(), .agg(), .transform(), and .applymap().

For DataFrame, .pipe() applies a function to a DataFrame, whereas DataFrameGroupBy.pipe(func) is (maybe) just syntactic sugar for DataFrameGroupBy.func(). We cannot use a UDF unless it works like a proper DataFrameGroupBy method (this is what the doc says, and a little experimenting seems to confirm it).
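
To make the difference concrete, here is a minimal sketch of what each .pipe() hands to the function. The helper describe_input and the toy frame are made up for illustration. As far as I can tell, both methods simply call func(obj), so the GroupBy version receives the GroupBy object itself rather than each group's DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})


def describe_input(obj):
    # Hypothetical helper: report the type of whatever .pipe() passes in.
    return type(obj).__name__


frame_input = toy.pipe(describe_input)
grouped_input = toy.groupby("key").pipe(describe_input)

print(frame_input)    # DataFrame
print(grouped_input)  # DataFrameGroupBy

# A UDF written against the GroupBy API does work with .pipe():
spread = toy.groupby("key").pipe(lambda gb: gb["val"].max() - gb["val"].min())
print(spread.to_dict())  # {'a': 1, 'b': 0}
```

So a UDF that expects a plain DataFrame only fits the first call; the second one needs a function that knows how to handle a GroupBy object.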

For DataFrameGroupBy, .apply() applies a function to each group's DataFrame, whereas DataFrame.apply(func) applies func to the DataFrame's columns.
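
A small sketch of that asymmetry, using a made-up helper input_kind and a toy frame (neither is from the docs). It only reports which type .apply() passes to the function in each case.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 3], "y": [10, 20, 30]})


def input_kind(obj):
    # Hypothetical helper: name the type that .apply() passes to the function.
    return type(obj).__name__


# DataFrame.apply calls the function once per column (axis=0 by default).
per_column = toy[["x", "y"]].apply(input_kind)

# DataFrameGroupBy.apply calls the function once per group.
per_group = toy.groupby("key")[["x", "y"]].apply(input_kind)

print(per_column.to_dict())  # {'x': 'Series', 'y': 'Series'}
print(per_group.to_dict())   # {'a': 'DataFrame', 'b': 'DataFrame'}
```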

Describe the solution you'd like

For consistency, I propose using .pipe() on both DataFrame and DataFrameGroupBy to apply a function to a DataFrame, grouped or not.

One other thought: do we really need .apply() when it essentially does what .applymap() does? For consistency, I think .apply() should be reserved for applying a function to columns (axis=0) or rows (axis=1). And if we treat .apply() as a comparatively free-form method (in contrast to .agg(), whose function should be a reducer, and .transform(), whose function should be a transformer), it would be better to distinguish what the function's input is, for example by naming the methods .apply_df() and .apply_ser().
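
For reference, a minimal sketch of the reducer/transformer distinction on a grouped column. The toy data and lambdas are invented for illustration; the point is only the shape of what each method expects back.

```python
import pandas as pd

toy = pd.DataFrame({"key": ["a", "a", "b"], "val": [1.0, 3.0, 5.0]})
grouped = toy.groupby("key")["val"]

# .agg(): the function reduces each group to a single value.
reduced = grouped.agg(lambda s: s.max() - s.min())

# .transform(): the function returns a result aligned with the input rows.
centered = grouped.transform(lambda s: s - s.mean())

# .apply(): accepts the same reducer without complaint.
applied = grouped.apply(lambda s: s.max() - s.min())

print(reduced.to_dict())  # {'a': 2.0, 'b': 0.0}
print(centered.tolist())  # [-1.0, 1.0, 0.0]
print(applied.to_dict())  # {'a': 2.0, 'b': 0.0}
```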

API breaking implications

.apply() should stop applying functions to a whole DataFrame and specialize in applying functions to columns.

Describe alternatives you've considered

Leave .apply() as it is and add more specific methods such as .apply_df() and .apply_ser().

Additional context

Here is an example illustrating my point.

import functools

import numpy as np
import pandas as pd
from scipy.stats import trim_mean

# Random sample data, so the numbers below vary from run to run
n = 1000
df = pd.DataFrame(
    {
        "Store": np.random.choice(["Store_1", "Store_2"], n),
        "Product": np.random.choice(["Product_1", "Product_2"], n),
        "Revenue": (np.random.random(n) * 50 + 10).round(2),
        "Quantity": np.random.randint(1, 10, size=n),
    }
)

# Trimmed mean that cuts 20% of the observations from each end
f1 = functools.partial(trim_mean, proportiontocut=0.2)

# Works: the whole DataFrame is passed to f1
df[['Revenue', 'Quantity']].pipe(f1)
## array([34.6658    ,  5.09666667])

# Fails: f1 receives the DataFrameGroupBy object, not each group's DataFrame
df.groupby(['Store', 'Product']).pipe(f1)
## ValueError: Can only compare identically-labeled DataFrame objects
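
For comparison, a sketch of how the per-group result can be obtained today by going through .apply() instead of .pipe(). To keep it self-contained, trimmed_mean is a hypothetical NumPy-only stand-in for scipy.stats.trim_mean, and the toy data is invented.

```python
import numpy as np
import pandas as pd


def trimmed_mean(frame, proportiontocut=0.2):
    # Hypothetical NumPy-only stand-in for scipy.stats.trim_mean:
    # sort each column, drop the given fraction from both ends, average the rest.
    values = np.sort(frame.to_numpy(dtype=float), axis=0)
    cut = int(proportiontocut * len(values))
    trimmed = values[cut:len(values) - cut]
    return pd.Series(trimmed.mean(axis=0), index=frame.columns)


toy = pd.DataFrame(
    {
        "Store": ["Store_1"] * 5 + ["Store_2"] * 5,
        "Revenue": [10.0, 20.0, 30.0, 40.0, 1000.0, 5.0, 6.0, 7.0, 8.0, 9.0],
        "Quantity": [1, 2, 3, 4, 5, 2, 2, 2, 2, 50],
    }
)

# On a plain DataFrame, .pipe() hands the whole frame to the UDF.
overall = toy[["Revenue", "Quantity"]].pipe(trimmed_mean)

# Per group, the same UDF currently has to go through .apply().
per_store = toy.groupby("Store")[["Revenue", "Quantity"]].apply(trimmed_mean)
print(per_store)
```

Under the proposal, the last call would be spelled .pipe(trimmed_mean) so that the same method name works for grouped and ungrouped frames.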

Metadata

    Labels

    Apply (Apply, Aggregate, Transform, Map), Enhancement, Needs Discussion (Requires discussion from core team before further action)
