This repository was archived by the owner on Apr 10, 2024. It is now read-only.
On the mailing list, you mentioned the idea of an "expression VM"; this feels like the kind of thing that would be nicely handled by that? Just making up an API, something like this, where a delayed df builds up a dask/numexpr-like graph that can be optimized:
df = pd.read_csv(...)
with pd.delayed(df) as df:
    df['val'] = df['val'] + 100.
    <... several intermediate expressions ...>
    answer = df[cond].groupby(expr).agg(...).compute()
# `df` is unmodified, only `answer` is computed, hopefully very efficiently
Although that's really broad so maybe this is a useful enough case to just build directly into groupby ops.
Yeah, the idea behind an "expression VM" is similar to the design of APL interpreters. This is a bigger topic than this issue, but normal pandas operations would be implemented through the eager evaluation of operators in pandas's internal set of functions. Once you have multiple operators you can begin to think about optimizing the evaluation or rearranging the query plan. SFrame (RIP?) notably does this.
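To make the "rearranging the query plan" point a bit more concrete, here is a toy sketch in plain Python. It is not pandas, SFrame, or any proposed pandas2 API: the `Plan` class and its `filter`, `groupby_sum`, and `compute` methods are hypothetical names invented for illustration. Operations are only recorded until `compute()` is called, at which point a filter followed by a group-wise sum is fused into a single pass.

```python
class Plan:
    """Hypothetical deferred plan: records operations instead of running them."""

    def __init__(self, rows, ops=()):
        self.rows = rows          # list of dicts, standing in for a DataFrame
        self.ops = tuple(ops)     # recorded operations, in order

    def filter(self, pred):
        # Nothing is evaluated here; the predicate is only recorded.
        return Plan(self.rows, self.ops + (("filter", pred),))

    def groupby_sum(self, key, value):
        return Plan(self.rows, self.ops + (("groupby_sum", key, value),))

    def compute(self):
        if not self.ops:
            raise ValueError("empty plan")
        *filters, last = self.ops
        if last[0] != "groupby_sum" or any(op[0] != "filter" for op in filters):
            raise NotImplementedError("this sketch only fuses filter(s) + groupby_sum")
        _, key, value = last
        preds = [op[1] for op in filters]
        sums = {}
        # Fused evaluation: one pass over the rows, no filtered copy is built.
        for row in self.rows:
            if all(p(row) for p in preds):
                sums[row[key]] = sums.get(row[key], 0) + row[value]
        return sums


rows = [{"k": "a", "val": 1}, {"k": "b", "val": 2}, {"k": "a", "val": 3}]
answer = Plan(rows).filter(lambda r: r["val"] > 1).groupby_sum("k", "val").compute()
```

A real expression VM would of course handle far more operators and work on columnar data; the only point here is that deferring evaluation gives the engine a chance to see the filter and the groupby together.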
xref #15
I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:
If you do this currently, it will produce a fully materialized copy of df even if the groupby only touches a small portion of the DataFrame. Ideally, we'd have:
I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.
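As a rough illustration of what pushing the boolean mask down into the aggregation routine could mean, here is a plain-Python sketch. It is not how pandas implements groupby; the function names `filter_then_group_sum` and `group_sum_with_mask` are made up for this example, and lists stand in for columns.

```python
from collections import defaultdict


def filter_then_group_sum(keys, values, mask):
    """Eager approach: build filtered copies first, then aggregate."""
    # These two lists stand in for the materialized copy of the filtered frame.
    kept_keys = [k for k, m in zip(keys, mask) if m]
    kept_values = [v for v, m in zip(values, mask) if m]
    sums = defaultdict(float)
    for k, v in zip(kept_keys, kept_values):
        sums[k] += v
    return dict(sums)


def group_sum_with_mask(keys, values, mask):
    """Pushdown approach: the aggregation loop consults the mask directly."""
    sums = defaultdict(float)
    for k, v, m in zip(keys, values, mask):
        if m:  # rows rejected by the predicate are skipped; no copy is built
            sums[k] += v
    return dict(sums)


keys = ["a", "b", "a", "c", "b"]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
mask = [True, False, True, False, True]

eager = filter_then_group_sum(keys, values, mask)
pushed = group_sum_with_mask(keys, values, mask)
```

Both functions return the same result; the second simply avoids the intermediate filtered copies, which is the saving at stake when only a few columns take part in the groupby.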