"Predicate pushdown" in group-bys #7

wesm · 2016-08-31T03:29:56Z

xref #15

I brought this up at SciPy 2015, but there's a significant performance win available in expressions like:

df[boolean_cond].groupby(grouping_exprs).agg(agg_expr)

If you do this currently, it will produce a fully materialized copy of df even if the groupby only touches a small portion of the DataFrame. Ideally, we'd have:

df.groupby(grouping_exprs, where=boolean_cond).agg(...)

I put this as a design / pandas2 issue because the boolean bytes / bits will need to get pushed down into the various C-level groupby subroutines.

The text was updated successfully, but these errors were encountered:

chris-b1 · 2016-09-01T02:12:17Z

On the mailing list, you mentioned the idea of an "expression VM", this feels like the kind of
thing that would be nicely handled by that? Just making up an API, something like this, where a delayed df builds up a dask/numexpr like graph that can be optimized.

df = pd.read_csv(...)
with pd.delayed(df) as df:
   df['val'] = df['val'] + 100.
   <... several intermediate expressions ... >
   answer = df[cond].groupby(expr).agg(...).compute()

# `df` is unmodified, only `answer` is computed, hopefully very efficiently

Although that's really broad so maybe this is a useful enough case to just build directly into groupby ops.

wesm · 2016-09-01T03:52:19Z

Yeah, the idea behind an "expression VM" is similar to the design of APL interpreters. This is a bigger topic than this issue, but normal pandas operations would be implemented through the eager evaluation of operators in pandas's internal set of functions. Once you have multiple operators you can begin to think about optimizing the evaluation or rearranging the query plan. SFrame (RIP?) notably does this

https://github.com/turi-code/SFrame/tree/master/oss_src/sframe_query_engine

wesm added performance memory-use labels Sep 1, 2016

wesm mentioned this issue Sep 16, 2016

Alternate groupby API that is more functionally consistent with databases or systems like dplyr #15

Open

jreback mentioned this issue Sep 30, 2016

require numexpr / numba #37

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"Predicate pushdown" in group-bys #7

"Predicate pushdown" in group-bys #7

wesm commented Aug 31, 2016 •

edited by jreback

Loading

chris-b1 commented Sep 1, 2016

wesm commented Sep 1, 2016

"Predicate pushdown" in group-bys #7

"Predicate pushdown" in group-bys #7

Comments

wesm commented Aug 31, 2016 • edited by jreback Loading

chris-b1 commented Sep 1, 2016

wesm commented Sep 1, 2016

wesm commented Aug 31, 2016 •

edited by jreback

Loading