PERF: Use DataFrame-level reductions in DataFrame.agg with list of funcs #65031

Merged
rhshadrach merged 2 commits into pandas-dev:main from jbrockmendel:perf-45658 on Apr 10, 2026

Conversation

@jbrockmendel

Member

Summary

  • When DataFrame.agg receives a list of string function names (e.g. ["sum", "mean"]), use DataFrame-level reductions per dtype group instead of extracting each column as a Series and calling Series.agg per column.
  • For a 1000-column DataFrame, this reduces the time for df.agg(["sum"]) from ~110ms to ~0.5ms (~220x speedup).
  • Falls back to the existing per-column path for non-string functions, duplicate column names, or non-reduction methods.
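To make the two call shapes concrete, here is a minimal sketch (not taken from the PR) that builds a wide, short frame and checks that the list form of `agg` gives the same values as calling the DataFrame-level reductions directly. The 5 x 1000 shape and the random data are assumptions chosen for illustration; the timings are printed rather than asserted because they depend on the installed pandas version and the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative frame: few rows, many columns (the shape is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((5, 1000)))

# List-of-names form: one result row per function name.
listed = df.agg(["sum", "mean"])

# The same values via DataFrame-level reductions, stacked into rows.
direct = pd.DataFrame({"sum": df.sum(), "mean": df.mean()}).T

# Compare with a tolerance, since summation order may differ between paths.
print(np.allclose(listed.to_numpy(), direct.to_numpy()))

# Rough timing of both call shapes; no particular numbers are claimed here.
t_list = timeit.timeit(lambda: df.agg(["sum"]), number=3) / 3
t_str = timeit.timeit(lambda: df.agg("sum"), number=3) / 3
print(f"agg(['sum']): {t_list * 1e3:.2f} ms, agg('sum'): {t_str * 1e3:.2f} ms")
```

On a pandas build without this change, the first timing would be expected to be much larger than the second; with the change, the gap should shrink.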

closes #45658

Test plan

  • All existing pandas/tests/apply/ tests pass (922 passed)
  • All pandas/tests/reductions/ tests pass (546 passed)
  • Verified correctness with mixed dtypes (int/float), extension types (Int64, Float64), string columns, empty DataFrames, duplicate column names, and lambda fallback

🤖 Generated with Claude Code

When DataFrame.agg receives a list of function names (e.g. ["sum"]),
use DataFrame-level reductions per dtype group instead of extracting
each column as a Series and calling Series.agg per column.

closes pandas-dev#45658

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel added the Performance Memory or execution speed performance label Apr 2, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jbrockmendel jbrockmendel marked this pull request as ready for review April 9, 2026 15:04
@jbrockmendel

Member Author

cc @rhshadrach

Comment thread pandas/core/apply.py
# Compute reductions per dtype group to preserve per-column dtypes.
# Using to_frame().T for each result avoids the slow
# DataFrame(list-of-Series) construction path.
groups = obj.columns.groupby(obj.dtypes) # type: ignore[arg-type]
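The idea in the excerpt above can be sketched outside pandas internals roughly as follows. This is an illustrative approximation, not the code in `pandas/core/apply.py`: the helper name `agg_by_dtype_group` is hypothetical, the grouping uses a plain dict instead of `Index.groupby`, and it assumes unique column names and string names of reduction methods (the cases where the PR says it does not fall back).

```python
import pandas as pd


def agg_by_dtype_group(df: pd.DataFrame, func_names: list[str]) -> pd.DataFrame:
    """Hypothetical helper: reduce each same-dtype block of columns at once."""
    # Group column labels by dtype so each block can keep its own result dtype.
    groups: dict = {}
    for col, dtype in df.dtypes.items():
        groups.setdefault(dtype, []).append(col)

    pieces = []
    for cols in groups.values():
        block = df[cols]
        # One DataFrame-level reduction per function name; to_frame(name).T
        # turns each result Series into a single labelled row.
        rows = [getattr(block, name)().to_frame(name).T for name in func_names]
        pieces.append(pd.concat(rows))

    # Stitch the dtype blocks back together in the original column order.
    return pd.concat(pieces, axis=1)[list(df.columns)]


df = pd.DataFrame({"a": [1, 2, 3], "b": [1.5, 2.5, 3.5], "c": [4, 5, 6]})
result = agg_by_dtype_group(df, ["sum", "max"])
print(result)
print(result.dtypes)
```

Because the integer columns `a` and `c` are reduced separately from the float column `b`, they are never upcast to float just to share a row with `b`.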

Member

I like this, but do you feel certain that we can rely on equality of dtypes here? I don't know of any examples that would cause problems; I'm just wondering if there are edge cases where two dtypes would compare as equal despite some subtle difference (e.g. time resolution).

As long as we would call it a bug when two dtypes compare as equal without being precisely equal, I'm good here.

Member Author

We can't prevent a hypothetical 3rd-party EA dtype from lying about its equality, but I'm pretty confident this works as expected for all of our dtypes.
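As a quick illustration of the kind of check being discussed, the sketch below compares a few built-in dtypes that differ only subtly. It is not part of the PR, and it assumes pandas 2.0 or later, where non-nanosecond `datetime64` resolutions are preserved rather than cast to nanoseconds.

```python
import numpy as np
import pandas as pd

# Same timestamp stored at two resolutions (assumes pandas >= 2.0).
s_ns = pd.Series(np.array(["2026-04-10"], dtype="datetime64[ns]"))
s_s = pd.Series(np.array(["2026-04-10"], dtype="datetime64[s]"))

# The dtypes differ by resolution, so they would land in separate groups.
print(s_ns.dtype, s_s.dtype, s_ns.dtype == s_s.dtype)

# A nullable extension dtype is likewise not equal to its NumPy counterpart.
print(pd.Int64Dtype() == np.dtype("int64"))

# Grouping by dtype with a plain dict therefore keeps these columns apart.
df = pd.DataFrame({"ns": s_ns, "s": s_s})
groups: dict = {}
for col, dtype in df.dtypes.items():
    groups.setdefault(dtype, []).append(col)
print(len(groups))
```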

@rhshadrach rhshadrach added the Apply Apply, Aggregate, Transform, Map label Apr 10, 2026

@rhshadrach rhshadrach left a comment

Member

lgtm

@rhshadrach rhshadrach added this to the 3.1 milestone Apr 10, 2026
@rhshadrach rhshadrach merged commit 593c2df into pandas-dev:main Apr 10, 2026
51 checks passed
@rhshadrach

Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the perf-45658 branch April 10, 2026 21:17
Sharl0tteIsTaken added a commit to Sharl0tteIsTaken/pandas that referenced this pull request Apr 12, 2026
…-comparison

* upstream/main:
  PERF: use lookup instead of hash_inner_join for merge with unique right keys (pandas-dev#64691)
  BUG : update `SeriesGroupBy.ohlc()` to honor `as_index=False` (pandas-dev#65141)
  PERF: Use DataFrame-level reductions in DataFrame.agg with list of funcs (pandas-dev#65031)
  DOC: document required external libraries in read_* I/O docstrings (pandas-dev#65143)
  DOC: improve MultiIndex.is_monotonic_increasing/decreasing docstrings (pandas-dev#65154)
  BUG: Raise ValueError for non-boolean numeric_only in DataFrame/Series reductions (GH#53098) (pandas-dev#65131)
  BUG: Timedelta.round() raises ZeroDivisionError when internal unit is 's' and target frequency is sub-second (pandas-dev#64836)
  ENH: Add replace method to Index (closes pandas-dev#19495) (pandas-dev#65099)
  PERF: improve StringArray.isna (pandas-dev#57733)
  BUG: read parquet files with older pytz (DEP: keep lower pytz minimum version) (pandas-dev#65133)
  DEPR: deprecate dates-with-datetime64 in _maybe_downcast_for_indexing (pandas-dev#64871)
  DOC: note that DataFrame.values is not writeable (pandas-dev#65142)
  CLN: Update groupby observed defaults (pandas-dev#65148)
  PERF: avoid materializing values[indexer] in Block.setitem (pandas-dev#64251)
  DOC: update GroupBy.sum/min/max See Also sections (pandas-dev#65144)

Labels

Apply Apply, Aggregate, Transform, Map Performance Memory or execution speed performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PERF: Calling df.agg([function]) is much slower than df.agg(function) when there are many columns and few rows.

2 participants