[Data] Add local aggregation fast-path for small datasets (#61016) by kaveti · Pull Request #61402 · ray-project/ray

kaveti · 2026-02-28T18:31:57Z

For small datasets (below a configurable threshold, default 10 MiB), groupby/aggregate now executes entirely on the driver using existing map/reduce primitives instead of spawning a distributed actor pool. This eliminates the actor startup and coordination overhead that caused a ~350x slowdown versus pandas on 1M-row single-node workloads.

The fast-path is controlled by:

DataContext.small_dataset_agg_threshold_bytes (default: 10 MiB)
Env var: RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES

Set threshold to 0 to always use distributed aggregation.
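A minimal sketch of how the threshold described above could be resolved from the environment. The helper name `resolve_threshold_bytes` is hypothetical and not taken from the PR; only the environment variable name and the 10 MiB default come from the description.

```python
import os

# Default stated in the PR description: 10 MiB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024
ENV_VAR = "RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES"


def resolve_threshold_bytes(environ=None):
    """Hypothetical helper: return the env var value if set, else the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR)
    if raw is None:
        return DEFAULT_THRESHOLD_BYTES
    return int(raw)


# With nothing set, the default applies.
default_threshold = resolve_threshold_bytes({})
# A value of 0 is the documented way to always use distributed aggregation.
disabled_threshold = resolve_threshold_bytes({ENV_VAR: "0"})
```

In this sketch a value of 0 is passed through unchanged, so the caller is expected to treat it as "fast-path disabled".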

Fixes #61016


…t#61016)

For small datasets (below a configurable threshold, default 10 MiB), groupby/aggregate now executes entirely on the driver using existing map/reduce primitives instead of spawning a distributed actor pool. This eliminates actor startup and coordination overhead that caused ~350x slowdown vs pandas on 1M-row single-node workloads.

The fast-path is controlled by:

- DataContext.small_dataset_agg_threshold_bytes (default: 10 MiB)
- Env var: RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES

Set threshold to 0 to always use distributed aggregation.

Fixes ray-project#61016

Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>

gemini-code-assist

Code Review

This pull request introduces a valuable optimization for groupby/aggregate operations on small datasets by adding a local aggregation fast-path. The implementation is clean, well-contained, and effectively reuses existing map/reduce primitives to avoid the overhead of a distributed actor pool. The new feature is controlled by a configurable threshold, which is a good design choice. I have one minor suggestion to improve memory efficiency.

python/ray/data/_internal/planner/aggregate.py

The previous implementation placed the small-dataset fast-path inside generate_aggregate_fn, but the default shuffle strategy (HASH_SHUFFLE) bypasses that function entirely via plan_all_to_all_op.py, making the fast-path unreachable in the common case.

Fix:

- plan_all_to_all_op: when threshold > 0 and HASH_SHUFFLE is set, fall through to generate_aggregate_fn (AllToAllOperator) instead of immediately returning HashAggregateOperator. Small data gets the local fast-path; large data falls back to sort-based distributed agg. Set threshold=0 to always use HashAggregateOperator unchanged.
- aggregate.py: remove the sort-strategy-only assert and default the large-data scheduler to pull-based (covers HASH_SHUFFLE fallback).

Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>
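The dispatch behavior this commit message describes can be sketched as a small decision function. This is an illustration under stated assumptions, not the PR's code: the `ShuffleStrategy` enum here is a local stand-in, `choose_aggregation_path` is a hypothetical name, and the strict `<` comparison is inferred from the phrase "below the configured threshold".

```python
from enum import Enum


class ShuffleStrategy(Enum):
    """Local stand-in for the shuffle strategy setting (not Ray's enum)."""
    HASH_SHUFFLE = "hash_shuffle"
    SORT_SHUFFLE = "sort_shuffle"


def choose_aggregation_path(strategy, threshold_bytes, input_size_bytes):
    """Hypothetical: return which aggregation path the commit message implies."""
    # threshold == 0 with HASH_SHUFFLE keeps the original operator unchanged.
    if strategy is ShuffleStrategy.HASH_SHUFFLE and threshold_bytes == 0:
        return "HashAggregateOperator"
    # Otherwise control falls through to generate_aggregate_fn:
    # small inputs take the local fast-path ...
    if threshold_bytes > 0 and input_size_bytes < threshold_bytes:
        return "local fast-path"
    # ... and everything else uses sort-based distributed aggregation.
    return "sort-based distributed aggregation"


MIB = 1024 * 1024
small = choose_aggregation_path(ShuffleStrategy.HASH_SHUFFLE, 10 * MIB, 1 * MIB)
large = choose_aggregation_path(ShuffleStrategy.HASH_SHUFFLE, 10 * MIB, 100 * MIB)
disabled = choose_aggregation_path(ShuffleStrategy.HASH_SHUFFLE, 0, 1 * MIB)
```

The `large` case shows the trade-off of this design: with the default threshold, a HASH_SHUFFLE user whose input exceeds the threshold ends up on the sort-based path.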

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-02-28T18:45:16Z

python/ray/data/_internal/planner/aggregate.py

+    This avoids the overhead of spawning a distributed actor pool when the
+    total input data size is below the configured threshold.
+    """
+    blocks = ray.get(ref for bundle in refs for ref in bundle.block_refs)


Generator expression passed to ray.get causes ValueError

High Severity

ray.get() is called with a bare generator expression instead of a list. Internally, ray.get checks isinstance(object_refs, list) and raises a ValueError if the argument is not a list or ObjectRef. A generator expression is neither, so _local_aggregate will always crash at runtime, making the entire small-dataset fast-path non-functional.
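The failure mode and the likely fix can be illustrated without a Ray cluster. In this sketch, `get_checked` is a simplified stand-in for the argument check described above (it omits the single-ObjectRef case), and `Bundle` is a hypothetical placeholder for a ref bundle exposing `block_refs`; neither is Ray's actual code.

```python
def get_checked(object_refs):
    """Stand-in for the list check described above; not Ray's implementation."""
    if not isinstance(object_refs, list):
        raise ValueError(
            f"object_refs must be a list, got {type(object_refs).__name__}"
        )
    return list(object_refs)


class Bundle:
    """Hypothetical placeholder for a ref bundle with a block_refs attribute."""

    def __init__(self, block_refs):
        self.block_refs = block_refs


refs = [Bundle(["ref_a", "ref_b"]), Bundle(["ref_c"])]

# A bare generator expression fails the list check.
try:
    get_checked(ref for bundle in refs for ref in bundle.block_refs)
    generator_rejected = False
except ValueError:
    generator_rejected = True

# Materializing the flattened refs into a list first passes the check.
flat_refs = [ref for bundle in refs for ref in bundle.block_refs]
blocks = get_checked(flat_refs)
```

Applied to the reviewed line, the corresponding change would presumably be wrapping the comprehension in brackets: `ray.get([ref for bundle in refs for ref in bundle.block_refs])`.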

cursor · 2026-02-28T18:45:16Z

python/ray/data/_internal/planner/plan_all_to_all_op.py

+            # Otherwise fall through to generate_aggregate_fn, which will run
+            # local aggregation for small datasets and sort-based distributed
+            # aggregation for larger ones. Users can set the threshold to 0 to
+            # always use hash-shuffle aggregation.


Hash shuffle silently falls back to sort shuffle

Medium Severity

When shuffle_strategy is HASH_SHUFFLE and small_dataset_agg_threshold_bytes > 0 (the default), datasets exceeding the threshold silently fall through to generate_aggregate_fn, which uses PullBasedShuffleTaskScheduler instead of the HashAggregateOperator the user configured. This is a behavioral regression — users who explicitly chose HASH_SHUFFLE get sort-based aggregation for large datasets without any warning.

Additional Locations (1)

python/ray/data/_internal/planner/aggregate.py#L148-L154

kaveti requested a review from a team as a code owner February 28, 2026 18:31

kaveti force-pushed the fix/data-groupby-local-agg-fast-path branch from 46363c8 to 9c76ece Compare February 28, 2026 18:32

gemini-code-assist bot reviewed Feb 28, 2026

python/ray/data/_internal/planner/aggregate.py (outdated)

kaveti and others added 2 commits March 1, 2026 00:09

Update python/ray/data/_internal/planner/aggregate.py

9f26908

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>

cursor bot reviewed Feb 28, 2026


ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Feb 28, 2026

kaveti wants to merge 3 commits into ray-project:master from kaveti:fix/data-groupby-local-agg-fast-path
