[Data] Add local aggregation fast-path for small datasets (#61016) #61402

Open

kaveti wants to merge 3 commits into ray-project:master from kaveti:fix/data-groupby-local-agg-fast-path

Conversation

@kaveti kaveti commented Feb 28, 2026

For small datasets (below a configurable threshold, default 10 MiB), groupby/aggregate now executes entirely on the driver using existing map/reduce primitives instead of spawning a distributed actor pool. This eliminates actor startup and coordination overhead that caused ~350x slowdown vs pandas on 1M-row single-node workloads.

The fast-path is controlled by:

  • DataContext.small_dataset_agg_threshold_bytes (default: 10 MiB)
  • Env var: RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES

Set threshold to 0 to always use distributed aggregation.
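
A minimal sketch of overriding the threshold through the env var named above, assuming the variable is read once when Ray Data initializes its DataContext (the attribute and variable names come from this PR and may change):

```python
import os

# Default per this PR: 10 MiB.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024

# Raise the fast-path threshold to 32 MiB for this driver process.
# Assumption: set before Ray Data creates its DataContext.
os.environ["RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES"] = str(32 * 1024 * 1024)

threshold = int(
    os.environ.get(
        "RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES",
        DEFAULT_THRESHOLD_BYTES,
    )
)
print(threshold)  # 33554432
```

Equivalently, per the PR description, `DataContext.small_dataset_agg_threshold_bytes` can be set directly on the current context at runtime.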

Fixes #61016


@kaveti kaveti requested a review from a team as a code owner February 28, 2026 18:31
[Data] Add local aggregation fast-path for small datasets (ray-project#61016)

For small datasets (below a configurable threshold, default 10 MiB),
groupby/aggregate now executes entirely on the driver using existing
map/reduce primitives instead of spawning a distributed actor pool.
This eliminates actor startup and coordination overhead that caused
~350x slowdown vs pandas on 1M-row single-node workloads.

The fast-path is controlled by:
- DataContext.small_dataset_agg_threshold_bytes (default: 10 MiB)
- Env var: RAY_DATA_SMALL_DATASET_AGG_THRESHOLD_BYTES

Set threshold to 0 to always use distributed aggregation.

Fixes ray-project#61016

Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>
@kaveti kaveti force-pushed the fix/data-groupby-local-agg-fast-path branch from 46363c8 to 9c76ece on February 28, 2026 18:32
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization for groupby/aggregate operations on small datasets by adding a local aggregation fast-path. The implementation is clean, well-contained, and effectively reuses existing map/reduce primitives to avoid the overhead of a distributed actor pool. The new feature is controlled by a configurable threshold, which is a good design choice. I have one minor suggestion to improve memory efficiency.

kaveti and others added 2 commits March 1, 2026 00:09
The previous implementation placed the small-dataset fast-path inside
generate_aggregate_fn, but the default shuffle strategy (HASH_SHUFFLE)
bypasses that function entirely via plan_all_to_all_op.py, making the
fast-path unreachable in the common case.

Fix:
- plan_all_to_all_op: when threshold > 0 and HASH_SHUFFLE is set,
  fall through to generate_aggregate_fn (AllToAllOperator) instead of
  immediately returning HashAggregateOperator. Small data gets the local
  fast-path; large data falls back to sort-based distributed agg.
  Set threshold=0 to always use HashAggregateOperator unchanged.
- aggregate.py: remove the sort-strategy-only assert and default the
  large-data scheduler to pull-based (covers HASH_SHUFFLE fallback).
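
The routing described above can be sketched as follows (names are illustrative, not the actual Ray Data internals; the real logic lives in plan_all_to_all_op.py and generate_aggregate_fn):

```python
def choose_aggregation_plan(shuffle_strategy, threshold_bytes, input_size_bytes):
    """Illustrative dispatch mirroring the commit message above."""
    if shuffle_strategy == "HASH_SHUFFLE" and threshold_bytes == 0:
        # Threshold of 0 opts out of the fast-path entirely:
        # keep the configured hash aggregation unchanged.
        return "HashAggregateOperator"
    # Otherwise fall through to the AllToAllOperator path
    # (generate_aggregate_fn), which picks local vs. distributed.
    if input_size_bytes <= threshold_bytes:
        return "local_fast_path"
    return "sort_based_distributed"
```

With the default 10 MiB threshold, a 1 KiB input takes the local fast-path while a 64 MiB input falls back to sort-based distributed aggregation, even under HASH_SHUFFLE.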

Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: rkaveti <kavetiraviteja1992@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

This avoids the overhead of spawning a distributed actor pool when the
total input data size is below the configured threshold.
"""
blocks = ray.get(ref for bundle in refs for ref in bundle.block_refs)


Generator expression passed to ray.get causes ValueError

High Severity

ray.get() is called with a bare generator expression instead of a list. Internally, ray.get checks isinstance(object_refs, list) and raises a ValueError if the argument is not a list or ObjectRef. A generator expression is neither, so _local_aggregate will always crash at runtime, making the entire small-dataset fast-path non-functional.

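A minimal, Ray-free illustration of the failure mode and the fix (stand-in integers replace ObjectRefs):

```python
# Stand-ins for bundle.block_refs: each inner list mimics one RefBundle.
bundles = [[1, 2], [3], [4, 5]]

# What the flagged code builds: a bare generator expression, which
# fails ray.get's isinstance(object_refs, list) check at runtime.
gen = (ref for bundle in bundles for ref in bundle)
assert not isinstance(gen, list)

# The fix: materialize a list comprehension before calling ray.get.
refs = [ref for bundle in bundles for ref in bundle]
assert isinstance(refs, list)
print(refs)  # [1, 2, 3, 4, 5]
```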

# Otherwise fall through to generate_aggregate_fn, which will run
# local aggregation for small datasets and sort-based distributed
# aggregation for larger ones. Users can set the threshold to 0 to
# always use hash-shuffle aggregation.


Hash shuffle silently falls back to sort shuffle

Medium Severity

When shuffle_strategy is HASH_SHUFFLE and small_dataset_agg_threshold_bytes > 0 (the default), datasets exceeding the threshold silently fall through to generate_aggregate_fn, which uses PullBasedShuffleTaskScheduler instead of the HashAggregateOperator the user configured. This is a behavioral regression — users who explicitly chose HASH_SHUFFLE get sort-based aggregation for large datasets without any warning.
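
One way to make that fallback visible, sketched here as a hypothetical guard (not part of this PR; names are illustrative):

```python
import warnings


def maybe_warn_hash_shuffle_fallback(shuffle_strategy, threshold_bytes, input_size_bytes):
    """Hypothetical helper: warn when an explicit HASH_SHUFFLE choice is
    overridden by the fast-path's large-data fallback."""
    falls_back = (
        shuffle_strategy == "HASH_SHUFFLE"
        and threshold_bytes > 0
        and input_size_bytes > threshold_bytes
    )
    if falls_back:
        warnings.warn(
            "shuffle_strategy=HASH_SHUFFLE was configured, but the input "
            "exceeds small_dataset_agg_threshold_bytes; using sort-based "
            "distributed aggregation instead.",
            UserWarning,
        )
    return falls_back
```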



@ray-gardener ray-gardener bot added the data (Ray Data-related issues) and community-contribution (Contributed by the community) labels Feb 28, 2026

Labels

  • community-contribution: Contributed by the community
  • data: Ray Data-related issues


Development

Successfully merging this pull request may close these issues.

  • [Data] Slow Ray Data groupby/aggregate on small dataset on single-node local cluster
  • Ray fails to serialize self-reference objects
