[data] HashShuffleAggregator break down block on finalize #58603
bveeramani merged 14 commits into ray-project:master from
Conversation
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
Code Review
This pull request correctly identifies and aims to solve an important issue with HashShuffleAggregator handling very large blocks, which can lead to out-of-memory errors. The approach of using BlockOutputBuffer to break down large blocks is sound. However, the current implementation of the finalize method introduces several critical issues, including a risk of deadlocks, potential data loss, and incorrect metrics reporting. My review provides a detailed comment with a suggested replacement for the finalize method that addresses these problems while preserving the original intent of the change.
Force-pushed the …/aggregator-yield-block-size branch from 26404f8 to ec0e610
if partition_id in self._finalizing_tasks:
    self._finalizing_tasks.pop(partition_id)

# Update Finalize Metrics on task completion
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
who is reviewing this PR?
# so we do not break the block down further.
if target_max_block_size is not None:
    # Creating a block output buffer per partition finalize task because:
    # 1. Need to keep track of which tasks have already been finalized
I couldn't understand what (1) means. Could you elaborate/revise?
Updated. I don't think (1) makes sense either, lol. My intent was to keep track of re-finalizing tasks, but that would mean additional stats and possibly additional locks, i.e. more complexity, so I kept it simple.
@@ -1560,17 +1570,38 @@ def submit(self, input_seq_id: int, partition_id: int, partition_shard: Block):
    def finalize(
        self, partition_id: int
    ) -> AsyncGenerator[Union[Block, "BlockMetadataWithSchema"], None]:
Nit: Out-of-scope for this PR, but I think this is a regular generator, not async
-    ) -> AsyncGenerator[Union[Block, "BlockMetadataWithSchema"], None]:
+    ) -> Generator[Union[Block, "BlockMetadataWithSchema"], None]:
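The nit above is easy to verify: in Python, a `def` containing `yield` is a regular synchronous generator regardless of its return annotation; only `async def` with `yield` produces an async generator. A minimal illustration (the `counts` function is hypothetical, not from the PR):

```python
import inspect
from typing import Generator


def counts(n: int) -> Generator[int, None, None]:
    # `def` + `yield` makes a plain generator function;
    # `async def` + `yield` would be needed for an AsyncGenerator.
    for i in range(n):
        yield i


print(list(counts(3)))  # → [0, 1, 2]
print(inspect.isgeneratorfunction(counts))  # → True
print(inspect.isasyncgenfunction(counts))  # → False
```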
Description

`HashShuffleAggregator` currently doesn't break big blocks into smaller blocks (or combine smaller blocks into bigger ones). For large blocks this can be very problematic, because the block being returned will spill to disk.

Why this is better:
- Downstream consumers can rely on `streaming_gen` backpressure to avoid materializing the entire object.
- An alternative is `StreamingRepartition`, but that is more work for the user.

This PR addresses this by using `BlockOutputBuffer` to reshape the blocks back to `data_context.target_max_block_size`.

Related issues

None

Additional information

Encountered this personally with a 180GiB block, which would OOD.
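As a toy illustration of the reshaping idea (not Ray's actual implementation, which operates on Arrow blocks via `BlockOutputBuffer`), an oversized block can be yielded as chunks bounded by a target size instead of being returned as one object:

```python
from typing import Iterator


def split_block(block: bytes, target_max_block_size: int) -> Iterator[bytes]:
    """Yield slices of `block`, each at most `target_max_block_size` bytes,
    so no single returned object exceeds the target size."""
    for start in range(0, len(block), target_max_block_size):
        yield block[start : start + target_max_block_size]


chunks = list(split_block(b"x" * 10, 4))
print([len(c) for c in chunks])  # → [4, 4, 2]
```

With a 180GiB input and a typical target block size, this kind of chunking keeps each yielded object small enough to stay in memory instead of triggering spilling or OOD.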