[Data] Streaming Partition enforce row_num per block#57984
raulchen merged 62 commits into ray-project:master
Conversation
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Bug: Test Fails to Verify Row Counts Post-Repartitioning
@owenowenisme My first pass looks good. @bveeramani will do a review of the implementation design inside MapOperator.
srinathk10
left a comment
LGTM, but @alexeykudinkin or @bveeramani need to review the impl design.
bveeramani
left a comment
I think the high-level idea sounds reasonable, but the current implementation adds a lot of complexity to the MapOperator interfaces.
Could you figure out how to implement this in a way that:
- Avoids introducing abstractions that overlap with existing ones (e.g., _TaskInput/TaskContext and StreamingRepartitionTaskBuilder/BlockRefBundler)
- Avoids adding streaming-repartition-specific methods to the MapOperator base class (e.g., _submit_task_input and set_task_input_builder)
- Makes the correctness easy to test without requiring tens of E2E test cases?
elif metadata.num_rows != block_slice.num_rows:
    # Partial block - estimate size based on rows
    per_row = metadata.size_bytes / metadata.num_rows
    total += max(1, int(math.ceil(per_row * block_slice.num_rows)))
Looks like we are double-slicing the metadata: once here and once in _slice_block_metadata.
I think let's remove _slice_block_metadata and document that when slices are present, metadata is still the original metadata.
Actually, _slice_block_metadata is wrong, because then you cannot slice an already-sliced block.
Let's fix it and add a unit test.
Added a unit test and removed _slice_block_metadata.
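To make the "cannot slice an already-sliced block" point concrete, here is a minimal, hypothetical sketch (the names BlockSlice and compose_slice are illustrative, not the actual Ray Data API) of slice composition that keeps offsets relative to the original block, so slicing a sliced view stays correct:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlockSlice:
    start: int  # inclusive, relative to the ORIGINAL block
    end: int    # exclusive, relative to the ORIGINAL block

    @property
    def num_rows(self) -> int:
        return self.end - self.start


def compose_slice(existing: Optional[BlockSlice], start: int, end: int) -> BlockSlice:
    """Take rows [start, end) of the (possibly already sliced) view.

    Offsets are shifted by the existing slice's start, so the result is
    still expressed against the original block's row numbering.
    """
    base = existing.start if existing is not None else 0
    new_start, new_end = base + start, base + end
    assert new_start <= new_end, "slice must be non-decreasing"
    if existing is not None:
        assert new_end <= existing.end, "slice exceeds the existing view"
    return BlockSlice(new_start, new_end)
```

Because the composed slice is always expressed against the original block, the original metadata can be kept unchanged alongside it, which matches the resolution above.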
else:
    assert len(self.blocks) == len(
        self.slices
    ), "Number of blocks and slices must match"
Let's also validate that the slices have valid ranges.
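A hypothetical sketch of the range validation being requested (the names blocks_num_rows and validate_slices are illustrative, not the actual fields):

```python
def validate_slices(blocks_num_rows, slices):
    """Check that each (start, end) slice is a valid range for its block."""
    assert len(blocks_num_rows) == len(slices), (
        "Number of blocks and slices must match"
    )
    for num_rows, (start, end) in zip(blocks_num_rows, slices):
        # Each slice must be an in-bounds, non-negative [start, end) range.
        assert 0 <= start <= end <= num_rows, (
            f"invalid slice ({start}, {end}) for block with {num_rows} rows"
        )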
| """ | ||
|
|
||
|
|
||
| class StreamingRepartitionRefBundler(BaseRefBundler): |
Please also add unit tests for this class.
Added in test_operators, just like BlockRefBundler.
if self._total_pending_rows >= self._target_num_rows or flush_remaining:
    rows_needed_from_last_bundle = (
        self._total_pending_rows % self._target_num_rows
    )
this seems wrong.
should be
self._total_pending_rows % self._target_num_rows - self._total_pending_rows % self._target_num_rows
I think you meant self._pending_bundles[-1].num_rows() - self._total_pending_rows % self._target_num_rows ?
Btw, self._pending_bundles[-1].num_rows() - self._total_pending_rows % self._target_num_rows will never be negative, but I added an assertion just in case.
class StreamingRepartitionRefBundler(BaseRefBundler):
    """Incrementally builds task inputs to produce target-sized outputs.
Does this ref bundler generate blocks of exactly target_num_rows_per_block, or multiples of target_num_rows_per_block?
Updated the description.
elif metadata.num_rows != block_slice.num_rows:
    # Partial block - estimate size based on rows
    per_row = metadata.size_bytes / metadata.num_rows
    total += max(1, int(math.ceil(per_row * block_slice.num_rows)))
Bug: Incorrect Size for Empty Data
When calculating size_bytes() for a slice with zero rows, the code uses max(1, int(math.ceil(per_row * block_slice.num_rows))) which returns 1 byte even when block_slice.num_rows is 0. An empty slice (0 rows) should contribute 0 bytes to the total size, not 1 byte. The max(1, ...) guard appears intended to prevent zero-byte estimates for non-empty slices but incorrectly applies to empty slices as well.
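A minimal sketch of the fix described above, assuming a standalone helper (the name sliced_size_bytes and its signature are illustrative): return 0 for an empty slice, and apply the 1-byte clamp only to non-empty partial slices.

```python
import math


def sliced_size_bytes(size_bytes: int, num_rows: int, slice_num_rows: int) -> int:
    """Estimate the byte size of a slice of a block.

    An empty slice contributes 0 bytes; a full slice uses the exact
    metadata; a non-empty partial slice is estimated per-row and clamped
    to at least 1 byte.
    """
    if slice_num_rows == 0:
        return 0  # empty slices contribute nothing
    if slice_num_rows == num_rows:
        return size_bytes  # full block: use the exact metadata
    per_row = size_bytes / num_rows
    # The max(1, ...) guard now only applies to non-empty slices.
    return max(1, math.ceil(per_row * slice_num_rows))
```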
        rows_needed_from_last_bundle
    )
    pending_bundles.append(sliced_bundle)
    self._ready_bundles.append(RefBundle.merge_ref_bundles(pending_bundles))
Bug: Bundle Exclusion Fails on Exact Completion
When rows_needed_from_last_bundle equals zero, the last bundle should be excluded from the ready bundle but isn't. This occurs when the last bundle's row count exactly equals the remainder (_total_pending_rows % _target_num_rows). For example, with 15 total rows, target of 10, and last bundle of 5 rows, the code outputs all 15 rows instead of outputting 10 rows and keeping 5 pending. The condition at line 39 should handle the zero case by removing the last bundle from pending_bundles before merging.
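The exact-completion case can be modeled on plain per-bundle row counts rather than real RefBundles. This split_ready helper is a hypothetical simplification, not the actual bundler code; it shows the last bundle being excluded whole when its row count equals the remainder:

```python
def split_ready(pending, target):
    """Split per-bundle row counts into (ready_counts, leftover_counts).

    `pending` holds the row count of each pending bundle, with
    sum(pending) >= target. Rows beyond the largest multiple of `target`
    stay pending for the next flush.
    """
    total = sum(pending)
    leftover = total % target
    rows_needed_from_last = pending[-1] - leftover
    assert rows_needed_from_last >= 0
    if rows_needed_from_last == 0:
        # Exact completion: the last bundle is entirely leftover, so it is
        # excluded from the ready output and kept pending whole.
        return pending[:-1], [pending[-1]]
    # Otherwise only a prefix of the last bundle joins the ready output.
    ready = pending[:-1] + [rows_needed_from_last]
    return ready, ([leftover] if leftover else [])
```

With the example above (15 total rows, target of 10, last bundle of 5 rows), split_ready([10, 5], 10) returns ([10], [5]): 10 rows are emitted and 5 stay pending, instead of all 15 being merged.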
assert flat_out == list(range(n))

@pytest.mark.parametrize(
Nit: this should be put under tests/unit, as it's a unit test.
    # Test with empty blocks
    3,
    [[[1]], [[]], [[2, 3]], [[]], [[4, 5]]],
    [3, 2],  # Expected: [1,2,3] and [4,5]
Let's also check the block contents.
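A hypothetical sketch of what checking contents (rather than only row counts) could look like; repartition_blocks is a stand-in for the code under test that simply concatenates rows and re-chunks them, and the names are illustrative:

```python
def repartition_blocks(blocks, target_num_rows):
    """Reference model: flatten all rows, then chunk by target_num_rows."""
    rows = [row for block in blocks for row in block]
    return [
        rows[i:i + target_num_rows]
        for i in range(0, len(rows), target_num_rows)
    ]


def check_repartition(target_num_rows, blocks, expected_blocks):
    out = repartition_blocks(blocks, target_num_rows)
    # Asserting contents also implies the expected per-block row counts.
    assert out == expected_blocks


# Empty input blocks are skipped, and ordering is preserved.
check_repartition(3, [[1], [], [2, 3], [], [4, 5]], [[1, 2, 3], [4, 5]])
```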
Description
Currently, streaming repartition applies a map transform to each block independently and does not merge leftover rows across blocks, so it cannot guarantee exact row counts per output block. This PR introduces a new design that computes, on the driver, the input block ranges for every output block. It avoids driver-side block fetching while ensuring correctness and leveraging the efficiency of parallel map tasks.
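The driver-side planning described above can be sketched as follows. This is a hypothetical simplification, not the actual Ray Data implementation: given only per-block row counts (no block data is fetched), it computes, for each output block, the list of (input_block_index, start, end) row ranges that parallel map tasks would then slice and merge.

```python
def compute_output_ranges(block_num_rows, target_num_rows):
    """Plan output blocks as lists of (block_index, start, end) input ranges."""
    outputs, current, filled = [], [], 0
    for i, n in enumerate(block_num_rows):
        start = 0
        while start < n:
            # Take as many rows as fit in the current output block.
            take = min(n - start, target_num_rows - filled)
            current.append((i, start, start + take))
            start += take
            filled += take
            if filled == target_num_rows:
                outputs.append(current)
                current, filled = [], 0
    if current:  # leftover rows form a final, smaller output block
        outputs.append(current)
    return outputs
```

For example, with input blocks of 5 and 7 rows and a target of 4 rows, the second output block spans both inputs: rows 4-5 of block 0 plus rows 0-3 of block 1. Every output block except possibly the last has exactly target_num_rows rows.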
Related issues
Closes #57165
Additional information