Skip to content

[Data] Fix resource reservation by excluding completed operators' usages#56319

Merged
alexeykudinkin merged 12 commits intoray-project:masterfrom
xinyuangui2:remove-upstreams-ineligible-from-reserves
Sep 10, 2025
Merged

[Data] Fix resource reservation by excluding completed operators' usages#56319
alexeykudinkin merged 12 commits intoray-project:masterfrom
xinyuangui2:remove-upstreams-ineligible-from-reserves

Conversation

@xinyuangui2
Copy link
Contributor

@xinyuangui2 xinyuangui2 commented Sep 8, 2025

Problem

The ReservationOpResourceAllocator was incorrectly accounting for resource usage when calculating available resources for reservation. Specifically, it wasn't properly handling completed operators who have blocks in the output queue.

The ReadFiles operator below consumes 50 GB of object store memory and should be excluded from reservation, but it is currently not.

image

Solution

Added logic to identify and subtract resource usage specifically from completed physical operators:

Testing results

Before the fix

image

After the fix

image

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 changed the title [data] remove upstream ineligible resource from reservation [Data] Fix resource reservation by excluding ineligible operators without eligible upstream Sep 8, 2025
@xinyuangui2 xinyuangui2 marked this pull request as ready for review September 8, 2025 19:20
@xinyuangui2 xinyuangui2 requested a review from a team as a code owner September 8, 2025 19:20
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
@ray-gardener ray-gardener bot added the data Ray Data-related issues label Sep 8, 2025
Signed-off-by: xgui <xgui@anyscale.com>
@xinyuangui2 xinyuangui2 changed the title [Data] Fix resource reservation by excluding ineligible operators without eligible upstream [Data] Fix resource reservation by excluding completed operators' usages Sep 9, 2025
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
xinyuangui2 and others added 3 commits September 9, 2025 12:14
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Signed-off-by: xgui <xgui@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) September 10, 2025 17:18
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Sep 10, 2025
@alexeykudinkin alexeykudinkin merged commit 48a6e7f into ray-project:master Sep 10, 2025
7 checks passed
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…ges (ray-project#56319)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Problem
The `ReservationOpResourceAllocator` was incorrectly accounting for
resource usage when calculating available resources for reservation.
Specifically, it wasn't properly handling completed operators who have
blocks in the output queue.

The `ReadFiles` operator below consumes 50 GB of object store memory and
should be excluded from reservation, but it is currently not.

<img width="1628" height="281" alt="image"
src="https://github.com/user-attachments/assets/8a80902d-7f88-4263-bc97-a3dee519b401"
/>

## Solution
Added logic to identify and subtract resource usage specifically from
completed physical operators:

## Testing results

Before the fix

<img width="958" height="653" alt="image"
src="https://github.com/user-attachments/assets/432dc94e-bbe1-4ecb-b1c3-a6e201da724a"
/>

After the fix

<img width="1567" height="700" alt="image"
src="https://github.com/user-attachments/assets/2b780d68-a208-4c6b-a250-dee0823e9083"
/>


## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: zac <zac@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…ges (#56319)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Problem
The `ReservationOpResourceAllocator` was incorrectly accounting for
resource usage when calculating available resources for reservation.
Specifically, it wasn't properly handling completed operators who have
blocks in the output queue.

The `ReadFiles` operator below consumes 50 GB of object store memory and
should be excluded from reservation, but it is currently not.

<img width="1628" height="281" alt="image"
src="https://github.com/user-attachments/assets/8a80902d-7f88-4263-bc97-a3dee519b401"
/>

## Solution
Added logic to identify and subtract resource usage specifically from
completed physical operators:

## Testing results

Before the fix

<img width="958" height="653" alt="image"
src="https://github.com/user-attachments/assets/432dc94e-bbe1-4ecb-b1c3-a6e201da724a"
/>

After the fix

<img width="1567" height="700" alt="image"
src="https://github.com/user-attachments/assets/2b780d68-a208-4c6b-a250-dee0823e9083"
/>

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ges (ray-project#56319)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Problem
The `ReservationOpResourceAllocator` was incorrectly accounting for
resource usage when calculating available resources for reservation.
Specifically, it wasn't properly handling completed operators who have
blocks in the output queue.

The `ReadFiles` operator below consumes 50 GB of object store memory and
should be excluded from reservation, but it is currently not.

<img width="1628" height="281" alt="image"
src="https://github.com/user-attachments/assets/8a80902d-7f88-4263-bc97-a3dee519b401"
/>

## Solution
Added logic to identify and subtract resource usage specifically from
completed physical operators:

## Testing results

Before the fix

<img width="958" height="653" alt="image"
src="https://github.com/user-attachments/assets/432dc94e-bbe1-4ecb-b1c3-a6e201da724a"
/>

After the fix

<img width="1567" height="700" alt="image"
src="https://github.com/user-attachments/assets/2b780d68-a208-4c6b-a250-dee0823e9083"
/>


## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ges (ray-project#56319)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Problem
The `ReservationOpResourceAllocator` was incorrectly accounting for
resource usage when calculating available resources for reservation.
Specifically, it wasn't properly handling completed operators who have
blocks in the output queue.

The `ReadFiles` operator below consumes 50 GB of object store memory and
should be excluded from reservation, but it is currently not.

<img width="1628" height="281" alt="image"
src="https://github.com/user-attachments/assets/8a80902d-7f88-4263-bc97-a3dee519b401"
/>

## Solution
Added logic to identify and subtract resource usage specifically from
completed physical operators:

## Testing results

Before the fix

<img width="958" height="653" alt="image"
src="https://github.com/user-attachments/assets/432dc94e-bbe1-4ecb-b1c3-a6e201da724a"
/>

After the fix

<img width="1567" height="700" alt="image"
src="https://github.com/user-attachments/assets/2b780d68-a208-4c6b-a250-dee0823e9083"
/>


## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
Co-authored-by: Alexey Kudinkin <alexey.kudinkin@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants