[data] fix resource allocator not respecting max resource requirement#59412
Conversation
…ment Signed-off-by: Hao Chen <chenh1024@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request correctly fixes an issue where the resource allocator was not respecting the maximum resource requirements of an operator. The changes involve capping the shared resource allocation for each operator and are well-tested with a new unit test that validates the budget calculation. I've found one potential issue where leftover resources are redistributed without checking the maximum resource limits of the receiving operator. My detailed feedback is in the review comments.
Signed-off-by: Hao Chen <chenh1024@gmail.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request correctly introduces a cap on resource allocation to respect max_resource_usage for operators. The changes include capping the shared resource allocation and adding a mechanism to redistribute any excess resources. The logic for capping is sound, and the addition of a new test case is good. I've found one potential issue with the redistribution logic that could violate the maximum resource constraint under certain conditions and have provided a suggestion to fix it.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refines the resource allocation logic within the ResourceManager by introducing a new helper method, _get_total_reserved, to calculate an operator's total reserved resources. The update_budgets method was enhanced to cap shared resource allocation for individual operators based on their max_resource_usage to prevent over-allocation. Additionally, any remaining shared resources are now distributed to the most downstream uncapped operator. New unit tests were added to validate these changes, covering scenarios where operators are capped by their maximum resource usage and ensuring correct distribution of leftover shared resources.
| # Total allocation = max(total_reserved, op_usage) + op_shared | ||
| # This ensures excess resources stay in remaining_shared for other operators. | ||
| _, max_resource_usage = op.min_max_resource_requirements() | ||
| if max_resource_usage is not None: |
There was a problem hiding this comment.
Here and on L912 -- when can max_resource_usage be None? I think the return type is tuple[ExecutionResources, ExecutionResources] (i.e., always non-None)?
If we are returning None in some places, I think we should return for_limits() instead. I think polymorphism makes the harder to read and reason about
| if op.min_max_resource_requirements()[1].gpu > 0: | ||
| if ( | ||
| max_resource_usage != ExecutionResources.inf() | ||
| and max_resource_usage.gpu > 0 |
There was a problem hiding this comment.
Bug: GPU allocation skipped for uncapped operators
The GPU allocation condition was changed from checking only max_resource_usage.gpu > 0 to also requiring max_resource_usage != ExecutionResources.inf(). Since ExecutionResources.inf() has gpu=float("inf") (which is > 0), the old code would allocate GPUs to operators with infinite resource limits. The new code excludes these "uncapped" operators from GPU allocation entirely, setting target_num_gpu = 0 for them. This behavioral change isn't mentioned in the PR description and could starve GPU resources for operators that don't have explicit resource requirements but still use GPUs.
There was a problem hiding this comment.
If I'm not mistaken, the old logic is wrong.
Because the default max requirement is inf, so this is always true.
cc @bveeramani
…ray-project#59412) This PR adds a cap on the total budget allocation by max_resource_usage in ReservationOpResourceAllocator. Previously, the reservation was capped by max_resource_usage, but the total budget (reservation + shared resources) could exceed the max. This could lead to operators being allocated more resources than they can use, while other operators are starved. Changes Cap total allocation by max_resource_usage: - Total allocation = max(total_reserved, op_usage) + op_shared - We now cap op_shared so that total allocation ≤ max_resource_usage - Excess shared resources remain in remaining_shared for other operators Redistribute remaining shared resources: - After the allocation loop, any remaining shared resources (from caps) are given to the most downstream uncapped operator - An operator is "uncapped" if its max_resource_usage == ExecutionResources.inf() - If all operators are capped, remaining resources are not redistributed Tests - test_budget_capped_by_max_resource_usage: Tests that capped operators don't receive excess shared resources, and remaining goes to uncapped downstream op - test_budget_capped_by_max_resource_usage_all_capped: Tests that when all operators are capped, remaining shared resources are not redistributed --------- Signed-off-by: Hao Chen <chenh1024@gmail.com>
…ray-project#59412) This PR adds a cap on the total budget allocation by max_resource_usage in ReservationOpResourceAllocator. Previously, the reservation was capped by max_resource_usage, but the total budget (reservation + shared resources) could exceed the max. This could lead to operators being allocated more resources than they can use, while other operators are starved. Changes Cap total allocation by max_resource_usage: - Total allocation = max(total_reserved, op_usage) + op_shared - We now cap op_shared so that total allocation ≤ max_resource_usage - Excess shared resources remain in remaining_shared for other operators Redistribute remaining shared resources: - After the allocation loop, any remaining shared resources (from caps) are given to the most downstream uncapped operator - An operator is "uncapped" if its max_resource_usage == ExecutionResources.inf() - If all operators are capped, remaining resources are not redistributed Tests - test_budget_capped_by_max_resource_usage: Tests that capped operators don't receive excess shared resources, and remaining goes to uncapped downstream op - test_budget_capped_by_max_resource_usage_all_capped: Tests that when all operators are capped, remaining shared resources are not redistributed --------- Signed-off-by: Hao Chen <chenh1024@gmail.com>
#59412 updated the reservation resource allocator to cap resource allocations based on the resources returned by `min_max_resource_usage`. The problem is that running tasks can produce any amounts of data, so it doesn't make sense to cap by `obj_store_mem_max_pending_output_per_task * concurrency`. This PR fixes that issue by setting the max object store memory usage to 'inf'. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
) ray-project#59412 updated the reservation resource allocator to cap resource allocations based on the resources returned by `min_max_resource_usage`. The problem is that running tasks can produce any amounts of data, so it doesn't make sense to cap by `obj_store_mem_max_pending_output_per_task * concurrency`. This PR fixes that issue by setting the max object store memory usage to 'inf'. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
) ray-project#59412 updated the reservation resource allocator to cap resource allocations based on the resources returned by `min_max_resource_usage`. The problem is that running tasks can produce any amounts of data, so it doesn't make sense to cap by `obj_store_mem_max_pending_output_per_task * concurrency`. This PR fixes that issue by setting the max object store memory usage to 'inf'. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ray-project#59412) This PR adds a cap on the total budget allocation by max_resource_usage in ReservationOpResourceAllocator. Previously, the reservation was capped by max_resource_usage, but the total budget (reservation + shared resources) could exceed the max. This could lead to operators being allocated more resources than they can use, while other operators are starved. Changes Cap total allocation by max_resource_usage: - Total allocation = max(total_reserved, op_usage) + op_shared - We now cap op_shared so that total allocation ≤ max_resource_usage - Excess shared resources remain in remaining_shared for other operators Redistribute remaining shared resources: - After the allocation loop, any remaining shared resources (from caps) are given to the most downstream uncapped operator - An operator is "uncapped" if its max_resource_usage == ExecutionResources.inf() - If all operators are capped, remaining resources are not redistributed Tests - test_budget_capped_by_max_resource_usage: Tests that capped operators don't receive excess shared resources, and remaining goes to uncapped downstream op - test_budget_capped_by_max_resource_usage_all_capped: Tests that when all operators are capped, remaining shared resources are not redistributed --------- Signed-off-by: Hao Chen <chenh1024@gmail.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
) ray-project#59412 updated the reservation resource allocator to cap resource allocations based on the resources returned by `min_max_resource_usage`. The problem is that running tasks can produce any amounts of data, so it doesn't make sense to cap by `obj_store_mem_max_pending_output_per_task * concurrency`. This PR fixes that issue by setting the max object store memory usage to 'inf'. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: peterxcli <peterxcli@gmail.com>
This PR adds a cap on the total budget allocation by max_resource_usage in ReservationOpResourceAllocator.
Previously, the reservation was capped by max_resource_usage, but the total budget (reservation + shared resources) could exceed the max. This could lead to operators being allocated more resources than they can use, while other operators are starved.
Changes
Cap total allocation by max_resource_usage:
Redistribute remaining shared resources:
Tests