[Data] - Don't reserve GPU budget for non-GPU tasks#59789
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini summary
Code Review
This pull request introduces a sensible change to prevent non-GPU operators from reserving GPU budget, which is particularly important for unbounded operators. The changes in ActorPoolMapOperator and TaskPoolMapOperator correctly adjust the maximum resource requirements based on whether the operator uses GPUs. A comprehensive test case has been added to validate this behavior in a realistic inference pipeline, ensuring that GPU resources are allocated efficiently. The implementation is clean and the test is thorough. I've added one minor suggestion to refactor the test setup for better readability.
python/ray/data/tests/test_reservation_based_resource_allocator.py
Summary of Changes

This pull request refines Ray Data's resource allocation mechanism to prevent non-GPU tasks from being assigned GPU budget. By explicitly capping GPU requirements to zero for CPU-only operators, the system ensures that valuable GPU resources are exclusively available for tasks that genuinely need them, leading to improved resource efficiency and better performance for mixed-workload execution.

Highlights
Activity
bveeramani left a comment
LGTM except comment about CPUs
```python
# Use infinite limits, but cap GPU to 0 if this operator doesn't use GPUs.
# This prevents non-GPU operators from hoarding GPU budget.
max_resource_usage = ExecutionResources.for_limits(
    gpu=None if num_gpus_per_actor else 0
)
```
I think this implementation special-cases GPUs because it assumes that all tasks/actors require logical CPUs and memory, but I don't think that assumption holds.
For example, here's a common thing users do:
`ds.map_batches(Inference, num_gpus=1, batch_size=...)`

In this case, I don't think the `Inference` actors request any logical CPUs.
For this reason, should we also include CPUs and memory?
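The suggestion above could look like the following minimal sketch. `Resources` and `max_resource_usage` here are hypothetical stand-ins for Ray Data's `ExecutionResources.for_limits`, used only to illustrate capping every unused dimension (CPU and memory as well as GPU), not the actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for Ray Data's ExecutionResources.
@dataclass
class Resources:
    cpu: Optional[float] = None     # None means "unlimited budget"
    gpu: Optional[float] = None
    memory: Optional[float] = None

def max_resource_usage(per_task_cpu: float, per_task_gpu: float,
                       per_task_memory: float) -> Resources:
    # Cap each dimension to 0 when the operator doesn't use it, so the
    # operator never reserves budget for a resource it won't consume.
    return Resources(
        cpu=None if per_task_cpu else 0,
        gpu=None if per_task_gpu else 0,
        memory=None if per_task_memory else 0,
    )

# A GPU-only inference operator (num_gpus=1, no logical CPUs) then keeps
# its CPU and memory budgets at 0 instead of reserving unlimited amounts.
limits = max_resource_usage(per_task_cpu=0, per_task_gpu=1, per_task_memory=0)
print(limits)  # Resources(cpu=0, gpu=None, memory=0)
```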
Also simplify this conditional to be just:
`gpu=0 if num_gpus == 0 else max_actors * num_gpus`
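The simplified conditional suggested above can be illustrated as a small sketch; `max_gpu_usage`, `num_gpus`, and `max_actors` are hypothetical names standing in for the actor pool's configuration:

```python
# Hypothetical sketch: the pool's maximum GPU usage is 0 for CPU-only
# operators, otherwise the GPU requirement of a fully scaled-out pool.
def max_gpu_usage(num_gpus: float, max_actors: int) -> float:
    return 0 if num_gpus == 0 else max_actors * num_gpus

print(max_gpu_usage(num_gpus=0, max_actors=8))  # 0
print(max_gpu_usage(num_gpus=1, max_actors=8))  # 8
```

Unlike `gpu=None` (an unbounded limit), this expresses a finite cap derived from the pool's maximum size.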
```python
# Use infinite limits, but cap GPU to 0 if this operator doesn't use GPUs.
# This prevents non-GPU operators from hoarding GPU budget.
max_resource_usage = ExecutionResources.for_limits(
    gpu=None if per_task.gpu else 0
)
```
Sort of out-of-scope for this PR, but this might cause issues if users (or optimization rules like ConfigureMapTaskMemoryRule) specify ray_remote_args_fn.
For example, if a user does:
`ds.map_batches(..., ray_remote_args_fn=lambda: {"num_cpus": 10})`

Then the max resource usage will be `num_cpus=1`, but each task actually requires 10 CPUs.
An easy fix might be to consider `ray_remote_args_fn` when returning `incremental_resource_usage`.
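The fix suggested above could be sketched as follows. This is a hypothetical simplification, not Ray Data's actual `incremental_resource_usage` signature: it just shows dynamically supplied remote args overriding the static ones before per-task usage is computed.

```python
from typing import Callable, Optional

def incremental_resource_usage(
    ray_remote_args: dict,
    ray_remote_args_fn: Optional[Callable[[], dict]] = None,
) -> dict:
    # Merge static remote args with any dynamically generated ones, so
    # args like {"num_cpus": 10} from ray_remote_args_fn are not undercounted.
    merged = dict(ray_remote_args)
    if ray_remote_args_fn is not None:
        merged.update(ray_remote_args_fn())  # dynamic args take precedence
    return {
        "cpu": merged.get("num_cpus", 1),
        "gpu": merged.get("num_gpus", 0),
    }

usage = incremental_resource_usage(
    {"num_cpus": 1}, ray_remote_args_fn=lambda: {"num_cpus": 10}
)
print(usage)  # {'cpu': 10, 'gpu': 0}
```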
Signed-off-by: Goutam <goutam@anyscale.com>
python/ray/data/_internal/execution/operators/task_pool_map_operator.py
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
Signed-off-by: lee1258561 <lee1258561@gmail.com>
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Follow-up for this PR: #59632 (comment)
Only assign GPU budget if the operator requires it.
Image classification Release Test: https://buildkite.com/ray-project/release/builds/73917#019b90b1-2a29-424f-861b-8715909fe02e
Related issues
Additional information