
[Data] - Don't reserve GPU budget for non-GPU tasks #59789

Merged
alexeykudinkin merged 7 commits into ray-project:master from
goutamvenkat-anyscale:goutam/no_gpu_budget_for_non_gpu
Jan 6, 2026

Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Dec 31, 2025

Description

Follow-up to this PR: #59632 (comment)

Only assign GPU budget if the operator requires it.

Image classification Release Test: https://buildkite.com/ray-project/release/builds/73917#019b90b1-2a29-424f-861b-8715909fe02e
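The rule the PR implements can be sketched as a standalone function (a hypothetical simplification with made-up types; the real logic lives in Ray Data's ExecutionResources.for_limits()):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceLimits:
    # None mirrors the "unbounded" semantics of
    # ExecutionResources.for_limits() in Ray Data.
    cpu: Optional[float] = None
    gpu: Optional[float] = None


def max_resource_usage(num_gpus_per_actor: float) -> ResourceLimits:
    """Cap the GPU budget to 0 for operators that request no GPUs."""
    # A CPU-only operator must never reserve GPU budget, even when
    # every other limit is left infinite.
    return ResourceLimits(gpu=None if num_gpus_per_actor else 0)
```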

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner December 31, 2025 19:07
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Dec 31, 2025
@goutamvenkat-anyscale
Contributor Author

/gemini summary

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a sensible change to prevent non-GPU operators from reserving GPU budget, which is particularly important for unbounded operators. The changes in ActorPoolMapOperator and TaskPoolMapOperator correctly adjust the maximum resource requirements based on whether the operator uses GPUs. A comprehensive test case has been added to validate this behavior in a realistic inference pipeline, ensuring that GPU resources are allocated efficiently. The implementation is clean and the test is thorough. I've added one minor suggestion to refactor the test setup for better readability.

@gemini-code-assist
Contributor

Summary of Changes

This pull request refines Ray Data's resource allocation mechanism to prevent non-GPU tasks from being assigned GPU budget. By explicitly capping GPU requirements to zero for CPU-only operators, the system ensures that valuable GPU resources are exclusively available for tasks that genuinely need them, leading to improved resource efficiency and better performance for mixed-workload execution.

Highlights

  • GPU Budget Allocation Logic: Modified the resource allocation logic for ActorPoolMapOperator and TaskPoolMapOperator to explicitly set GPU budget to zero for operators that do not require GPUs, even when using infinite resource limits.
  • Resource Hoarding Prevention: This change prevents non-GPU tasks from inadvertently 'hoarding' or reserving GPU resources, ensuring more efficient utilization of GPU hardware.
  • New Test Case: Introduced a comprehensive test case (test_gpu_not_reserved_for_non_gpu_operators) to validate that GPU resources are correctly reserved only for GPU-dependent operators in a simulated inference pipeline.


Changelog
  • python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
    • Updated min_max_resource_requirements to set gpu=0 in ExecutionResources.for_limits() if num_gpus_per_actor is zero, preventing non-GPU actors from reserving GPU budget.
  • python/ray/data/_internal/execution/operators/task_pool_map_operator.py
    • Updated min_max_resource_requirements to set gpu=0 in ExecutionResources.for_limits() if per_task.gpu is zero, preventing non-GPU tasks from reserving GPU budget.
  • python/ray/data/tests/test_reservation_based_resource_allocator.py
    • Added test_gpu_not_reserved_for_non_gpu_operators to verify that CPU-only operators (e.g., Read, Write) are allocated zero GPUs, while GPU-dependent operators (e.g., Infer1, Infer2) correctly reserve GPUs in a pipeline.
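The behavior the new test checks can be illustrated with a toy allocator (hypothetical names and shapes; the real test exercises Ray Data's reservation-based resource allocator):

```python
def allocate_gpu_budget(operators, total_gpus):
    """Split a GPU budget only among operators that request GPUs.

    `operators` is a list of (name, needs_gpu) pairs; CPU-only
    operators always receive a budget of zero.
    """
    gpu_ops = [name for name, needs_gpu in operators if needs_gpu]
    share = total_gpus / len(gpu_ops) if gpu_ops else 0
    return {name: (share if needs_gpu else 0) for name, needs_gpu in operators}


# A pipeline shaped like the one in the new test: reads and writes are
# CPU-only, while the two inference stages need GPUs.
pipeline = [("Read", False), ("Infer1", True), ("Infer2", True), ("Write", False)]
budget = allocate_gpu_budget(pipeline, total_gpus=4)
```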
Activity
  • goutamvenkat-anyscale requested a summary of the pull request.
  • gemini-code-assist[bot] provided a suggestion to refactor the new test case by defining the list of operators once to improve readability and avoid repetition.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale changed the title [Data] - Don't allocate GPU budget for non-GPU tasks [Data] - Don't reserve GPU budget for non-GPU tasks Dec 31, 2025
Signed-off-by: Goutam <goutam@anyscale.com>
Member

@bveeramani bveeramani left a comment


LGTM except comment about CPUs

Comment on lines 481 to 485

    # Use infinite limits, but cap GPU to 0 if this operator doesn't use GPUs.
    # This prevents non-GPU operators from hoarding GPU budget.
    max_resource_usage = ExecutionResources.for_limits(
        gpu=None if num_gpus_per_actor else 0
    )
Member


I think this implementation special-cases GPUs because it assumes that all tasks/actors require logical CPUs and memory, but I don't think that assumption holds.

For example, here's a common thing users do:

ds.map_batches(Inference, num_gpus=1, batch_size=...)

In this case, I don't think the Inference actors request any logical CPUs.

For this reason, should we also include CPUs and memory?
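The reviewer's suggestion — cap every resource the operator does not request, not just GPU — could look like this sketch (hypothetical; not the merged implementation, which only special-cases GPU):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceLimits:
    # None means "unbounded", matching ExecutionResources.for_limits().
    cpu: Optional[float] = None
    gpu: Optional[float] = None
    memory: Optional[float] = None


def capped_limits(per_actor_cpu: float, per_actor_gpu: float,
                  per_actor_memory: float) -> ResourceLimits:
    # Cap each resource to 0 when the operator requests none of it,
    # e.g. map_batches(Inference, num_gpus=1) requests no logical CPUs.
    return ResourceLimits(
        cpu=None if per_actor_cpu else 0,
        gpu=None if per_actor_gpu else 0,
        memory=None if per_actor_memory else 0,
    )
```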

Contributor


Also simplify this conditional to be just:

  gpu=0 if num_gpus == 0 else max_actors * num_gpus

Comment on lines 196 to 200

    # Use infinite limits, but cap GPU to 0 if this operator doesn't use GPUs.
    # This prevents non-GPU operators from hoarding GPU budget.
    max_resource_usage = ExecutionResources.for_limits(
        gpu=None if per_task.gpu else 0
    )
Member


Sort of out-of-scope for this PR, but this might cause issues if users (or optimization rules like ConfigureMapTaskMemoryRule) specify ray_remote_args_fn.

For example, if a user does:

ds.map_batches(..., ray_remote_args_fn=lambda: {"num_cpus": 10})

Then the max resource usage will be num_cpus=1, but each task requires 10 CPUs.

An easy fix might be to consider ray_remote_args_fn when returning incremental_resource_usage.
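The suggested fix could be sketched as merging the dynamic remote args over the static defaults before reporting per-task needs (a hypothetical signature, not the actual Ray Data API):

```python
from typing import Callable, Dict, Optional


def incremental_resource_usage(
    static_remote_args: Dict[str, float],
    ray_remote_args_fn: Optional[Callable[[], Dict[str, float]]] = None,
) -> Dict[str, float]:
    """Report per-task resource needs, honoring dynamic remote args."""
    args = dict(static_remote_args)
    if ray_remote_args_fn is not None:
        # Dynamic args (supplied by users or by rules like
        # ConfigureMapTaskMemoryRule) override the static defaults,
        # so a task asking for 10 CPUs is budgeted as 10, not 1.
        args.update(ray_remote_args_fn())
    return args
```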

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: Goutam <goutam@anyscale.com>
@alexeykudinkin alexeykudinkin enabled auto-merge (squash) January 6, 2026 04:05
@alexeykudinkin alexeykudinkin merged commit 9e2de8d into ray-project:master Jan 6, 2026
7 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/no_gpu_budget_for_non_gpu branch January 8, 2026 19:25
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
Signed-off-by: jasonwrwang <jasonwrwang@tencent.com>
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
Signed-off-by: lee1258561 <lee1258561@gmail.com>
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
Signed-off-by: ryanaoleary <ryanaoleary@google.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects