[train][data] Fix iter_torch_batches usage of ray.train.torch.get_device when running outside Ray Train (#57816)
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This pull request introduces a utility _in_ray_train_worker to correctly detect whether code is running within a Ray Train worker, for both V1 and V2. This utility is then used in iter_torch_batches to conditionally call ray.train.torch.get_device, fixing an issue where get_device would fail when used outside a Ray Train context with Train V2. The changes are well-structured, with separate logic for V1 and V2, and are accompanied by good test coverage. My feedback includes a couple of minor suggestions to improve code style and maintainability.
  # Use the appropriate device for Ray Train, or fall back to CPU if
  # Ray Train is not being used.
- device = get_device()
+ device = get_device() if _in_ray_train_worker() else "cpu"
nit: Would it be possible to roll this condition into `get_device`?
I considered this, but our current stance is that we don't want people calling `get_device()` (a Train worker utility) outside of Train workers, since it will error in V2, so I fully gate the call to this method.
…vice` when running outside Ray Train (ray-project#57816)

Train V2 doesn't allow running `ray.train.torch.get_device` outside of a Ray Train worker spawned by a `trainer.fit()` call. Previously, `get_device()` returned the 0th-index GPU of the `ray.get_gpu_ids()` assigned to the current process, or "cpu" if the current process wasn't assigned GPUs via `ray.remote(num_gpus=x)`.

This PR introduces a utility to detect whether we're running inside a Ray Train worker process (in V1 and V2) and updates Ray Data's `iter_torch_batches` to only call `get_device()` inside a Train worker process.

This introduces a slight API breakage for users who spawned a custom GPU Ray task and used `iter_torch_batches` or `get_device()`.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Description
Train V2 doesn't allow running `ray.train.torch.get_device` outside of a Ray Train worker spawned by a `trainer.fit()` call. Previously, `get_device()` returned the 0th-index GPU of the `ray.get_gpu_ids()` assigned to the current process, or "cpu" if the current process wasn't assigned GPUs via `ray.remote(num_gpus=x)`.

This PR introduces a utility to detect whether we're running inside a Ray Train worker process (in V1 and V2) and updates Ray Data's `iter_torch_batches` to only call `get_device()` inside a Train worker process.

This introduces a slight API breakage for users who spawned a custom GPU Ray task and used `iter_torch_batches` or `get_device()`.
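The previous fallback behavior described above can be sketched as pure Python. This is a simplified stand-in, not Ray's actual code: `legacy_get_device` is a hypothetical name, its `gpu_ids` argument plays the role of `ray.get_gpu_ids()` for the current process, and the `"cuda:N"` string simplifies the `torch.device` object the real API returns.

```python
from typing import List


def legacy_get_device(gpu_ids: List[int]) -> str:
    """Old behavior: the 0th GPU id assigned to this process, or "cpu"
    when the process was not assigned any GPUs."""
    # gpu_ids stands in for ray.get_gpu_ids() inside the process.
    return f"cuda:{gpu_ids[0]}" if gpu_ids else "cpu"
```

Under the new gating, a custom GPU Ray task that is not a Train worker no longer reaches this logic at all: `iter_torch_batches` places tensors on "cpu", and the caller must move them to the GPU explicitly.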