[core][train] Ray Train disables blocking get inside async warning #56757

Merged
matthewdeng merged 5 commits into ray-project:master from TimothySeah:tseah/ray-train-disable-async-get-warning
Sep 24, 2025

@TimothySeah TimothySeah commented Sep 19, 2025

Summary

Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running.

However, Ray Train currently calls ray.get several times within the Controller async actor, e.g. when waiting for the placement group to be ready. I tried replacing all of these calls with awaits but decided against it: doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio-compatible) and would require us to handle complex corner cases, such as a controller abort having to clean up an in-progress placement group.

We decided that this was acceptable because making the controller async enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get calls were already blocking before we made the controller async). The one side effect is that Ray Train users now see the warning below, which has caused confusion:

"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."

This PR

  • introduces a new WARN_BLOCKING_GET_INSIDE_ASYNC env var that controls whether the message is logged with logger.warning or logger.debug. It warns by default, so behavior is unchanged for all non-Ray Train use cases.
  • makes Ray Train set this env var to "0" if it is not already set. Users can still override the env var if they want the warning.
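
The two bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only the env var name comes from the description, while `env_bool`, `report_blocking_get`, `library_startup`, and the logger name are hypothetical stand-ins.

```python
import logging
import os

logger = logging.getLogger("blocking_get_demo")  # hypothetical logger name

ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"  # name taken from the PR description


def env_bool(name: str, default: bool) -> bool:
    # Hypothetical parser: illustrates a bool-returning env helper,
    # not Ray's actual implementation.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true")


def report_blocking_get() -> int:
    # Warn by default; drop to debug level when the env var is switched off.
    level = logging.WARNING if env_bool(ENV_VAR, True) else logging.DEBUG
    logger.log(level, "Using blocking ray.get inside async actor.")
    return level


def library_startup() -> None:
    # Stand-in for what Ray Train does: set "0" only if the user
    # has not already chosen a value.
    os.environ.setdefault(ENV_VAR, "0")
```

Using `setdefault` is one way to express "set if not already set": a user who exports the variable as `1` before starting keeps the warning.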

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested review from a team as code owners September 19, 2025 22:39

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new environment variable, WARN_BLOCKING_GET_INSIDE_ASYNC, to control whether a warning is logged when ray.get is used in an async actor. This change is motivated by Ray Train's legitimate use of this pattern. The implementation correctly modifies Ray Train to disable this warning by default, while allowing users to override this setting. The changes are well-structured and include a relevant unit test. My review includes one suggestion to refactor the warning logic for improved code clarity and to avoid a local import.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener ray-gardener bot added the train (Ray Train Related Issue) and core (Issues that should be addressed in Ray Core) labels Sep 20, 2025

@israbbani israbbani left a comment

I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?

If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?

RAY_TRAIN_CALLBACKS_ENV_VAR = "RAY_TRAIN_CALLBACKS"

# Ray Train does not warn by default when using blocking ray.get inside async actor.
DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE = "0"

Why is this "0" instead of False?

@TimothySeah (Contributor Author)

DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC is a bool because it's the default value used by the env_bool function, which returns a bool

DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE is a string because DataParallelTrainer explicitly sets os.environ to it, and os.environ is a dict from strings to strings.

I agree it's confusing though so lmk if there's a cleaner way to do this.
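
To illustrate the type distinction described in this reply, here is a small sketch. The constant names and `try_set` are hypothetical stand-ins, not the identifiers from the diff; the point is only that `os.environ` accepts string values, so the override has to be the string "0" while the parsed default can remain a bool.

```python
import os

# Hypothetical stand-ins for the two constants discussed above.
DEFAULT_WARN = True         # bool: fallback used when the variable is unset
TRAIN_OVERRIDE_VALUE = "0"  # str: value written into os.environ


def try_set(name: str, value) -> bool:
    # os.environ only accepts str values; anything else raises TypeError.
    try:
        os.environ[name] = value
        return True
    except TypeError:
        return False
```

For example, `try_set("DEMO_FLAG", False)` fails, while `try_set("DEMO_FLAG", TRAIN_OVERRIDE_VALUE)` succeeds.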

@TimothySeah

> I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?
>
> If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?

I think it's confusing rather than spammy since some users have asked us what this means. Whether we should warn in the first place was already debatable (#11141 (review)) - maybe @edoakes can share more insight

@israbbani

> > I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?
> > If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?
>
> I think it's confusing rather than spammy since some users have asked us what this means. Whether we should warn in the first place was already debatable (#11141 (review)) - maybe @edoakes can share more insight

I see your point. It's confusing for the user because this isn't happening directly in their code, but inside Ray Train. It sounds like it's a legitimate use of ray.get from Ray Train's point of view as well.

For core though, exposing individual environment variables to turn off specific warnings isn't very maintainable. @edoakes is the warning actually useful? Can we amend it to make it less confusing or remove it completely if it's not useful?

edoakes commented Sep 23, 2025

I think the warning is working as intended here -- you very likely shouldn't be blocking on ray.get in that method if it's used inside of an async actor.

I don't see any async def methods on the class you linked. Is this embedded in another class/outer actor definition that does use asyncio?

TimothySeah commented Sep 23, 2025

> I think the warning is working as intended here -- you very likely shouldn't be blocking on ray.get in that method if it's used inside of an async actor.
>
> I don't see any async def methods on the class you linked. Is this embedded in another class/outer actor definition that does use asyncio?

Ah my bad - the PR description was inaccurate. PTAL at the updated PR description.

edoakes commented Sep 23, 2025

What does "that would be more trouble than it's worth" mean? This is a very clear anti-pattern when using asyncio

@TimothySeah

> What does "that would be more trouble than it's worth" mean? This is a very clear anti-pattern when using asyncio

Updated the PR description again. The tldr is:

  • Controller actor already made a bunch of blocking ray.get calls
  • We turned the controller actor async to enable abortion, which did not introduce any behavior regressions other than the warning.
  • I tried to remove all the ray.get's from the controller but that proved very difficult e.g. it would require us to turn all of our callbacks async and clean up in-progress placement groups.

edoakes commented Sep 23, 2025

I assume you mean cancelation of ongoing operations. If that's what you're trying to do, that makes it even more important to convert the ray.get calls -- otherwise they won't be cancelable...
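
The cancelability point can be illustrated with plain asyncio. In this sketch `time.sleep` stands in for a blocking `ray.get` and `asyncio.sleep` stands in for awaiting an object ref; all function names are hypothetical. A task can only be cancelled at an `await`, so the blocking step always runs to completion.

```python
import asyncio
import time


async def blocking_step(log: list) -> None:
    # Stand-in for a blocking ray.get: no await, so no cancellation point.
    time.sleep(0.1)
    log.append("blocking step finished")


async def awaitable_step(log: list) -> None:
    # Stand-in for awaiting an object ref: cancellation is delivered here.
    await asyncio.sleep(10)
    log.append("awaitable step finished")


async def cancel_shortly_after_start(step) -> list:
    log: list = []
    task = asyncio.create_task(step(log))
    await asyncio.sleep(0)  # give the task one turn on the event loop
    task.cancel()           # no effect if the task has already finished
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log
```

Running `asyncio.run(cancel_shortly_after_start(blocking_step))` shows the blocking step finishing despite the cancel request, while the awaitable variant is interrupted.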

@TimothySeah

> I assume you mean cancelation of ongoing operations. If that's what you're trying to do, that makes it even more important to convert the ray.get calls -- otherwise they won't be cancelable...

Right now, the controller does a bunch of blocking things and asyncio.sleeps for 2 seconds at a time. Abortion can happen during those 2 seconds. Before making the controller async, it couldn't happen at all, so this is a strict improvement over the previous behavior other than the warning. Meanwhile, converting all the ray.get's to awaits would be a long effort (at least a week), require a lot of testing, and may introduce bugs. This PR is a short term solution to stop user confusion from the warning message.

Let me know if you have any suggestions on how to proceed. Thanks!
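
The pattern described in this comment (blocking work followed by an `asyncio.sleep`, with abort handled during the sleep) might look roughly like the sketch below. The `Controller` class here is a hypothetical stand-in with shortened intervals, not Ray Train's controller.

```python
import asyncio
import time


class Controller:
    """Hypothetical stand-in, not Ray Train's controller."""

    def __init__(self, poll_interval_s: float = 0.05) -> None:
        self.poll_interval_s = poll_interval_s
        self.aborted = False
        self.iterations = 0

    async def run(self) -> None:
        while not self.aborted:
            time.sleep(0.01)  # blocking work: abort() cannot run during this
            self.iterations += 1
            # The loop is free here, so a pending abort() call can run.
            await asyncio.sleep(self.poll_interval_s)

    async def abort(self) -> None:
        self.aborted = True


async def demo() -> Controller:
    controller = Controller()
    run_task = asyncio.create_task(controller.run())
    await asyncio.sleep(0.12)  # let a few iterations happen
    await controller.abort()   # processed only while run() is sleeping
    await run_task             # run() exits at its next loop check
    return controller
```

The abort is not immediate: it takes effect at the next loop check after the current sleep, which matches the "during those 2 seconds" behavior described above.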


@edoakes edoakes left a comment

Looks fine as a temporary workaround. I would highly suggest avoiding blocking calls in the asyncio loop in the future.
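
As a rough illustration of why blocking calls in the asyncio loop are discouraged, the sketch below counts how often a background task gets to run while another coroutine waits. `time.sleep` stands in for a blocking `ray.get` and `asyncio.sleep` for an awaited object ref; all names are hypothetical.

```python
import asyncio
import time


async def heartbeat(ticks: list) -> None:
    # Records a tick each time the event loop lets this task run.
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def blocking_wait() -> None:
    # Stand-in for a blocking ray.get: the loop can run nothing else meanwhile.
    time.sleep(0.2)


async def cooperative_wait() -> None:
    # Stand-in for `await object_ref`: control returns to the loop while waiting.
    await asyncio.sleep(0.2)


async def count_ticks_during(wait) -> int:
    ticks: list = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    await wait()
    after = len(ticks)
    task.cancel()
    return after - before
```

With `blocking_wait` the heartbeat records no ticks during the wait; with `cooperative_wait` it keeps ticking.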

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Sep 24, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@matthewdeng matthewdeng merged commit 68b1e8a into ray-project:master Sep 24, 2025
6 checks passed
elliot-barn pushed a commit that referenced this pull request Sep 27, 2025
…56757)

# Summary

Ray Train essentially has three parts: the driver, the controller actor,
and the worker actors. We turned the controller into an async actor so
that users can abort or get reported checkpoints from the controller
while it is running.

However, Ray Train currently calls `ray.get` several times within the
`Controller` async actor e.g. [when waiting for the placement group to
be
ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293).
I tried replacing all of these calls with `awaits` but ultimately
decided against it because doing so would be a large effort (see
#54181 for some examples,
including changing all our callbacks to be asyncio compatible) and
require us to handle complex corner cases like controller abortion
cleaning up an in-progress placement group.

Ultimately we decided that this was fine because it enables the
aforementioned operations (abort and get reported checkpoints) without
introducing any behavior regressions (the ray.get's were already
blocking before we made everything asyncio) other than showing Ray train
users the warning below, which has caused confusion:

```
"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."
```

This PR
* introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles
whether we `logger.warning` or `logger.debug`. This warns by default so
it is a no-op for all non Ray Train use cases.
* Ray Train sets this env var to "0" if it is not already set. Users can
still flip the env var if they want.

# Testing

Unit tests

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
…56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
…ay-project#56757)

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Labels

core (Issues that should be addressed in Ray Core), go (add ONLY when ready to merge, run all tests), train (Ray Train Related Issue)
