[core][train] Ray Train disables blocking get inside async warning by TimothySeah · Pull Request #56757 · ray-project/ray

TimothySeah · 2025-09-19T22:39:19Z

Summary

Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running.

However, Ray Train currently calls ray.get several times within the Controller async actor, e.g. when waiting for the placement group to be ready (https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with awaits but decided against it: doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and would require us to handle complex corner cases, such as controller abortion cleaning up an in-progress placement group.

Ultimately we decided that keeping the blocking calls was fine, because the async controller enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get calls were already blocking before we made everything asyncio), other than showing Ray Train users the warning below, which has caused confusion:

"Using blocking ray.get inside async actor. "
"This blocks the event loop. Please use `await` "
"on object ref with asyncio.gather if you want to "
"yield execution to the event loop instead."

This PR

* introduces a new WARN_BLOCKING_GET_INSIDE_ASYNC env var that toggles whether we logger.warning or logger.debug. This warns by default so it is a no-op for all non Ray Train use cases.
* Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want.
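
The two changes listed above can be sketched roughly as follows. This is a minimal illustration rather than the code in the PR: the `env_bool` helper here is a hypothetical stand-in written for this example, `log_blocking_get` and `disable_warning_unless_user_set` are made-up names, and only the env var name and the `logger.warning` / `logger.debug` toggle come from the description.

```python
import logging
import os

logger = logging.getLogger("blocking_get_demo")

# Env var name taken from the PR description.
ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"


def env_bool(key: str, default: bool) -> bool:
    """Hypothetical helper: parse a boolean env var, falling back to `default`."""
    if key in os.environ:
        return os.environ[key].lower() in ("1", "true")
    return default


def log_blocking_get() -> int:
    """Log the message at WARNING or DEBUG and return the level that was used."""
    # Warn by default, so nothing changes unless the env var is set to a falsy value.
    level = logging.WARNING if env_bool(ENV_VAR, True) else logging.DEBUG
    logger.log(level, "Using blocking ray.get inside async actor.")
    return level


def disable_warning_unless_user_set() -> None:
    """Sketch of the Ray Train side: write "0" only if the user has not set the var."""
    os.environ.setdefault(ENV_VAR, "0")
```

Because the Ray Train side only writes the variable when it is absent, a user who exports `WARN_BLOCKING_GET_INSIDE_ASYNC=1` before starting training would still see the warning.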

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a new environment variable, WARN_BLOCKING_GET_INSIDE_ASYNC, to control whether a warning is logged when ray.get is used in an async actor. This change is motivated by Ray Train's legitimate use of this pattern. The implementation correctly modifies Ray Train to disable this warning by default, while allowing users to override this setting. The changes are well-structured and include a relevant unit test. My review includes one suggestion to refactor the warning logic for improved code clarity and to avoid a local import.

python/ray/_private/worker.py


israbbani

I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?

If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?

python/ray/_private/worker.py

israbbani · 2025-09-22T22:59:08Z

python/ray/train/v2/_internal/constants.py

 RAY_TRAIN_CALLBACKS_ENV_VAR = "RAY_TRAIN_CALLBACKS"

+# Ray Train does not warn by default when using blocking ray.get inside async actor.
+DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE = "0"


Why is this "0" instead of False?

DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC is a bool because it's the default value passed to the env_bool function, which returns a bool.

DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE is a string because DataParallelTrainer explicitly sets os.environ to it, and os.environ is a mapping from strings to strings.

I agree it's confusing though so lmk if there's a cleaner way to do this.
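
To make the bool-versus-string distinction in this thread concrete, here is a small sketch. It assumes an `env_bool`-style reader; `read_flag` and `environ_rejects_bool` are hypothetical names written for this example, and the value `True` for the bool constant is inferred from "warns by default" in the description rather than copied from the diff.

```python
import os

ENV_VAR = "WARN_BLOCKING_GET_INSIDE_ASYNC"

# Bool default used on the reading side (value assumed from "warns by default").
DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC = True
# String default used on the writing side, as in the diff above.
DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE = "0"


def read_flag(key: str, default: bool) -> bool:
    """Hypothetical env_bool-style reader: parse the string if set, else return the bool default."""
    value = os.environ.get(key)
    if value is None:
        return default
    return value.lower() in ("1", "true")


def environ_rejects_bool() -> bool:
    """Return True if os.environ refuses a non-string value."""
    try:
        os.environ[ENV_VAR] = False  # type: ignore[assignment]
    except TypeError:
        # os.environ only accepts str values, hence the separate "0" constant.
        return True
    return False
```

The reader returns a bool whether or not the variable is set, while anything written into `os.environ` has to be a string, which is why the two constants have different types.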

TimothySeah · 2025-09-23T01:37:40Z

I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?

If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?

I think it's confusing rather than spammy since some users have asked us what this means. Whether we should warn in the first place was already debatable (#11141 (review)) - maybe @edoakes can share more insight

israbbani · 2025-09-23T02:24:12Z

I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?
If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?

I think it's confusing rather than spammy since some users have asked us what this means. Whether we should warn in the first place was already debatable (#11141 (review)) - maybe @edoakes can share more insight

I see your point. It's confusing for the user because this isn't happening directly in their code, but inside Ray Train. It sounds like it's a legitimate use of ray.get from Ray Train's point of view as well.

For core though, exposing individual environment variables to turn off specific warnings isn't very maintainable. @edoakes is the warning actually useful? Can we amend it to make it less confusing or remove it completely if it's not useful?

edoakes · 2025-09-23T11:37:54Z

I think the warning is working as intended here -- you very likely shouldn't be blocking on ray.get in that method if it's used inside of an async actor.

I don't see any async def methods on the class you linked. Is this embedded in another class/outer actor definition that does use asyncio?

TimothySeah · 2025-09-23T18:35:36Z

I think the warning is working as intended here -- you very likely shouldn't be blocking on ray.get in that method if it's used inside of an async actor.

I don't see any async def methods on the class you linked. Is this embedded in another class/outer actor definition that does use asyncio?

Ah my bad - the PR description was inaccurate. PTAL at the updated PR description.

edoakes · 2025-09-23T20:11:44Z

What does "that would be more trouble than it's worth" mean? This is a very clear anti-pattern when using asyncio

TimothySeah · 2025-09-23T20:38:25Z

What does "that would be more trouble than it's worth" mean? This is a very clear anti-pattern when using asyncio

Updated the PR description again. The tldr is:

* The controller actor already made a bunch of blocking ray.get calls.
* We turned the controller actor async to enable abortion, which did not introduce any behavior regressions other than the warning.
* I tried to remove all the ray.get calls from the controller, but that proved very difficult, e.g. it would require us to turn all of our callbacks async and clean up in-progress placement groups.

edoakes · 2025-09-23T21:24:42Z

I assume you mean cancelation of ongoing operations. If that's what you're trying to do, that makes it even more important to convert the ray.get calls -- otherwise they won't be cancelable...
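
The cancelability point can be illustrated with plain asyncio, without Ray. In this sketch `time.sleep` stands in for a blocking `ray.get` and `asyncio.sleep` stands in for awaiting an object ref; both substitutions, and the function names, are assumptions made for illustration only.

```python
import asyncio
import time


async def awaiting_wait(events: list) -> None:
    """Waits at an await point, so cancellation can interrupt it right away."""
    try:
        await asyncio.sleep(10)  # stand-in for awaiting an object ref
        events.append("awaiting: finished")
    except asyncio.CancelledError:
        events.append("awaiting: cancelled")
        raise


async def blocking_wait(events: list) -> None:
    """Blocks the event loop, so cancellation is only seen at the next await."""
    try:
        time.sleep(0.05)  # stand-in for a blocking ray.get
        events.append("blocking: finished")
        await asyncio.sleep(0)  # first point where CancelledError can be delivered
    except asyncio.CancelledError:
        events.append("blocking: cancelled")
        raise


async def main() -> list:
    events: list = []

    t1 = asyncio.create_task(awaiting_wait(events))
    await asyncio.sleep(0)  # let t1 start and suspend at its await
    t1.cancel()
    await asyncio.gather(t1, return_exceptions=True)

    t2 = asyncio.create_task(blocking_wait(events))
    await asyncio.sleep(0)  # t2 runs its blocking call to completion here
    t2.cancel()
    await asyncio.gather(t2, return_exceptions=True)
    return events
```

Running `asyncio.run(main())` shows the first task cancelled without ever finishing its wait, while the second completes its blocking call before the cancellation is observed.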

TimothySeah · 2025-09-23T22:10:56Z

I assume you mean cancelation of ongoing operations. If that's what you're trying to do, that makes it even more important to convert the ray.get calls -- otherwise they won't be cancelable...

Right now, the controller does a bunch of blocking things and asyncio.sleeps for 2 seconds at a time. Abortion can happen during those 2 seconds. Before making the controller async, it couldn't happen at all, so this is a strict improvement over the previous behavior other than the warning. Meanwhile, converting all the ray.get's to awaits would be a long effort (at least a week), require a lot of testing, and may introduce bugs. This PR is a short term solution to stop user confusion from the warning message.

Let me know if you have any suggestions on how to proceed. Thanks!
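
The behavior described in this comment (blocking steps separated by `asyncio.sleep`, with abort handled during the sleep) can be sketched in plain asyncio. The `Controller` class below is a toy written for this example, not Ray Train's controller; `time.sleep` stands in for the blocking `ray.get` calls, and the sleep durations are shortened.

```python
import asyncio
import time


class Controller:
    """Toy controller: blocking work, then an awaited sleep between iterations."""

    def __init__(self) -> None:
        self.aborted = False
        self.log: list = []

    async def abort(self) -> None:
        # Can only run while run() is suspended at an await point.
        self.aborted = True
        self.log.append("abort handled")

    async def run(self, steps: int = 3) -> None:
        for i in range(steps):
            if self.aborted:
                self.log.append("stopped")
                return
            time.sleep(0.01)  # stand-in for a blocking ray.get
            self.log.append(f"blocking step {i} done")
            await asyncio.sleep(0.05)  # yield point where abort can be processed


async def main() -> list:
    controller = Controller()
    run_task = asyncio.create_task(controller.run())
    await asyncio.sleep(0)  # run() executes its first blocking step, then sleeps
    await controller.abort()  # handled while run() is in asyncio.sleep
    await run_task
    return controller.log
```

In this sketch the abort is picked up between iterations rather than in the middle of a blocking step, which matches the trade-off described above: abort becomes possible, but a blocking call in progress still runs to completion first.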

edoakes

Looks fine as a temporary workaround. I would highly suggest avoiding blocking calls in the asyncio loop in the future.

python/ray/_private/ray_constants.py

python/ray/_private/worker.py

[core][train] Ray Train disables blocking get inside async warning

f424314

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested review from a team as code owners September 19, 2025 22:39

gemini-code-assist bot reviewed Sep 19, 2025

python/ray/_private/worker.py (outdated, resolved)

address code review comment

28c2fed

Signed-off-by: Timothy Seah <tseah@anyscale.com>

ray-gardener bot added the train (Ray Train Related Issue) and core (Issues that should be addressed in Ray Core) labels Sep 20, 2025

israbbani reviewed Sep 22, 2025

TimothySeah requested a review from israbbani September 23, 2025 01:37

edoakes reviewed Sep 24, 2025

python/ray/_private/ray_constants.py (outdated, resolved)

python/ray/_private/ray_constants.py (outdated, resolved)

python/ray/_private/worker.py (outdated, resolved)

address pr comments

2e12078

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from jjyao as a code owner September 24, 2025 01:52

TimothySeah requested a review from edoakes September 24, 2025 01:54

matthewdeng approved these changes Sep 24, 2025

edoakes added the go (add ONLY when ready to merge, run all tests) label Sep 24, 2025

edoakes approved these changes Sep 24, 2025

TimothySeah added 2 commits September 24, 2025 11:10

address cursor comment

85856c4

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tseah/ray-train-disable-async-get-warning

2304f13

cursor bot reviewed Sep 24, 2025

python/ray/_private/worker.py (resolved)

matthewdeng merged commit 68b1e8a into ray-project:master Sep 24, 2025
6 checks passed
