[core][train] Ray Train disables blocking get inside async warning#56757
Conversation
Signed-off-by: Timothy Seah <tseah@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a new environment variable, WARN_BLOCKING_GET_INSIDE_ASYNC, to control whether a warning is logged when ray.get is used in an async actor. This change is motivated by Ray Train's legitimate use of this pattern. The implementation correctly modifies Ray Train to disable this warning by default, while allowing users to override this setting. The changes are well-structured and include a relevant unit test. My review includes one suggestion to refactor the warning logic for improved code clarity and to avoid a local import.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
israbbani
left a comment
There was a problem hiding this comment.
I'm a little concerned about adding environment variables to remove noisy log lines. Is the log line too spammy or is it confusing?
If it's confusing, can we amend the message to make it less misleading in your case? If it's too spammy, maybe we can reduce the frequency somehow?
| RAY_TRAIN_CALLBACKS_ENV_VAR = "RAY_TRAIN_CALLBACKS" | ||
|
|
||
| # Ray Train does not warn by default when using blocking ray.get inside async actor. | ||
| DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE = "0" |
There was a problem hiding this comment.
Why is this "0" instead of False?
There was a problem hiding this comment.
DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC is a bool because it's the default value used by the env_bool function, which returns a bool
DEFAULT_WARN_BLOCKING_GET_INSIDE_ASYNC_VALUE is a string because DataParallelTrainer explicitly sets os.environ to it, and os.environ is a dict from strings to strings.
I agree it's confusing though so lmk if there's a cleaner way to do this.
I think it's confusing rather than spammy since some users have asked us what this means. Whether we should warn in the first place was already debatable (#11141 (review)) - maybe @edoakes can share more insight |
I see your point. It's confusing for the user because this isn't happening directly in their code, but inside Ray Train. It sounds like it's a legitimate use of For core though, exposing individual environment variables to turn off specific warnings isn't very maintainable. @edoakes is the warning actually useful? Can we amend it to make it less confusing or remove it completely if it's not useful? |
|
I think the warning is working as intended here -- you very likely shouldn't be blocking on I don't see any |
Ah my bad - the PR description was inaccurate. PTAL at the updated PR description. |
|
What does "that would be more trouble than it's worth" mean? This is a very clear anti-pattern when using |
Updated the PR description again. The tldr is:
|
|
I assume you mean cancelation of ongoing operations. If that's what you're trying to do, that makes it even more important to convert the |
Right now, the controller does a bunch of blocking things and Let me know if you have any suggestions on how to proceed. Thanks! |
edoakes
left a comment
There was a problem hiding this comment.
looks fine as temporary workaround. I would highly suggest avoiding making blocking calls in the asyncio loop in the future.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
…isable-async-get-warning
…56757) # Summary Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group. Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion: ``` "Using blocking ray.get inside async actor. " "This blocks the event loop. Please use `await` " "on object ref with asyncio.gather if you want to " "yield execution to the event loop instead." ``` This PR * introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases. * Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…56757) # Summary Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group. Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion: ``` "Using blocking ray.get inside async actor. " "This blocks the event loop. Please use `await` " "on object ref with asyncio.gather if you want to " "yield execution to the event loop instead." ``` This PR * introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases. * Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
…ay-project#56757) # Summary Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see ray-project#54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group. Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion: ``` "Using blocking ray.get inside async actor. " "This blocks the event loop. Please use `await` " "on object ref with asyncio.gather if you want to " "yield execution to the event loop instead." ``` This PR * introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases. * Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ay-project#56757) # Summary Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see ray-project#54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group. Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion: ``` "Using blocking ray.get inside async actor. " "This blocks the event loop. Please use `await` " "on object ref with asyncio.gather if you want to " "yield execution to the event loop instead." ``` This PR * introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases. * Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ay-project#56757) # Summary Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running. However, Ray Train currently calls `ray.get` several times within the `Controller` async actor e.g. [when waiting for the placement group to be ready](https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/worker_group.py#L293). I tried replacing all of these calls with `awaits` but ultimately decided against it because doing so would be a large effort (see ray-project#54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group. Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion: ``` "Using blocking ray.get inside async actor. " "This blocks the event loop. Please use `await` " "on object ref with asyncio.gather if you want to " "yield execution to the event loop instead." ``` This PR * introduces a new `WARN_BLOCKING_GET_INSIDE_ASYNC` env var that toggles whether we `logger.warning` or `logger.debug`. This warns by default so it is a no-op for all non Ray Train use cases. * Ray Train sets this env var to "0" if it is not already set. Users can still flip the env var if they want. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Future-Outlier <eric901201@gmail.com>
Summary
Ray Train essentially has three parts: the driver, the controller actor, and the worker actors. We turned the controller into an async actor so that users can abort or get reported checkpoints from the controller while it is running.
However, Ray Train currently calls
ray.getseveral times within theControllerasync actor e.g. when waiting for the placement group to be ready. I tried replacing all of these calls withawaitsbut ultimately decided against it because doing so would be a large effort (see #54181 for some examples, including changing all our callbacks to be asyncio compatible) and require us to handle complex corner cases like controller abortion cleaning up an in-progress placement group.Ultimately we decided that this was fine because it enables the aforementioned operations (abort and get reported checkpoints) without introducing any behavior regressions (the ray.get's were already blocking before we made everything asyncio) other than showing Ray train users the warning below, which has caused confusion:
This PR
WARN_BLOCKING_GET_INSIDE_ASYNCenv var that toggles whether welogger.warningorlogger.debug. This warns by default so it is a no-op for all non Ray Train use cases.Testing
Unit tests