[Data] Fixing ActorPoolMapOperator to guarantee dispatch of all given inputs#60763
[Data] Fixing ActorPoolMapOperator to guarantee dispatch of all given inputs#60763alexeykudinkin merged 8 commits intomasterfrom
ActorPoolMapOperator to guarantee dispatch of all given inputs#60763Conversation
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request provides a significant and well-executed refactoring of ActorPoolMapOperator to guarantee liveness by aligning its input handling with the StreamingExecutor's protocol. The core change, ensuring input is only accepted when a task can be scheduled immediately via can_add_input(), is sound and addresses a key correctness issue. The refactoring of scheduling logic into _ActorTaskSelector and _ActorPool improves modularity. The test suite has been commendably updated to reflect these changes, including a new comprehensive test for the fixed liveness issue. My review identified a minor bug in a warning condition and an opportunity to clarify an assertion message for better debuggability. Overall, this is a high-quality contribution that enhances the robustness of Ray Data.
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Outdated
Show resolved
Hide resolved
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Show resolved
Hide resolved
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Show resolved
Hide resolved
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
python/ray/data/_internal/execution/operators/actor_pool_map_operator.py
Show resolved
Hide resolved
|
|
||
| - This method should only return `True` when operator is guaranteed | ||
| to be able to launch a task, meaning that subsequent `op.add_input(...)` | ||
| should be able to launch a task. |
There was a problem hiding this comment.
can be handled later. The contract is kind of fragile, because there is no constraints on WHEN the next add_input will be called.
We should
- either make can_dadd_input and add_input atomic
- or introduce some boundaries at which the op's states can change
There was a problem hiding this comment.
So the contract is that:
- Before
add_input,can_add_inputmust be called (which is done when we're selecting operator to dispatch to) - This is enforced through assertions inside
add_inputcallingcan_add_input
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…en inputs (#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…en inputs (#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…en inputs (ray-project#60763) ## Description This change revisits `ActorPoolMapOperator` input handling & scheduling sequence to align it with input handling protocol established in the `StreamingExecutor` -- inputs are only to be submitted when operator is **believed to be ready to handle it**, ie - It has resource budget - When task _could be_ launched* *While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the `op.add_input(...)` calls will eventually trigger task scheduling that will handle **all of the previously provided inputs**. This however is not the case for APMO: 1. APMO can refuse scheduling: for ex, when actors are fully utilized, when actors are restarting, etc 2. When APMO refuses scheduling, it enqueues provided input bundle into its _own internal queue_. However, draining of that queue *could not be guaranteed* with the current execution model. Changes --- To work around these issues and guarantee liveness for `ActorPoolMapOperator` following changes are implemented: ### APMO is aligned with task submission protocol New inputs are submitted to the operator **only** when APMO is able to schedule new task **immediately** (verified t/h `op.can_accept_input()`). If it's not able to schedule the task immediately input is rejected and is kept in the external input queue. ### Revisited _ActorTaskSelector to keep it in sync with the Operator Currently `_ActorTaskScheduler` might depend on an external state. This is problematic for the following reasons: - This state is used to determine actors that we can safely route to - This is used to determine whether APMO can schedule a task to run (see above) - However, if state changes between the check and when `op.add_input(...)` is invoked then handling protocol will be violated. To work this problem around we're snapshotting all external state inside `_ActorTaskScheduler.refresh_state(...)` method: - State is snapshotted (refreshed periodically) - This way state is synchronized with the other Operator's state - This makes it impossible for `can_schedule_task` and `select_actors` to get out of sync ## Related issues > Link related issues: "Fixes ray-project#1234", "Closes ray-project#1234", or "Related to ray-project#1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Alexey Kudinkin <ak@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
This change revisits
ActorPoolMapOperatorinput handling & scheduling sequence to align it with input handling protocol established in theStreamingExecutor-- inputs are only to be submitted when operator is believed to be ready to handle it, ie*While operator might not immediately launch the task due to "bundling" multiple inputs together, it's still expected that one of the
op.add_input(...)calls will eventually trigger task scheduling that will handle all of the previously provided inputs.This however is not the case for APMO:
Changes
To work around these issues and guarantee liveness for
ActorPoolMapOperatorfollowing changes are implemented:APMO is aligned with task submission protocol
New inputs are submitted to the operator only when APMO is able to schedule new task immediately (verified t/h
op.can_accept_input()).If it's not able to schedule the task immediately input is rejected and is kept in the external input queue.
Revisited _ActorTaskSelector to keep it in sync with the Operator
Currently
_ActorTaskSchedulermight depend on an external state. This is problematic for the following reasons:op.add_input(...)is invoked then handling protocol will be violated.To work this problem around we're snapshotting all external state inside
_ActorTaskScheduler.refresh_state(...)method:can_schedule_taskandselect_actorsto get out of syncRelated issues
Additional information