[Data] Fixing `ActorPoolMapOperator` to guarantee dispatch of all given inputs by alexeykudinkin · Pull Request #60763 · ray-project/ray

alexeykudinkin · 2026-02-05T02:47:03Z

Description

This change revisits the ActorPoolMapOperator (APMO) input handling and scheduling sequence to align it with the input handling protocol established in the StreamingExecutor: inputs are submitted only when the operator is believed to be ready to handle them, i.e.

- It has resource budget
- A task could be launched*

*While the operator might not launch the task immediately because it "bundles" multiple inputs together, it is still expected that one of the op.add_input(...) calls will eventually trigger task scheduling that handles all of the previously provided inputs.

This, however, is not the case for APMO:

1. APMO can refuse scheduling, e.g. when actors are fully utilized, when actors are restarting, etc.
2. When APMO refuses scheduling, it enqueues the provided input bundle into its own internal queue. However, draining of that queue cannot be guaranteed under the current execution model.

Changes

To work around these issues and guarantee liveness for ActorPoolMapOperator, the following changes are implemented:

APMO is aligned with task submission protocol

New inputs are submitted to the operator only when APMO is able to schedule a new task immediately (verified through op.can_accept_input()).

If it cannot schedule the task immediately, the input is rejected and kept in the external input queue.
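The gating described above can be sketched with a toy model. This is a minimal illustration, not Ray's implementation: `ToyOperator`, `free_slots`, `on_task_done`, and `dispatch` are hypothetical names, and the readiness check is called `can_add_input` here, following the name used in the review discussion (the description refers to it as `op.can_accept_input()`).

```python
from collections import deque


class ToyOperator:
    """Hypothetical stand-in for an operator; not Ray's ActorPoolMapOperator."""

    def __init__(self, free_slots: int):
        self.free_slots = free_slots  # assumed capacity model: one slot per task
        self.launched = []

    def can_add_input(self) -> bool:
        # True only if a task could be launched right now.
        return self.free_slots > 0

    def add_input(self, bundle) -> None:
        # Accepting an input launches a task immediately in this toy model.
        self.free_slots -= 1
        self.launched.append(bundle)

    def on_task_done(self) -> None:
        # A finished task frees a slot.
        self.free_slots += 1


def dispatch(op: ToyOperator, external_queue: deque) -> int:
    """Feed the operator while it reports readiness; return how many were sent."""
    dispatched = 0
    while external_queue and op.can_add_input():
        op.add_input(external_queue.popleft())
        dispatched += 1
    return dispatched


op = ToyOperator(free_slots=2)
queue = deque(["b0", "b1", "b2"])

first = dispatch(op, queue)   # "b2" stays in the external queue
op.on_task_done()             # capacity frees up
second = dispatch(op, queue)  # the remaining input is now dispatched
```

Because rejected inputs never leave the external queue, nothing is stranded inside the operator: the same loop picks them up once capacity returns.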

Revisited _ActorTaskSelector to keep it in sync with the Operator

Currently _ActorTaskSelector might depend on external state. This is problematic for the following reasons:

- This state is used to determine the actors that we can safely route to
- It is also used to determine whether APMO can schedule a task to run (see above)
- However, if the state changes between the check and the moment op.add_input(...) is invoked, the handling protocol will be violated.

To work around this problem, we snapshot all external state inside the _ActorTaskSelector.refresh_state(...) method:

- State is snapshotted (refreshed periodically)
- This way the state is synchronized with the rest of the Operator's state
- This makes it impossible for can_schedule_task and select_actors to get out of sync
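A minimal sketch of the snapshotting idea follows. It is an assumption-laden toy, not the code from this PR: `ActorState`, `ToyActorSelector`, the least-loaded selection rule, and the singular `select_actor` are all hypothetical. Only the method names `refresh_state` and `can_schedule_task` are taken from the description. The point it illustrates is that both the check and the selection read the same copied state, so an external change between the two calls cannot make them disagree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ActorState:
    in_flight: int
    restarting: bool


class ToyActorSelector:
    """Hypothetical selector that answers only from its last snapshot."""

    def __init__(self, max_tasks_in_flight: int):
        self._max = max_tasks_in_flight
        self._snapshot: Dict[str, ActorState] = {}

    def refresh_state(self, live_state: Dict[str, ActorState]) -> None:
        # Copy the external state so later external mutations are not visible.
        self._snapshot = {
            name: ActorState(s.in_flight, s.restarting)
            for name, s in live_state.items()
        }

    def _eligible(self) -> List[str]:
        return [
            name
            for name, s in self._snapshot.items()
            if not s.restarting and s.in_flight < self._max
        ]

    def can_schedule_task(self) -> bool:
        return bool(self._eligible())

    def select_actor(self) -> Optional[str]:
        eligible = self._eligible()
        if not eligible:
            return None
        # Pick the least-loaded actor and account for the new task locally.
        chosen = min(eligible, key=lambda name: self._snapshot[name].in_flight)
        self._snapshot[chosen].in_flight += 1
        return chosen


live = {"a": ActorState(0, False), "b": ActorState(2, False)}
selector = ToyActorSelector(max_tasks_in_flight=2)
selector.refresh_state(live)

live["a"].restarting = True  # external change after the snapshot was taken

can_before = selector.can_schedule_task()  # still answers from the snapshot
chosen = selector.select_actor()           # agrees with the check above

selector.refresh_state(live)               # the change is picked up only here
can_after = selector.can_schedule_task()
```

The trade-off in this sketch is staleness: the selector may route to an actor whose state has already changed, so the refresh has to happen at well-defined points in the scheduling loop.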


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

gemini-code-assist

Code Review

This pull request provides a significant and well-executed refactoring of ActorPoolMapOperator to guarantee liveness by aligning its input handling with the StreamingExecutor's protocol. The core change, ensuring input is only accepted when a task can be scheduled immediately via can_add_input(), is sound and addresses a key correctness issue. The refactoring of scheduling logic into _ActorTaskSelector and _ActorPool improves modularity. The test suite has been commendably updated to reflect these changes, including a new comprehensive test for the fixed liveness issue. My review identified a minor bug in a warning condition and an opportunity to clarify an assertion message for better debuggability. Overall, this is a high-quality contribution that enhances the robustness of Ray Data.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


raulchen · 2026-02-05T04:24:43Z

python/ray/data/_internal/execution/operators/actor_pool_map_operator.py

+
+            - This method should only return `True` when operator is guaranteed
+            to be able to launch a task, meaning that subsequent `op.add_input(...)`
+            should be able to launch a task.


can be handled later. The contract is kind of fragile, because there are no constraints on WHEN the next add_input will be called.
We should

- either make can_add_input and add_input atomic
- or introduce some boundaries at which the op's states can change

So the contract is that:

- Before add_input, can_add_input must be called (which is done when we're selecting the operator to dispatch to)
- This is enforced through assertions inside add_input that call can_add_input
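The enforcement mentioned in the reply can be illustrated roughly as below. This is a sketch under stated assumptions rather than the operator's actual code: `ToyGuardedOperator` and its capacity model are hypothetical, and only the pairing of `can_add_input` with an assertion inside `add_input` comes from the discussion.

```python
class ToyGuardedOperator:
    """Hypothetical operator that asserts the readiness contract on entry."""

    def __init__(self, capacity: int):
        self._capacity = capacity
        self._running = []

    def can_add_input(self) -> bool:
        return len(self._running) < self._capacity

    def add_input(self, bundle) -> None:
        # Callers must have checked readiness; fail loudly if they did not.
        assert self.can_add_input(), (
            "add_input() called while the operator cannot launch a task"
        )
        self._running.append(bundle)


guarded = ToyGuardedOperator(capacity=1)
guarded.add_input("b0")

try:
    guarded.add_input("b1")  # violates the contract: no capacity left
    rejected = False
except AssertionError:
    rejected = True
```

Note that plain `assert` statements are skipped when Python runs with `-O`, so a guard like this catches protocol violations during development and testing rather than acting as a hard runtime check.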

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>


alexeykudinkin added 4 commits February 4, 2026 18:18

Fixed APMO to follow input handling protocol

1025683

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added tests

f63acd3

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Fixed tests

1b84b64

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

lint

ae12c2c

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin requested a review from a team as a code owner February 5, 2026 02:47

gemini-code-assist bot reviewed Feb 5, 2026


Fixed condition

d9ddb0b

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin added the `go` label (add ONLY when ready to merge, run all tests) Feb 5, 2026

cursor bot reviewed Feb 5, 2026


Updating py-doc

1613f5d

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

cursor bot reviewed Feb 5, 2026


raulchen approved these changes Feb 5, 2026


alexeykudinkin added 2 commits February 4, 2026 22:07

Fixed run_op_tasks_sync and run_one_op_task utils

bca0be8

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

Added more tests

366d482

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>

alexeykudinkin enabled auto-merge (squash) February 5, 2026 06:14

alexeykudinkin merged commit 50c715e into master Feb 5, 2026
7 checks passed

alexeykudinkin deleted the ak/apmo-lvns-fix branch February 5, 2026 07:16

alexeykudinkin merged 8 commits into master from ak/apmo-lvns-fix

2 participants
