[Serve] Fix replicas hanging forever when requests are stuck draining in direct ingress mode#60754
[Serve] Fix replicas hanging forever when requests are stuck draining in direct ingress mode#60754abrarsheikh merged 7 commits intomasterfrom
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where replicas in direct ingress mode could hang indefinitely if requests were stuck during the draining process. The fix ensures that replicas are now forcefully killed after a timeout, which is calculated as the maximum of the deployment's graceful_shutdown_timeout_s and a new minimum draining period constant for direct ingress. This change moves the timeout enforcement logic to the controller, simplifying the replica's shutdown process. Additionally, the RAY_SERVE_DISABLE_SHUTTING_DOWN_INGRESS_REPLICAS_FORCEFULLY feature flag has been removed, which further simplifies the code and makes the force-kill behavior more robust. The changes are well-structured and improve the overall code quality.
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
good point, for now I am going to add a comment in code since DI is not public |
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
eicherseiji
left a comment
There was a problem hiding this comment.
My understanding of this PR: Makes force_stop unconditional and updates _shutdown_deadline to enforce RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S within the check_and_update_replicas's check_stopped function to guarantee replicas are stopped, even while draining.
… in direct ingress mode (#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: tiennguyentony <46289799+tiennguyentony@users.noreply.github.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com>
… in direct ingress mode (#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
… in direct ingress mode (#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Muhammad Saif <2024BBIT200@student.Uet.edu.pk>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
… in direct ingress mode (ray-project#60754) - In direct ingress mode, replicas waiting for requests to drain were never force-killed, causing them to hang indefinitely if requests got stuck - Replicas are now force-killed after `max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)` ## Before | Scenario | Behavior | |----------|----------| | Requests drain normally | Wait forever for 30s min drain period | | Requests stuck | Hang forever | ## After | Scenario | Behavior | |----------|----------| | Requests drain normally | Force-kill after `max(timeout, 30s)` | | Requests stuck | Force-kill after `max(timeout, 30s)` | ## Test Plan - [x] `python/ray/serve/tests/unit/test_deployment_state.py` - [x] `python/ray/serve/tests/test_direct_ingress.py` <!-- BUGBOT_STATUS --><sup><a href="https://cursor.com/dashboard?tab=bugbot">Cursor Bugbot</a> reviewed your changes and found no issues for commit <u>6460607</u></sup><!-- /BUGBOT_STATUS --> --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
max(graceful_shutdown_timeout_s, RAY_SERVE_DIRECT_INGRESS_MIN_DRAINING_PERIOD_S)Before
After
max(timeout, 30s)max(timeout, 30s)Test Plan
-
-
Cursor Bugbot reviewed your changes and found no issues for commit 6460607python/ray/serve/tests/unit/test_deployment_state.pypython/ray/serve/tests/test_direct_ingress.py