[Serve][2/N] Add deployment-level autoscaling snapshot and event summarizer #56225
abrarsheikh merged 98 commits into ray-project:master
Conversation
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Code Review
This pull request introduces valuable observability features for autoscaling in Ray Serve by adding structured JSON logs for autoscaling snapshots. The implementation is solid, with a new ServeEventSummarizer to handle log formatting and throttling, and new methods in AutoscalingState to provide the necessary data.
My review includes a few suggestions for improvement:
- A high-severity issue where a hardcoded policy name is used in `ScalingDecision` objects, which should be corrected to use the dynamically determined policy name.
- A medium-severity issue in the logging utility where missing timestamps are replaced with the current time, which could be misleading.
- A medium-severity suggestion to refactor duplicated logic for accessing configuration values to improve code maintainability.
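The throttling that the summarizer performs can be pictured as a small change-or-interval gate: a payload is emitted only when its content changes or a minimum interval has elapsed. The following is an illustrative sketch only; the class and field names are hypothetical and not the PR's actual API:

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class ThrottledSummarizer:
    """Sketch of log throttling: emit only on change or after interval_s."""

    interval_s: float = 30.0
    # key -> (last emitted signature, monotonic timestamp of last emit)
    _last_emit: Dict[Any, Tuple[Any, float]] = field(default_factory=dict)

    def should_emit(self, key: Any, signature: Any) -> bool:
        now = time.monotonic()
        prev = self._last_emit.get(key)
        if prev is not None:
            prev_sig, prev_ts = prev
            # Suppress duplicates that arrive before the interval elapses.
            if prev_sig == signature and now - prev_ts < self.interval_s:
                return False
        self._last_emit[key] = (signature, now)
        return True
```

With a per-deployment key, an unchanged snapshot is logged at most once per interval, while any change is logged immediately.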
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
abrarsheikh left a comment
My main feedback on this PR is that we are creating many intermediate free-form dictionaries, and it is not clear to me why we need them all. More importantly, they create future ambiguity about what each dictionary is supposed to contain, which makes the code harder to maintain. The code could be reorganized to use typed objects for functions that need to return large dictionaries.
…except and unused func(note_once_per_interval) Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
…er, and add constant Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
…remove unnecessary getattr Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Thanks for the contribution @nadongjun! Have you thought about how this would change/work with application-level autoscaling, which is in flight (#56149)? When application-level autoscaling is enabled, a deployment does not autoscale by itself, so that may change how users should interpret the logs.
As feedback for the PR, I would recommend packaging the various autoscaling-relevant values into objects and passing those objects around. It's somewhat difficult to track all the different variables and where they come from, which makes the code a bit harder to parse.
- Rename get_observability_snapshot → get_snapshot for clarity
- Rename proposed_replicas → target_replicas across snapshot flow
- Return last_metrics_age_s=None when no metrics; map to "unknown" in summarizer
- Flatten replicas_allowed{min,max} into top-level min, max in snapshot payload
- Move look_back_period_s to top-level for consistency
- Rename DecisionSummary → AutoscalingDecisionSummary for clarity
- Replace tuple-based SnapshotSignature with typed dataclass
- Use DeploymentID directly as dedupe key instead of (app_name, dep_name)
- Inline snapshot computation in controller; remove _compute_snapshot_inputs
- Push scaling_status formatting into log_deployment_snapshot
- Update tests to validate new payload shape (min/max, no replicas_allowed)
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
- Standardize payload to return 'timestamp_s' for snapshots.
- Return metrics health as last_metrics_age_s.
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@abrarsheikh @akyang-anyscale Thanks for the detailed review! @akyang-anyscale That’s a fair point. The serve_autoscaling_snapshot log format currently only covers deployment-level autoscaling. Once application-level autoscaling is added, we’ll log deployment-level and application-level snapshots separately. I’ve already switched to typed dataclasses (e.g., DeploymentSnapshot, AutoscalingDecisionSummary) so the controller passes structured objects instead of dicts. I’ll do the same for application-level autoscaling to keep things consistent.
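The move from free-form dicts to typed objects might look roughly like the sketch below. The field set is inferred from the example payload in the PR description; the exact names (e.g. `min_replicas` vs. a flattened `min`) and the serialization helper are assumptions, not the final code:

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class AutoscalingDecisionSummary:
    """One recent scaling decision, summarized for the snapshot log."""

    timestamp_s: float
    from_replicas: int
    to_replicas: int
    reason: str


@dataclass(frozen=True)
class DeploymentSnapshot:
    """Typed snapshot passed around instead of a free-form dict."""

    app: str
    deployment: str
    current_replicas: int
    target_replicas: int
    min_replicas: int
    max_replicas: int
    policy: str
    look_back_period_s: float
    queued_requests: float
    total_requests: float
    last_metrics_age_s: Optional[float]  # None when no metrics have arrived yet
    decisions: List[AutoscalingDecisionSummary] = field(default_factory=list)

    def to_payload(self) -> Dict[str, Any]:
        payload = asdict(self)
        # Map a missing metrics age to "unknown" at log time, per the review.
        if payload["last_metrics_age_s"] is None:
            payload["last_metrics_age_s"] = "unknown"
        return payload
```

A typed object like this documents exactly what each snapshot contains and lets type checkers catch mismatched fields, addressing the free-form-dict concern above.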
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
return total_requests

def get_deployment_snapshot(self, curr_target_num_replicas: int) -> Dict[str, Any]:
get_deployment_snapshot is an expensive operation to perform on every control-loop iteration, because it calls get_total_num_requests, which loops over replicas and handles. These are expensive operations for a large cluster. Second, it calls self.get_decision_num_replicas, which internally executes the autoscaling policy, which can also be expensive.
I suggest instead constructing the DeploymentAutoscalingSnapshot object every time get_decision_num_replicas runs and storing it on the class object. Then get_deployment_snapshot simply returns the cached DeploymentAutoscalingSnapshot object.
Good call, I’ve applied this. Now the snapshot is constructed once during get_decision_num_replicas() and cached, and get_deployment_snapshot() just returns the cached object.
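The cache-on-decision pattern suggested above can be sketched as follows. This is a toy stand-in for the real class, with a plain dict in place of DeploymentAutoscalingSnapshot and a placeholder policy:

```python
from typing import Any, Dict, Optional


class DeploymentAutoscalingState:
    """Minimal sketch: build the snapshot during the decision, cache it."""

    def __init__(self) -> None:
        self._cached_snapshot: Optional[Dict[str, Any]] = None

    def get_decision_num_replicas(self, curr_target: int) -> int:
        # The expensive work (metrics aggregation, policy evaluation)
        # happens here anyway, so the snapshot is built while the
        # inputs are already in hand.
        decision = self._run_policy(curr_target)
        self._cached_snapshot = {"current": curr_target, "target": decision}
        return decision

    def get_deployment_snapshot(self) -> Optional[Dict[str, Any]]:
        # Cheap: just return whatever the last decision cached.
        return self._cached_snapshot

    def _run_policy(self, curr_target: int) -> int:
        return curr_target  # placeholder for the real autoscaling policy
```

The snapshot getter becomes O(1) per control-loop tick, and it returns None until the first decision has run, which callers must handle.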
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
cc @abrarsheikh
ongoing_requests=float(ctx.total_num_requests),
metrics_health=metrics_health,
errors=errors,
decisions=decisions_summary,
why do we need decisions inside DeploymentSnapshot?
self._autoscaling_logger.info(
    "", extra={"type": "deployment", "snapshot": payload}
)
payload is already JSON-serializable because of model_dump, and type should be part of the deployment_snapshot object in my opinion.
The extra argument to logger.info is used in a non-traditional way here, IMO.
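For context on why this use of `extra` is non-traditional: `extra` attaches attributes to the LogRecord, and a custom formatter must know to pick them up and serialize them; the message string itself is left empty. A minimal standalone illustration (the formatter name and the `serve_autoscaling_snapshot` prefix mirror the PR's log format, but this is not the PR's actual logging setup):

```python
import json
import logging


class SnapshotJSONFormatter(logging.Formatter):
    """Render the 'snapshot' dict attached via `extra` as one JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        # `extra={"snapshot": ...}` becomes an attribute on the record.
        snapshot = getattr(record, "snapshot", None)
        return "serve_autoscaling_snapshot " + json.dumps(snapshot)


handler = logging.StreamHandler()
handler.setFormatter(SnapshotJSONFormatter())
logger = logging.getLogger("serve_autoscaling_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The message argument is empty; all content rides in `extra`.
logger.info("", extra={"snapshot": {"app": "default", "deployment": "worker"}})
```

This coupling between the call site and the formatter is the reviewer's point: without the matching formatter, the log line would render as an empty message.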
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com> Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Bug: App-level policies bypass snapshot creation entirely
When applications use app-level autoscaling policies (has_policy() returns True), the ApplicationAutoscalingState.get_decision_num_replicas method calls apply_bounds() directly and returns without ever invoking DeploymentAutoscalingState.get_decision_num_replicas(). The new snapshot creation logic (recording to _decision_history and populating _cached_deployment_snapshot) exists only in the deployment-level method. As a result, deployments under app-level policies will always have _cached_deployment_snapshot remain None, and get_deployment_snapshot() will return None. The controller's _emit_deployment_autoscaling_snapshots silently skips these deployments, making the new observability feature completely non-functional for app-level policy configurations.
See python/ray/serve/_private/autoscaling_state.py, lines 877 to 887 at commit 8835412.
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
for (
    app_name,
    dep_name,
    details,
    autoscaling_config,
) in self._autoscaling_enabled_deployments_cache:
We should batch-write all deployments at once; this can be slow for applications with thousands of deployments.
I updated the controller to batch autoscaling snapshot logs into a single write per loop, instead of writing once per deployment.
However, in extreme cases where an application has thousands of deployments, writing one huge payload at once could be slow. Should we add a CHUNK_SIZE to emit snapshots in chunks of N to handle this case?
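The chunked-emission idea could be sketched like this; the function name, the `chunk_size` default, and the `snapshots` payload key are hypothetical:

```python
from typing import Any, List


def emit_snapshots_in_chunks(
    logger: Any, snapshots: List[dict], chunk_size: int = 100
) -> None:
    """Batch snapshot payloads into at most ceil(n / chunk_size) log writes,
    bounding the size of any single write for very large applications."""
    for i in range(0, len(snapshots), chunk_size):
        chunk = snapshots[i : i + chunk_size]
        logger.info("", extra={"snapshots": chunk})
```

This keeps the per-loop write count at one for typical applications while capping payload size in the thousands-of-deployments case.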
…init Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Bug: App-level policies skip deployment snapshot creation
When using an app-level autoscaling policy (has_policy() returns True), the code path in ApplicationAutoscalingState.get_decision_num_replicas (lines 842-876) directly calls the app-level policy and returns decisions without calling DeploymentAutoscalingState.get_decision_num_replicas(). The _cached_deployment_snapshot is only populated inside DeploymentAutoscalingState.get_decision_num_replicas() (lines 265-268), which is only called when using deployment-level policies (line 880). As a result, get_deployment_snapshot() returns None for deployments using app-level policies, causing _emit_deployment_autoscaling_snapshots to silently skip these deployments without logging any snapshot data.
See python/ray/serve/_private/autoscaling_state.py, lines 841 to 876 and lines 262 to 268 at commit 6e1105a.
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
tests are failing
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Fixed the failing tests!
…arizer (#56225)

## Why are these changes needed?

This PR introduces deployment-level autoscaling observability in Serve. The controller now emits a single, structured JSON log line (serve_autoscaling_snapshot) per autoscaling-enabled deployment each control-loop tick. This avoids recomputation in the controller call sites and provides a stable, machine-parsable surface for tooling and debugging.

#### Changed

- Add get_observability_snapshot in AutoscalingState and a manager wrapper to generate compact snapshots (replica counts, queued/total requests, metric freshness).
- Add ServeEventSummarizer to build payloads, reduce duplicate logs, and summarize recent scaling decisions.

#### Example log (single line)

Logs can be found in controller log files, e.g. `/tmp/ray/session_2025-09-03_21-12-01_095657_13385/logs/serve/controller_13474.log`.

```
serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z","app":"default","deployment":"worker","current_replicas":2,"target_replicas":2,"replicas_allowed":{"min":1,"max":8},"scaling_status":"stable","policy":"default","metrics":{"look_back_period_s":10.0,"queued_requests":0.0,"total_requests":0.0},"metrics_health":"ok","errors":[],"decisions":[{"ts":"2025-09-04T06:12:11Z","from":0,"to":2,"reason":"current=0, proposed=2"},{"ts":"2025-09-04T06:12:11Z","from":2,"to":2,"reason":"current=2, proposed=2"}]}
```

#### Follow-ups

- Expose the same snapshot data via `serve status -v` and CLI/SDK surfaces.
- Aggregate per-app snapshots and external scaler history.

## Related issue number

#55834

## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>