[Data] Info log cluster scale up decisions by bveeramani · Pull Request #60357 · ray-project/ray

bveeramani · 2026-01-21T05:07:59Z

> I think there is very limited visibility into the autoscaling decisions; currently one has to look through DEBUG logs, FWICT. Adding some visibility in terms of metrics and events would be nice, and promoting key action logs to INFO would be my high-level suggestion.

This PR logs a message to STDOUT whenever the autoscaler decides to scale up the cluster.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
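For readers unfamiliar with the pattern, the following is a minimal sketch of logging a scale-up decision at INFO level with Python's standard `logging` module. The guard mirrors the `resource_request != active_bundles` check in the diff further down in this conversation; the logger name, the function name, and the message wording are hypothetical and are not taken from the PR.

```python
import logging

# Hypothetical logger name; the real module would likely use its own logger.
logger = logging.getLogger("cluster_autoscaler_sketch")


def maybe_log_scale_up(active_bundles, resource_request):
    """Log at INFO only when the request differs from what is already active.

    Returns True if a scale-up message was logged, False otherwise.
    """
    if resource_request != active_bundles:
        logger.info(
            "Scaling up cluster: %d active bundle(s) -> %d requested bundle(s)",
            len(active_bundles),
            len(resource_request),
        )
        return True
    return False
```

With the default logging configuration the message is not shown; calling `logging.basicConfig(level=logging.INFO)` first makes it visible.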

gemini-code-assist · 2026-01-21T05:08:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

owenowenisme · 2026-01-21T07:04:01Z

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py

            active_bundles, pending_bundles, self._resource_limits
        )

+        if resource_request != active_bundles:


Can we unify the naming? I'm confused by the bundle vs. request naming.

Discussed offline -- will address in follow-up

bveeramani · 2026-01-21T05:08:25Z

python/ray/data/_internal/cluster_autoscaler/__init__.py

-            resource_limits=resource_limits,
            execution_id=execution_id,
+            resource_limits=resource_limits,


This is part of a drive-by change to make resource_limits an optional parameter so there's less boilerplate for tests
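A minimal sketch of what making such a parameter optional can look like. The function name, the return value, and the choice of "infinite" defaults are assumptions for illustration only and do not reflect Ray's actual signature. In a Python signature, a parameter with a default must follow parameters without one, which is one plausible reason for moving `resource_limits` after `execution_id`.

```python
from typing import Dict, Optional


def create_cluster_autoscaler_sketch(
    execution_id: str,
    resource_limits: Optional[Dict[str, float]] = None,
) -> Dict[str, object]:
    """Hypothetical factory; not Ray's actual API."""
    if resource_limits is None:
        # Assumption: "no limit" is modeled as infinity for each resource.
        resource_limits = {"CPU": float("inf"), "GPU": float("inf")}
    return {"execution_id": execution_id, "resource_limits": resource_limits}
```

Tests can then call `create_cluster_autoscaler_sketch("exec-1")` without building a limits object first.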

machichima · 2026-01-21T21:58:22Z

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py

+            f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
+            f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
+            f"object_store_memory={current_utilization.object_store_memory:.0%}. "
+            "Requesting one node of each shape:"


I think the number of nodes is controlled by self._cluster_scaling_up_delta? Should we also pass delta_count = int(math.ceil(self._cluster_scaling_up_delta)) into this function and print it here?

Good catch. That was an oversight
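The suggestion above can be sketched as a standalone helper. The format specifiers and the `int(math.ceil(...))` expression come from the diff and the review comment; the function name, its parameters, and the opening words of the message are hypothetical, since the start of the real message is not shown in the diff.

```python
import math


def format_scale_up_header(
    threshold: float,
    cpu: float,
    gpu: float,
    object_store_memory: float,
    scaling_up_delta: float,
) -> str:
    """Build the header line, reporting the rounded-up node delta."""
    # Round a fractional delta up to a whole number of nodes.
    delta_count = int(math.ceil(scaling_up_delta))
    return (
        # Assumed opening words; only the text from "specified threshold" on
        # appears in the diff.
        f"Utilization exceeds the specified threshold of {threshold:.0%}: "
        f"CPU={cpu:.0%}, GPU={gpu:.0%}, "
        f"object_store_memory={object_store_memory:.0%}. "
        f"Requesting {delta_count} node(s) of each shape:"
    )
```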

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-01-28T23:32:18Z

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py

+            f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
+            f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
+            f"object_store_memory={current_utilization.object_store_memory:.0%}. "
+            f"Requesting {self._cluster_scaling_up_delta} node(s) of each shape:"


Log message claims wrong number of nodes after capping

Low Severity

The log message states "Requesting {_cluster_scaling_up_delta} node(s) of each shape" but this value reflects the configured delta before resource limit capping. After cap_resource_request_to_limits is applied, the actual number of nodes added per shape may be less than _cluster_scaling_up_delta. The detailed counts (e.g., "[{...}: 1 -> 1]") show the correct values, but the header text creates a confusing mismatch.
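One way to avoid the mismatch described above is to derive every reported number from the request after capping, rather than from the configured delta. The sketch below assumes bundles are plain dictionaries of resource amounts and that capping has already been applied by the caller; all names are hypothetical, and this is not the fix that was merged.

```python
from collections import Counter
from typing import Dict, List, Tuple

Bundle = Dict[str, float]
Shape = Tuple[Tuple[str, float], ...]


def _shape(bundle: Bundle) -> Shape:
    """Hashable, order-independent key for a bundle."""
    return tuple(sorted(bundle.items()))


def count_changes(
    active: List[Bundle], capped_request: List[Bundle]
) -> Dict[Shape, Tuple[int, int]]:
    """Map each requested shape to (count before, count after capping)."""
    before = Counter(_shape(b) for b in active)
    after = Counter(_shape(b) for b in capped_request)
    return {shape: (before[shape], after[shape]) for shape in after}


def format_changes(changes: Dict[Shape, Tuple[int, int]]) -> str:
    """Render the per-shape counts in the '[{...}: old -> new]' style."""
    parts = [f"{dict(shape)}: {old} -> {new}" for shape, (old, new) in changes.items()]
    return "[" + ", ".join(parts) + "]"
```

If capping leaves a shape unchanged, the rendered counts show the same number on both sides, so a header built from these counts cannot claim more nodes than were actually requested.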

bveeramani and others added 7 commits January 20, 2026 16:39

Initial commit

7ab22ad

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Initial commit

aaaaea4

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Merge branch 'master' into reorganize-constructor

1d690f7

Merge branch 'reorganize-constructor' into stop-throttle-utilization-…

1129595

…checks

Address review comments

80ad3d0

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Initial commit

2d02e43

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Refactor

f9e691f

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani requested a review from a team as a code owner January 21, 2026 05:08

Base automatically changed from stop-throttle-utilization-checks to reorganize-constructor January 21, 2026 05:22

owenowenisme reviewed Jan 21, 2026

ray-gardener bot added the core, data, and observability labels Jan 21, 2026

Base automatically changed from reorganize-constructor to master January 21, 2026 07:28

Merge branch 'master' into log-info-message

4797767

bveeramani commented Jan 21, 2026

Update python/ray/data/_internal/cluster_autoscaler/default_cluster_a…

0ad88a6

…utoscaler_v2.py Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

cursor bot reviewed Jan 21, 2026

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py

marwan116 reviewed Jan 21, 2026

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py Outdated

marwan116 reviewed Jan 21, 2026

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py Outdated

marwan116 reviewed Jan 21, 2026

python/ray/data/tests/test_default_cluster_autoscaler_v2.py Outdated

bveeramani added 2 commits January 21, 2026 10:22

Address review comments

1c2b974

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

Merge branch 'log-info-message' of https://github.com/ray-project/ray …

9a24fcf

…into log-info-message Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani enabled auto-merge (squash) January 21, 2026 19:07

github-actions bot added the go label Jan 21, 2026

machichima reviewed Jan 21, 2026

bveeramani disabled auto-merge January 21, 2026 23:10

owenowenisme approved these changes Jan 22, 2026

Address review comments

fcf97e1

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

bveeramani enabled auto-merge (squash) January 28, 2026 23:03

cursor bot reviewed Jan 28, 2026

python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py Outdated

Fix bug

4b35f73

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

github-actions bot disabled auto-merge January 28, 2026 23:11

cursor bot reviewed Jan 28, 2026

python/ray/data/tests/test_default_cluster_autoscaler_v2.py Outdated

Fix test failure

d656775

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

cursor bot reviewed Jan 28, 2026

bveeramani merged commit cb31660 into master Jan 29, 2026
5 of 6 checks passed

bveeramani deleted the log-info-message branch January 29, 2026 01:39

bveeramani merged 14 commits into master from log-info-message
