[Data] Info log cluster scale up decisions#60357

Merged
bveeramani merged 14 commits into master from log-info-message
Jan 29, 2026
@bveeramani
Member

> I am thinking there is very limited visibility into the autoscaling decisions, currently have to look through DEBUG logs FWICT. Adding some visibility in terms of metrics and events would be nice, and promoting key action logs to INFO would be my high-level suggestion

This PR logs a message to STDOUT whenever the autoscaler decides to scale up the cluster.
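
As a rough illustration of the idea (not the code from this PR), promoting a decision log from DEBUG to INFO with Python's standard `logging` module might look like the sketch below. The logger name, the helper names `build_logger` and `log_scale_up`, and the message wording are all assumptions made for the example.

```python
import io
import logging


def build_logger(stream, level=logging.INFO):
    """Return a logger writing to `stream` (hypothetical helper for this sketch)."""
    logger = logging.getLogger("example.autoscaler")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def log_scale_up(logger, utilization, threshold):
    """Log the scale-up decision at INFO; log the no-op case at DEBUG."""
    if utilization >= threshold:
        logger.info(
            "Utilization %.0f%% reached threshold %.0f%%; requesting scale-up.",
            utilization * 100,
            threshold * 100,
        )
        return True
    logger.debug("Utilization %.0f%% is below threshold; no scale-up.", utilization * 100)
    return False
```

With the logger level set to INFO, only the scale-up decision is emitted; the DEBUG line stays hidden unless the level is lowered.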

bveeramani and others added 7 commits January 20, 2026 16:39
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner January 21, 2026 05:08
Base automatically changed from stop-throttle-utilization-checks to reorganize-constructor January 21, 2026 05:22
active_bundles, pending_bundles, self._resource_limits
)

if resource_request != active_bundles:
Member

Can we unify the naming? I'm confused by the bundle and request naming.

Member Author

Discussed offline; will address in a follow-up.

@ray-gardener ray-gardener bot added the core, data, and observability labels Jan 21, 2026
Base automatically changed from reorganize-constructor to master January 21, 2026 07:28
Comment on lines -43 to +44
- resource_limits=resource_limits,
  execution_id=execution_id,
+ resource_limits=resource_limits,
Member Author

This is part of a drive-by change to make resource_limits an optional parameter so there's less boilerplate for tests
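
To illustrate why an optional parameter cuts test boilerplate, here is a minimal sketch. `Limits` and `ExampleAutoscaler` are hypothetical stand-ins invented for this example, not the classes touched by this PR, and the "unlimited" default is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical stand-in for a resource-limits type; defaults mean 'unlimited'."""

    cpu: float = math.inf
    gpu: float = math.inf


class ExampleAutoscaler:
    """Hypothetical class showing an optional `resource_limits` parameter."""

    def __init__(self, execution_id: str, resource_limits: Optional[Limits] = None):
        self._execution_id = execution_id
        # Fall back to unlimited resources when the caller passes nothing.
        self._resource_limits = resource_limits if resource_limits is not None else Limits()
```

A test that does not care about limits can then construct `ExampleAutoscaler("exec-1")` without building a limits object first.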

…utoscaler_v2.py

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…into log-info-message

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) January 21, 2026 19:07
@github-actions github-actions bot added the go label Jan 21, 2026
f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
f"object_store_memory={current_utilization.object_store_memory:.0%}. "
"Requesting one node of each shape:"
Contributor

I think the number of nodes is controlled by self._cluster_scaling_up_delta? Should we also pass delta_count = int(math.ceil(self._cluster_scaling_up_delta)) into this function and print it here?

Member Author

Good catch. That was an oversight.
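
The suggestion in this thread can be sketched as follows. The function name `format_request_clause` is hypothetical; only the `int(math.ceil(...))` rounding and the "node(s) of each shape" wording come from the comments and snippets in this conversation.

```python
import math


def format_request_clause(scaling_up_delta):
    """Build the closing clause of the log message from the configured delta."""
    # Round a fractional delta up to a whole number of nodes, as suggested above.
    delta_count = int(math.ceil(scaling_up_delta))
    return f"Requesting {delta_count} node(s) of each shape:"
```

For example, `format_request_clause(1.5)` yields `"Requesting 2 node(s) of each shape:"`, since a fractional delta rounds up.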

@bveeramani bveeramani disabled auto-merge January 21, 2026 23:10
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) January 28, 2026 23:03
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@github-actions github-actions bot disabled auto-merge January 28, 2026 23:11
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
f"object_store_memory={current_utilization.object_store_memory:.0%}. "
f"Requesting {self._cluster_scaling_up_delta} node(s) of each shape:"

Log message claims wrong number of nodes after capping

Low Severity

The log message states "Requesting {_cluster_scaling_up_delta} node(s) of each shape" but this value reflects the configured delta before resource limit capping. After cap_resource_request_to_limits is applied, the actual number of nodes added per shape may be less than _cluster_scaling_up_delta. The detailed counts (e.g., "[{...}: 1 -> 1]") show the correct values, but the header text creates a confusing mismatch.
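
One way to avoid the mismatch described above is to derive the header from the capped counts rather than from the configured delta. The sketch below is an illustration under stated assumptions: `plan_scale_up`, `format_plan`, and the single `max_per_shape` cap are hypothetical simplifications and do not reproduce the behavior or signature of `cap_resource_request_to_limits`.

```python
def plan_scale_up(current_counts, delta, max_per_shape):
    """Map each node shape to (current, target) counts after applying a cap.

    `max_per_shape` is a hypothetical stand-in for real resource limits.
    """
    plan = {}
    for shape, current in current_counts.items():
        # Add `delta` nodes, cap the result, and never drop below the current count.
        target = max(current, min(current + delta, max_per_shape))
        plan[shape] = (current, target)
    return plan


def format_plan(plan):
    """Report the number of nodes actually added, followed by per-shape details."""
    added = sum(target - current for current, target in plan.values())
    details = ", ".join(
        f"[{shape}: {current} -> {target}]" for shape, (current, target) in plan.items()
    )
    return f"Requesting {added} additional node(s) after capping: {details}"
```

Because the header count is computed from the same capped targets as the per-shape details, the two cannot disagree.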


@bveeramani bveeramani merged commit cb31660 into master Jan 29, 2026
5 of 6 checks passed
@bveeramani bveeramani deleted the log-info-message branch January 29, 2026 01:39
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026