[Data] Info log cluster scale up decisions #60357
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
python/ray/data/_internal/cluster_autoscaler/default_cluster_autoscaler_v2.py
active_bundles, pending_bundles, self._resource_limits
)

if resource_request != active_bundles:
Can we unify the naming? I find the mix of "bundle" and "request" naming confusing.
Discussed offline -- will address in follow-up
resource_limits=resource_limits,
execution_id=execution_id,
resource_limits=resource_limits,
This is part of a drive-by change that makes resource_limits an optional parameter, so tests need less boilerplate.
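A minimal sketch of what an optional `resource_limits` parameter could look like. The `Limits` and `AutoscalerSketch` names are hypothetical stand-ins, not the actual Ray Data classes, and the "no limit" default of infinity is an assumption made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limits:
    """Hypothetical stand-in for the real resource-limits type."""

    cpu: float = math.inf
    gpu: float = math.inf
    object_store_memory: float = math.inf


class AutoscalerSketch:
    """Hypothetical class; only illustrates the optional-parameter pattern."""

    def __init__(
        self,
        execution_id: str,
        resource_limits: Optional[Limits] = None,
    ):
        self._execution_id = execution_id
        # Assumption: omitting the argument means "unlimited".
        self._resource_limits = (
            resource_limits if resource_limits is not None else Limits()
        )
```

With a default in place, a test can construct the object with `AutoscalerSketch(execution_id="test")` and pass limits only when the test is about capping.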
…utoscaler_v2.py Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…into log-info-message Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
f"object_store_memory={current_utilization.object_store_memory:.0%}. "
"Requesting one node of each shape:"
I think the number of nodes is controlled by self._cluster_scaling_up_delta? Should we also pass delta_count = int(math.ceil(self._cluster_scaling_up_delta)) into this function and print it here?
Good catch. That was an oversight.
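The suggestion above can be sketched as follows. The helper names are hypothetical, and treating the delta as a possibly fractional number is an assumption taken from the reviewer's use of `math.ceil`.

```python
import math


def nodes_per_shape(cluster_scaling_up_delta: float) -> int:
    """Round the configured delta up to a whole number of nodes."""
    return int(math.ceil(cluster_scaling_up_delta))


def scale_up_header(cluster_scaling_up_delta: float) -> str:
    """Build the header text using the rounded node count."""
    count = nodes_per_shape(cluster_scaling_up_delta)
    return f"Requesting {count} node(s) of each shape:"
```

Printing the rounded count keeps the header in whole nodes even when the configured delta is fractional.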
f"specified threshold of {self._cluster_scaling_up_util_threshold:.0%}: "
f"CPU={current_utilization.cpu:.0%}, GPU={current_utilization.gpu:.0%}, "
f"object_store_memory={current_utilization.object_store_memory:.0%}. "
f"Requesting {self._cluster_scaling_up_delta} node(s) of each shape:"
Log message claims wrong number of nodes after capping
Low Severity
The log message states "Requesting {_cluster_scaling_up_delta} node(s) of each shape" but this value reflects the configured delta before resource limit capping. After cap_resource_request_to_limits is applied, the actual number of nodes added per shape may be less than _cluster_scaling_up_delta. The detailed counts (e.g., "[{...}: 1 -> 1]") show the correct values, but the header text creates a confusing mismatch.
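One way to avoid the mismatch described above is to derive the header from the capped request instead of the configured delta. The sketch below assumes bundles are plain dicts of resource name to amount; `describe_scale_up` is a hypothetical helper, not a function in the PR, and the wording of its message is illustrative.

```python
from collections import Counter
from typing import Dict, List


def describe_scale_up(
    before: List[Dict[str, float]],
    after_capping: List[Dict[str, float]],
) -> str:
    """Summarize per-shape counts using the request that survived capping."""

    def count_shapes(bundles: List[Dict[str, float]]) -> Counter:
        # Dicts are unhashable, so use a sorted tuple of items as the key.
        return Counter(tuple(sorted(b.items())) for b in bundles)

    before_counts = count_shapes(before)
    after_counts = count_shapes(after_capping)
    shapes = sorted(set(before_counts) | set(after_counts))
    details = ", ".join(
        f"{dict(shape)}: {before_counts[shape]} -> {after_counts[shape]}"
        for shape in shapes
    )
    # The header count comes from the capped request, so it cannot
    # disagree with the per-shape details.
    added = sum(after_counts.values()) - sum(before_counts.values())
    return f"Requesting {added} additional node(s) after capping: [{details}]"
```

If capping removes every added node, the header reports 0 additional nodes, which matches a detail line such as `1 -> 1`.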
> I am thinking there is very limited visibility into the autoscaling decisions, currently have to look through DEBUG logs FWICT. Adding some visibility in terms of metrics and events would be nice, and promoting key action logs to INFO would be my high-level suggestion

This PR logs a message to STDOUT whenever the autoscaler decides to scale up the cluster.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
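As a rough illustration of promoting the scale-up decision from DEBUG to INFO, the sketch below uses only the standard `logging` module. The logger name, the `Utilization` dataclass, and the rule that any single resource reaching the threshold triggers a scale-up are assumptions for illustration, not the PR's implementation.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("autoscaler_sketch")  # hypothetical logger name


@dataclass
class Utilization:
    """Hypothetical container for fractional utilization (0.0 to 1.0)."""

    cpu: float
    gpu: float
    object_store_memory: float


def maybe_log_scale_up(util: Utilization, threshold: float) -> bool:
    """Log at INFO when scaling up; stay at DEBUG otherwise."""
    peak = max(util.cpu, util.gpu, util.object_store_memory)
    if peak < threshold:
        # Routine "nothing to do" checks stay at DEBUG to avoid noise.
        logger.debug("Utilization below threshold; not scaling up.")
        return False
    logger.info(
        f"Cluster utilization reached the specified threshold of "
        f"{threshold:.0%}: CPU={util.cpu:.0%}, GPU={util.gpu:.0%}, "
        f"object_store_memory={util.object_store_memory:.0%}."
    )
    return True
```

The `:.0%` format spec multiplies the fraction by 100 and appends a percent sign, so `0.9` renders as `90%`.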