[core][observability] Report idle node information in status and dashboard #39638

Merged: rickyyx merged 13 commits into ray-project:master from vitsai:idle-state on Sep 25, 2023

Conversation

vitsai (Contributor) commented Sep 13, 2023

Plumbs idle node information through so that it is reflected in ray status, both in the CLI and on the dashboard. This PR does not include additional changes to the cluster tab of the dashboard UI, but it does plumb the status field through to the datasource for dashboard consumption.

Changes to the ray status output:

  • Adds a list of idle nodes
  • In verbose mode, prints node activity (the reasons a node is not idle) for each node

Example output in verbose mode:

======== Autoscaler status: 2023-09-22 23:08:42.399287 ========
GCS request time: 0.000781s

Node status
---------------------------------------------------------------
Active:
 1 node_328da3b1e9273cf946f6ac3dfee9404dacf429566cd0809a7f04f01c
Idle:
 (no idle nodes)
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/36.0 CPU
 0B/36.04GiB memory
 0B/18.02GiB object_store_memory

Total Demands:
 (no resource demands)

Node: 328da3b1e9273cf946f6ac3dfee9404dacf429566cd0809a7f04f01c
 Usage:
  1.0/36.0 CPU
  0B/36.04GiB memory
  0B/18.02GiB object_store_memory
 Activity:
  Resource: CPU currently in use.
  Busy workers on node.
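
The layout above can be approximated with a short formatting routine. What follows is only an illustrative sketch in Python, not the code from this PR: the function names format_node_sections and format_node_activity, the input shapes (plain lists of node type names and activity strings), and the "(no active nodes)" placeholder are assumptions made for the example.

```python
from collections import Counter
from typing import List


def format_node_sections(
    active: List[str], idle: List[str], pending: List[str]
) -> str:
    """Render the Active / Idle / Pending lines of a status report.

    Each argument is a list of node type names (a hypothetical input shape).
    Nodes of the same type collapse into one "<count> <type>" line.
    """

    def render(title: str, node_types: List[str], empty_note: str) -> List[str]:
        lines = [f"{title}:"]
        if not node_types:
            # Mirrors placeholders such as "(no idle nodes)" in the sample output.
            lines.append(f" ({empty_note})")
        for node_type, count in sorted(Counter(node_types).items()):
            lines.append(f" {count} {node_type}")
        return lines

    lines = (
        render("Active", active, "no active nodes")  # placeholder text assumed
        + render("Idle", idle, "no idle nodes")
        + render("Pending", pending, "no pending nodes")
    )
    return "\n".join(lines)


def format_node_activity(node_id: str, activity: List[str]) -> str:
    """Render the verbose-mode Activity block for a single node."""
    lines = [f"Node: {node_id}", " Activity:"]
    lines += [f"  {reason}" for reason in activity]
    return "\n".join(lines)
```

Keeping the idle list separate from the active list means a reader can tell at a glance which nodes are candidates for scale-down, while the per-node Activity block is only rendered when verbose output is requested.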

Why are these changes needed?

Related issue number

#35411

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

rickyyx (Member) left a comment

Looks pretty good to me! Some nits and comments.

So I guess the actual printing or showing on the dashboard will be in another PR?

Signed-off-by: vitsai <vitsai@cs.stanford.edu>

rickyyx (Member) left a comment

Looks pretty good to me!

The only thing I want to double-check is the Snapshot status-setting logic (whether the idle status will get overridden), and whether we could have a test that exercises the new node activity reporting e2e. I think so far it's only being mocked. If e2e is non-trivial, then some manual tests with ray status -v, plus some unit testing, are also fine.

And a couple of nits, including:

  • Could you write in the PR description what's being changed here in terms of output?
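
For the manual check with ray status -v mentioned above, one option is to capture the verbose output and extract the idle list and per-node activity from the text. The sketch below is a hypothetical helper, not part of Ray: the name parse_status is made up, it relies only on the plain-text layout shown in the PR description, and it assumes idle nodes are listed as "<count> <type>" lines in the same way active nodes are.

```python
from typing import Dict, List, Optional, Tuple


def parse_status(text: str) -> Tuple[List[str], Dict[str, List[str]]]:
    """Extract idle node lines and per-node activity from status text.

    Returns (idle_lines, activity_by_node). Hypothetical helper that only
    understands the plain-text layout shown in the PR description.
    """
    idle: List[str] = []
    activity: Dict[str, List[str]] = {}
    section: Optional[str] = None  # most recent "Something:" header
    node: Optional[str] = None     # most recent "Node: <id>" header

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Node: "):
            node = line[len("Node: "):]
            activity[node] = []
            section = None
        elif line.endswith(":"):
            # Section headers such as "Idle:", "Usage:", "Activity:".
            section = line[:-1]
        elif not line or line.startswith("-"):
            # Blank lines and separator rules end the current section.
            section = None
        elif section == "Idle" and not line.startswith("("):
            # Skip the "(no idle nodes)" placeholder.
            idle.append(line)
        elif section == "Activity" and node is not None:
            activity[node].append(line)
    return idle, activity
```

A manual test could then assert that a node running a task reports a non-empty activity list and that the same node shows up under Idle once the task finishes.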

rickyyx (Member) commented Sep 22, 2023

Also, I guess the PR doesn't update the dashboard view yet? Or is it handled automatically by the changes in this PR?

i.e. the active status:

[image]

jjyao (Collaborator) left a comment

Could you add the PR description?

Ideally also show some example outputs of status and dashboard after your change.

vitsai (Contributor, Author) commented Sep 22, 2023

[image: Screenshot 2023-09-22 at 10 19 52 AM]

Right now, the idle state is reflected in the dashboard here. I wasn't sure about adding it to that part of the cluster tab, because that part displays information at per-worker granularity, whereas we have idle information at per-node granularity.

Signed-off-by: vitsai <vitsai@cs.stanford.edu>

rickyyx (Member) left a comment

Is the idle time thing being done outside of this PR? If so, let's document it in a TODO.

switch (std::get<WorkFootprint>(iter.first)) {
case WorkFootprint::NODE_WORKERS:
node_activity << " Node currently has leased workers." << std::endl;
resources_data.add_node_activity("Busy workers on node.");
Member

Suggested change:
- resources_data.add_node_activity("Busy workers on node.");
+ resources_data.add_node_activity("Active workers.");

?

rickyyx (Member) commented Sep 23, 2023

rickyyx (Member) left a comment

[GIF via GIPHY]

Signed-off-by: vitsai <vitsai@cs.stanford.edu>
@rickyyx rickyyx merged commit a2dedf1 into ray-project:master Sep 25, 2023