
koordlet: define GPU metric struct #343

Merged
koordinator-bot[bot] merged 1 commit into koordinator-sh:main from jasonliu747:gpu-metrics
Jul 8, 2022

jasonliu747 (Member) commented Jul 5, 2022

Signed-off-by: Jason Liu <jasonliu747@gmail.com>

Ⅰ. Describe what this PR does

Define the GPU metric struct in koordlet and add the GPU metrics needed by NodeResourceMetric and PodResourceMetric.

Ⅱ. Does this pull request fix one issue?

Prerequisite for #323

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

Here are two sample outputs, one from nvidia-smi and one from nvidia-smi pmon:

$ nvidia-smi
Tue Jul  5 15:58:02 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:4F:00.0 Off |                    0 |
| N/A   38C    P0    67W / 300W |  32504MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:50:00.0 Off |                    0 |
| N/A   39C    P0    68W / 300W |  32502MiB / 32510MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    160679      C   ...r@#2/workspace/python_bin    32453MiB |
|    1   N/A  N/A    160680      C   ...r@#2/workspace/python_bin    32457MiB |
+-----------------------------------------------------------------------------+
$ nvidia-smi pmon -c 1
# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0     160679     C    90    11     -     -   python_bin
    1     160680     C    93     1     -     -   python_bin
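
For reviewers who want to check the per-process numbers, here is a minimal Go sketch that reads text in the pmon layout shown above and extracts the sm and mem percentages. It is only an illustration: the procUtil type and the parsePmon and parsePct helpers are hypothetical names, and nothing in this PR says that koordlet collects metrics by parsing nvidia-smi output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// procUtil is a hypothetical holder for one data row of `nvidia-smi pmon`.
type procUtil struct {
	GPUIndex int
	PID      int
	SMUtil   int // sm column in percent; -1 when printed as "-"
	MemUtil  int // mem column in percent; -1 when printed as "-"
	Command  string
}

// parsePct converts a percentage column, mapping "-" (not available) to -1.
func parsePct(s string) (int, error) {
	if s == "-" {
		return -1, nil
	}
	return strconv.Atoi(s)
}

// parsePmon parses rows laid out as: gpu pid type sm mem enc dec command.
// Lines starting with '#' are header lines and are skipped.
func parsePmon(out string) ([]procUtil, error) {
	var rows []procUtil
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		f := strings.Fields(line)
		if len(f) < 8 {
			return nil, fmt.Errorf("unexpected pmon row: %q", line)
		}
		if f[1] == "-" { // assumption: a GPU with no process prints "-" as pid
			continue
		}
		gpu, err := strconv.Atoi(f[0])
		if err != nil {
			return nil, err
		}
		pid, err := strconv.Atoi(f[1])
		if err != nil {
			return nil, err
		}
		sm, err := parsePct(f[3])
		if err != nil {
			return nil, err
		}
		mem, err := parsePct(f[4])
		if err != nil {
			return nil, err
		}
		rows = append(rows, procUtil{GPUIndex: gpu, PID: pid, SMUtil: sm, MemUtil: mem, Command: f[7]})
	}
	return rows, sc.Err()
}

func main() {
	sample := `# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0     160679     C    90    11     -     -   python_bin
    1     160680     C    93     1     -     -   python_bin`
	rows, err := parsePmon(sample)
	if err != nil {
		panic(err)
	}
	for _, r := range rows {
		fmt.Printf("gpu=%d pid=%d sm=%d%% mem=%d%%\n", r.GPUIndex, r.PID, r.SMUtil, r.MemUtil)
	}
}
```

The sketch splits on whitespace rather than fixed column offsets, so it relies on the command name containing no spaces.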

Notes:
  • NodeResourceMetric.GPUs.SMUtil corresponds to Volatile GPU-Util (the GPU-Util column) in the output of nvidia-smi.
  • PodResourceMetric.GPUs.SMUtil corresponds to sm (%) in the output of nvidia-smi pmon.
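
To make the node-level and pod-level distinction concrete, the following Go sketch fills both metrics with the values from the sample outputs. It is a rough illustration under assumptions, not the struct definition merged in this PR: only the names NodeResourceMetric, PodResourceMetric, GPUs, and SMUtil come from this description, while GPUMetric, Minor, MemoryUsedMiB, and findGPU are hypothetical.

```go
package main

import "fmt"

// GPUMetric is a hypothetical per-device sample. Only SMUtil is named in
// this description; Minor and MemoryUsedMiB are illustrative assumptions.
type GPUMetric struct {
	Minor         int32  // device index as printed by nvidia-smi (assumed field)
	SMUtil        uint32 // utilization in percent
	MemoryUsedMiB uint64 // used device memory in MiB (assumed field)
}

// NodeResourceMetric and PodResourceMetric are reduced here to a GPUs field.
type NodeResourceMetric struct {
	GPUs []GPUMetric
}

type PodResourceMetric struct {
	GPUs []GPUMetric
}

// findGPU returns the sample for the given device index, if present.
func findGPU(gpus []GPUMetric, minor int32) (GPUMetric, bool) {
	for _, g := range gpus {
		if g.Minor == minor {
			return g, true
		}
	}
	return GPUMetric{}, false
}

func main() {
	// Node level: GPU-Util and Memory-Usage columns of nvidia-smi.
	node := NodeResourceMetric{GPUs: []GPUMetric{
		{Minor: 0, SMUtil: 100, MemoryUsedMiB: 32504},
		{Minor: 1, SMUtil: 100, MemoryUsedMiB: 32502},
	}}
	// Pod level: sm % from nvidia-smi pmon and per-process memory from the
	// Processes table, assuming a single pod owns PID 160679 on GPU 0.
	pod := PodResourceMetric{GPUs: []GPUMetric{
		{Minor: 0, SMUtil: 90, MemoryUsedMiB: 32453},
	}}
	if g, ok := findGPU(node.GPUs, 0); ok {
		fmt.Printf("node gpu0 sm=%d%%\n", g.SMUtil)
	}
	if g, ok := findGPU(pod.GPUs, 0); ok {
		fmt.Printf("pod  gpu0 sm=%d%%\n", g.SMUtil)
	}
}
```

In the samples, GPU 0 reports 100% at the node level while its only process reports 90% in pmon, so the two values come from different sources and need not match.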

Ⅴ. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

koordinator-bot requested review from saintube and stormgbs on July 5, 2022 07:53

jasonliu747 (Member, Author) commented

/cc @zwzhang0107

koordinator-bot requested a review from zwzhang0107 on July 5, 2022 07:53

codecov bot commented Jul 5, 2022

Codecov Report

Merging #343 (8a72667) into main (54ed9a5) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main     #343   +/-   ##
=======================================
  Coverage   64.85%   64.85%           
=======================================
  Files         116      116           
  Lines       11451    11451           
=======================================
  Hits         7426     7426           
  Misses       3440     3440           
  Partials      585      585           

Flag        Coverage Δ
unittests   64.85% <ø> (ø)


Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 54ed9a5...8a72667.

jasonliu747 force-pushed the gpu-metrics branch 2 times, most recently from c3c9448 to 8fbb7e1, on July 6, 2022 09:26

LambdaHJ (Contributor) commented Jul 7, 2022

/lgtm

koordinator-bot commented

@LambdaHJ: changing LGTM is restricted to collaborators


LambdaHJ (Contributor) left a comment

/lgtm

koordinator-bot commented

@LambdaHJ: changing LGTM is restricted to collaborators


honpey (Contributor) commented Jul 8, 2022

/lgtm

Signed-off-by: Jason Liu <jasonliu747@gmail.com>

eahydra (Member) commented Jul 8, 2022

/lgtm
/approve

eahydra (Member) left a comment

/lgtm

hormes (Member) commented Jul 8, 2022

/approve

koordinator-bot commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eahydra, hormes


koordinator-bot merged commit 171ad3e into koordinator-sh:main on Jul 8, 2022
jasonliu747 deleted the gpu-metrics branch on July 8, 2022 14:38
jasonliu747 added this to the v0.6 milestone on Jul 15, 2022
jasonliu747 modified the milestones: v0.6, v0.7 on Jul 26, 2022

6 participants