USHIFT-5339: Test AI Model Serving with NVIDIA GPU #4659
Conversation
@pmtk: This pull request references USHIFT-5339 which is a valid jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest
(force-pushed: fddec8e → 238e19e)
/retest
```bash
set -xeuo pipefail
# ...
sudo reboot now
```
I'm not sure `reboot` accepts arguments. I think just `sudo reboot` would be enough.
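If that lands, it could be expressed as a one-line GitHub suggestion (a sketch of the reviewer's proposal, not the final commit):

```suggestion
sudo reboot
```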
```bash
    --enable=codeready-builder-for-rhel-9-x86_64-rpms

sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
```
Suggested change:

```suggestion
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -m)/cuda-rhel9.repo
```
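The point of `$(uname -m)` is to make the repo URL architecture-agnostic. A minimal sketch of how that could be used defensively; the `curl` probe is my addition, not part of the PR:

```bash
# Resolve the repo for whatever architecture the host reports (x86_64, aarch64, ...).
arch=$(uname -m)
url="https://developer.download.nvidia.com/compute/cuda/repos/rhel9/${arch}/cuda-rhel9.repo"

# Optional sanity check: fail fast if NVIDIA does not publish a repo
# file for this architecture before adding it to dnf.
curl -fsSI "${url}" >/dev/null
sudo dnf config-manager --add-repo "${url}"
```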
```bash
capacity=$(oc get node -o json | jq -r '.items[0].status.capacity')
gpus=$(echo "${capacity}" | jq -r '."nvidia.com/gpu"')

if [[ "${gpus}" != "1" ]]; then
```
Should we check that it's not empty? I mean, there may be more than one GPU on the node.
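A sketch of that looser check, assuming the same `capacity` JSON as above (the error-message wording is mine):

```bash
# Accept any node that reports at least one GPU instead of exactly one.
# `// empty` maps a missing "nvidia.com/gpu" key to an empty string.
gpus=$(echo "${capacity}" | jq -r '."nvidia.com/gpu" // empty')
if [[ -z "${gpus}" || "${gpus}" -lt 1 ]]; then
    echo "ERROR: node reports no nvidia.com/gpu capacity (got '${gpus:-none}')" >&2
    exit 1
fi
```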
```bash
# Get the logs
pod=$(oc get pods -n cuda-test --selector=batch.kubernetes.io/job-name=test-cuda-vector-add --output=jsonpath='{.items[*].metadata.name}')
logs=$(oc logs -n cuda-test "${pod}")
```
I'm a bit wary of storing entire log outputs in a variable. Should we use a tmp file instead?
It's only like 6 lines, but sure, we can do it:

```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```
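A sketch of the tmp-file variant being agreed on here, reusing the pod lookup from the hunk above; the `trap` cleanup is my addition:

```bash
# Write the job logs to a temporary file instead of holding them in a variable.
logfile=$(mktemp)
trap 'rm -f "${logfile}"' EXIT

pod=$(oc get pods -n cuda-test --selector=batch.kubernetes.io/job-name=test-cuda-vector-add --output=jsonpath='{.items[*].metadata.name}')
oc logs -n cuda-test "${pod}" > "${logfile}"
```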
```bash
oc delete job -n cuda-test test-cuda-vector-add
oc delete ns cuda-test

if ! echo "${logs}" | grep -q PASSED; then
```
Could we make it more precise, i.e. `^PASSED` or the like?
This is the output:

```
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Would `^Test PASSED$` be better?
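A sketch of the anchored check, assuming the tmp-file approach from the earlier thread (the `${logfile}` name is carried over from that sketch):

```bash
# Match the exact status line so a stray "PASSED" elsewhere in the log
# cannot make the check succeed.
if ! grep -q '^Test PASSED$' "${logfile}"; then
    echo "ERROR: CUDA vector-add test did not pass" >&2
    exit 1
fi
```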
"model": "granite-3b-code-base-2k", | ||
"prompt": "Once upon a time,", | ||
"max_tokens": 256, | ||
"temperature": 0.5}' | jq |
Should we check the answer is not empty?
Yeah, that's a good idea for double-checking it.
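A sketch of such a check under stated assumptions: `${ENDPOINT}` is a hypothetical placeholder for the serving URL used in the PR, and `.choices[0].text` follows the OpenAI-style completions response that vLLM returns:

```bash
# ENDPOINT is a placeholder, not the actual URL from the PR.
response=$(curl -s "${ENDPOINT}/v1/completions" \
    -H 'Content-Type: application/json' \
    -d '{"model": "granite-3b-code-base-2k",
         "prompt": "Once upon a time,",
         "max_tokens": 256,
         "temperature": 0.5}')

# Fail if the completion text is missing or empty.
text=$(echo "${response}" | jq -r '.choices[0].text // empty')
if [[ -z "${text}" ]]; then
    echo "ERROR: model returned an empty completion" >&2
    exit 1
fi
```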
```dockerfile
RUN pip install huggingface-hub

# Download the model file from hugging face
WORKDIR /tmp
```
Is the `WORKDIR` change necessary? The model seems to be downloaded to `/model`.
hadolint complaint:

```
./scripts/ci-ai-model-serving/tests/vllm-image/Containerfile:13 DL3045 warning: `COPY` to a relative destination without `WORKDIR` set.
```

So it was either `WORKDIR /tmp` or

```dockerfile
COPY download_model.py /tmp
RUN python /tmp/download_model.py
```

but I figured that `WORKDIR` would work without retesting (laziness). Do you prefer adding `/tmp` to `COPY` and `RUN`?
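For comparison, a sketch of the `WORKDIR` variant that was kept; the exact `COPY`/`RUN` lines are my reconstruction from the hadolint warning at Containerfile:13:

```dockerfile
# With WORKDIR set, the relative COPY destination no longer trips DL3045.
WORKDIR /tmp
COPY download_model.py .
RUN python download_model.py
```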
(force-pushed: 238e19e → ff70ff8)
(force-pushed: ff70ff8 → f4ab69f)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash, pmtk

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
@pmtk: all tests passed!

Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.