
Conversation

@pmtk (Member) commented Mar 11, 2025

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 11, 2025

@openshift-ci-robot commented Mar 11, 2025

@pmtk: This pull request references USHIFT-5339 which is a valid jira issue.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from eslutsky and jogeo March 11, 2025 11:29
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 11, 2025

@pmtk (Member Author) commented Mar 12, 2025

/retest

@pmtk pmtk force-pushed the rhoai/aws-nvidia-test branch from fddec8e to 238e19e Compare March 12, 2025 10:09

@pmtk (Member Author) commented Mar 12, 2025

/retest


set -xeuo pipefail

sudo reboot now

Contributor:

I'm not sure reboot accepts arguments. I think just sudo reboot would be enough.
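
For reference, a minimal sketch of the plain invocation and the classic long form that does document a time argument:

# Plain reboot, no arguments needed
sudo reboot

# Long form; "now" is a time argument to shutdown, not to reboot
sudo shutdown -r now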

--enable=codeready-builder-for-rhel-9-x86_64-rpms

sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo

Contributor:

Suggested change
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/$(uname -m)/cuda-rhel9.repo

capacity=$(oc get node -o json | jq -r '.items[0].status.capacity')
gpus=$(echo "${capacity}" | jq -r '."nvidia.com/gpu"')

if [[ "${gpus}" != "1" ]]; then

Contributor:

Should we check that it's not empty? I mean, there may be more than one GPU on the node.
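
A minimal sketch of a more defensive check, reusing the gpus variable from the snippet above (the error message wording is hypothetical):

# Require the capacity field to exist and report at least one GPU
if [[ -z "${gpus}" || "${gpus}" == "null" || ! "${gpus}" =~ ^[0-9]+$ || "${gpus}" -lt 1 ]]; then
    echo "Expected at least one nvidia.com/gpu in node capacity, got: '${gpus}'"
    exit 1
fi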


# Get the logs
pod=$(oc get pods -n cuda-test --selector=batch.kubernetes.io/job-name=test-cuda-vector-add --output=jsonpath='{.items[*].metadata.name}')
logs=$(oc logs -n cuda-test "${pod}")

Contributor:

I'm a bit wary of storing entire log outputs in a variable. Should we use a tmp file instead?

Member Author:

It's only like 6 lines, but sure we can do it

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
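
If the tmp-file route is taken, a minimal sketch (reusing the pod variable from the snippet above; the file name is left to mktemp) could look like:

# Write the job output to a temporary file instead of holding it in a variable
logfile=$(mktemp)
trap 'rm -f "${logfile}"' EXIT
oc logs -n cuda-test "${pod}" > "${logfile}"
cat "${logfile}"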

oc delete job -n cuda-test test-cuda-vector-add
oc delete ns cuda-test

if ! echo "${logs}" | grep -q PASSED; then

Contributor:

Could we make it more precise, i.e. ^PASSED or the like?

Member Author:

This is the output.

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Would ^Test PASSED$ be better?
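
A minimal sketch of the stricter match, assuming the output quoted above (the pattern is anchored to the whole line):

# Require the exact "Test PASSED" line rather than any PASSED substring
if ! echo "${logs}" | grep -q '^Test PASSED$'; then
    echo "CUDA vector-add test did not report Test PASSED"
    exit 1
fi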

"model": "granite-3b-code-base-2k",
"prompt": "Once upon a time,",
"max_tokens": 256,
"temperature": 0.5}' | jq

Contributor:

Should we check that the answer is not empty?

Member Author:

Yeah, that's a good idea for double-checking it.
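
A minimal sketch of such a check. It assumes the server returns an OpenAI-style completions payload with a .choices[0].text field and that the curl output from above is captured in a response variable (both are assumptions, not taken from the PR):

# Extract the generated text and fail if it is missing or empty
answer=$(echo "${response}" | jq -r '.choices[0].text // empty')
if [[ -z "${answer}" ]]; then
    echo "Model returned an empty completion"
    exit 1
fi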

RUN pip install huggingface-hub

# Download the model file from hugging face
WORKDIR /tmp

Contributor:

Is the workdir change necessary? The model seems to be downloaded to /model

Member Author:

hadolint complaint:

 ./scripts/ci-ai-model-serving/tests/vllm-image/Containerfile:13 DL3045 warning: `COPY` to a relative destination without `WORKDIR` set. 

So it was either WORKDIR /tmp or

COPY download_model.py /tmp
RUN python /tmp/download_model.py

but I figured that WORKDIR would work without retesting (laziness).
Do you prefer adding /tmp to COPY and RUN?

@pmtk pmtk force-pushed the rhoai/aws-nvidia-test branch from 238e19e to ff70ff8 Compare March 13, 2025 11:16
@pmtk pmtk force-pushed the rhoai/aws-nvidia-test branch from ff70ff8 to f4ab69f Compare March 13, 2025 11:18

@ggiguash (Contributor)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 13, 2025

openshift-ci bot commented Mar 13, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ggiguash, pmtk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Mar 13, 2025

@pmtk: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit d50c1b8 into openshift:main Mar 13, 2025
9 checks passed
@pmtk pmtk deleted the rhoai/aws-nvidia-test branch March 14, 2025 08:00