@Flamefire
Contributor

Flamefire commented Oct 17, 2023

(created using eb --new-pr)

I found this bug when the sanity check (ucx_info -d) crashed for UCX-CUDA on PPC. I traced the crash to a bug in core UCX.

I added a patch to fix this and verified that UCX-CUDA now works even on PPC.

A short analysis of the issue is in the patch; a longer one is in openucx/ucx#9392.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/9333102ef6f626248705fdbcc51925a5 for a full test report.

Micket added the bug fix label Oct 17, 2023
Micket added this to the next release (4.8.2?) milestone Oct 17, 2023
@Micket
Contributor

Micket commented Oct 17, 2023

Test report by @Micket
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/Micket/45c10c0dd9d9b9c71d2a778c0cbd3c57 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
taurusi8017 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/906544173201ec9c274e3b6224c2703f for a full test report.

@Micket
Contributor

Micket commented Oct 17, 2023

@boegelbot please test @ generoso

@boegelbot
Collaborator

@Micket: Request for testing this PR well received on login1

PR test command 'EB_PR=19023 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19023 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11950

Test results coming soon (I hope)...

Details

- notification for comment with ID 1766878462 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Micket
Contributor

Micket commented Oct 17, 2023

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@Micket: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19023 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19023 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3566

Test results coming soon (I hope)...

Details

- notification for comment with ID 1766887430 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/8dc1f29d555a774ad75648d068ac893c for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
cns1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/bc1349d4075ec8a3268b11fa81167b25 for a full test report.

Member

boegel left a comment

lgtm

@boegel
Member

boegel commented Oct 17, 2023

Test report by @boegel
SUCCESS
Build succeeded for 6 out of 6 (6 easyconfigs in total)
node3130.skitty.os - Linux RHEL 8.8, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/8d75366b43d3955d147a42800d3e8623 for a full test report.

@boegel
Member

boegel commented Oct 17, 2023

Going in, thanks @Flamefire!

boegel merged commit 03affe1 into easybuilders:develop Oct 17, 2023
Flamefire deleted the 20231017130616_new_pr_UCX1110 branch October 17, 2023 21:01