@akesandgren
Contributor

@akesandgren akesandgren commented Jul 7, 2023

(created using eb --new-pr)

This is probably not enough for a fully working ROCm support, but it's at least part of it.

Depends on:

@akesandgren
Contributor Author

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2958
SUCCESS
Build succeeded for 5 out of 5 (5 easyconfigs in total)
b-cn1605.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, Python 3.10.6
See https://gist.github.com/akesandgren/495e74a6ad650125a60eaded50177865 for a full test report.

'dirs': [],
}
sanity_check_commands = [
'rocminfo --help',
Member

This fails for me:

$ rocminfo --help
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
branfosj is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.

Not sure if this means this package is unsuitable for an easyconfig or if we should skip this sanity check.

Contributor Author

Just as for CUDA the AMDGPU device needs proper permissions.

We have this:

cat /etc/udev/rules.d/71-amdgpu.rules 
# Fix AMD GPU device permissions
#
# This file is managed by puppet, local changes will be overwritten.
#
SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="render", MODE="0666"
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"

Contributor Author

Without it, you would need to explicitly add all users that want to use it to the "video" group
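
Whether a sanity check like rocminfo --help can succeed on a node comes down to whether the current user can open the KFD device read-write. A minimal sketch of such a pre-check, assuming the device node is /dev/kfd as in the error message above; the helper name kfd_accessible is made up for illustration and is not part of EasyBuild or ROCm:

```python
import os


def kfd_accessible(path="/dev/kfd"):
    """Return True if `path` exists and this process may read and write it."""
    # os.access evaluates permissions for the real uid/gid of this process,
    # which includes group memberships such as "video" or "render".
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if kfd_accessible():
        print("/dev/kfd is read-write for this user")
    else:
        print("/dev/kfd is missing or not read-write for this user")
```

On a node without an AMD GPU the device node does not exist at all, so the function returns False there as well.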

"-DCLR_BUILD_HIP=ON",
"-DCLR_BUILD_OCL=OFF",
"-DOFFLOAD_ARCH_STR='--offload-arch=%s'" % local_default_gfx,
]
Contributor

This easyconfig also worked pretty well for me, except that here too the resulting libraries, like libamdhip64.so, turned out to be linked to system libraries (such as /lib64/libnuma.so.1, /lib64/libstdc++.so.6, ...). Do you see this as well? From what I could tell, this is due to the stripping of RPATH information in the install step. Adding an extra "-DCMAKE_SKIP_RPATH=ON" option here prevents this from happening.
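
The check described here (which dependencies of an installed library resolve to system directories) can be scripted. A rough sketch, assuming the usual "name => path (address)" layout of ldd output; the function names and the list of system prefixes are illustrative choices, not something taken from this PR:

```python
import subprocess

# Directories treated as "system" locations; adjust for the distribution at hand.
SYSTEM_PREFIXES = ("/lib/", "/lib64/", "/usr/lib/", "/usr/lib64/")


def system_libs(ldd_output, prefixes=SYSTEM_PREFIXES):
    """Return (name, path) pairs from ldd output that resolve to system directories."""
    hits = []
    for line in ldd_output.splitlines():
        # Typical line: "libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f...)"
        if "=>" not in line:
            continue
        name, _, rest = line.partition("=>")
        fields = rest.split()
        if fields and fields[0].startswith(prefixes):
            hits.append((name.strip(), fields[0]))
    return hits


def check_library(path):
    """Run ldd on `path` and return its system-resolved dependencies."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return system_libs(result.stdout)
```

Entries such as libc.so.6 are expected to resolve to the system; the interesting hits are libraries like libnuma or libstdc++ that the loaded modules should be providing.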

Contributor Author

I'll take a look at both of these as soon as I find some time. I think there is something called Christmas coming up fairly soon...

Contributor Author

I see no trace of this once the module is loaded.

Did you run the ldd without the module loaded?

CMAKE_SKIP_RPATH is only set to ON in the cmakemake easyblock if CMake < 3.5.0 and build_option('rpath') is set. And for this one CMake is 3.23...
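
Spelled out, the condition described above looks roughly like this. This is only a sketch of the rule as stated in this comment, not the code of the cmakemake easyblock; both function names are invented for illustration:

```python
def parse_version(version):
    """Turn '3.23.1' into (3, 23, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def skip_rpath_would_be_set(cmake_version, rpath_enabled):
    """True only when CMake is older than 3.5.0 and the rpath build option is set."""
    return bool(rpath_enabled) and parse_version(cmake_version) < (3, 5, 0)
```

With CMake 3.23 the version test is already false, so the result is False regardless of whether --rpath is used.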

Contributor

@MaximeVdB MaximeVdB Jan 9, 2024

I was running ldd after loading the module. After having a closer look I also managed to resolve it via the following patch (so instead of introducing -DCMAKE_SKIP_RPATH=ON in the easyconfig):

--- clr-rocm-5.6.0/hipamd/CMakeLists.txt.orig   2024-01-09 12:13:38.465256148 +0100
+++ clr-rocm-5.6.0/hipamd/CMakeLists.txt        2024-01-09 12:18:05.463862147 +0100
@@ -41,7 +41,7 @@

 # required to add the right link to libhsa-runtime in install/lib path
 # CMAKE_PREFIX_PATH is used as rpath to search for libs outside HIP
-set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}")
+set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}")
 set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)

 #############################

To me, this set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}") line looked strange because EasyBuild sets ${CMAKE_INSTALL_RPATH} equal to a list of semicolon-separated software directories. So in this case appending something like "/lib" to it and using that as CMAKE_INSTALL_RPATH does not seem quite right. It appears that this eventually caused the following kind of runtime path modifications in the install step:

...
-- Installing: <snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5.6.31061
-- Installing: <snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5
-- Set runtime path of "<snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5.6.31061" to "/lib64"
...

which then led to the installed libraries linking to system libraries.
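
To illustrate why appending "/${CMAKE_INSTALL_LIBDIR}" to a list-valued variable is suspicious: CMake lists are plain semicolon-separated strings, so the suffix ends up on the last entry only, and on an empty value it yields a bare "/lib64". The Python sketch below mimics that string expansion; it is not CMake itself, the function name and example paths are made up, and which of the two cases applied in the build above is not established here (the "/lib64" in the log is consistent with the empty case):

```python
def install_rpath(cmake_prefix_path, libdir="lib64"):
    """Mimic "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}" as plain string expansion."""
    # The suffix is glued onto the end of the whole string, i.e. onto the
    # last list entry only; earlier entries are left without a lib directory.
    expanded = cmake_prefix_path + "/" + libdir
    return expanded.split(";")


if __name__ == "__main__":
    print(install_rpath(""))                   # empty prefix path
    print(install_rpath("/sw/numactl;/sw/GCCcore"))  # hypothetical two-entry list
```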

At the moment I don't really have an idea why it is not occurring on your side. I suppose that you don't see Set runtime path ... messages in the EasyBuild logfile? Could you perhaps share that logfile?

(disclaimer: my CMake knowledge is rather limited)

Contributor Author

I do indeed see the same problem when I look at the files correctly, and the "Set runtime path" message. I'll try the patch and will check again...

Contributor Author

So, with that patch there are no longer any "Set runtime path" messages, and the RPATH info in at least libamdhip64.so is empty.
Now I'll try to rebuild it with --rpath set...
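
For checking the RPATH/RUNPATH entries of an installed library, the dynamic section printed by readelf -d can be parsed. A small sketch, assuming the usual binutils output format ("(RUNPATH) Library runpath: [...]"); the function names are illustrative:

```python
import re
import subprocess

# Matches the RPATH and RUNPATH lines of `readelf -d` output.
_RPATH_RE = re.compile(r"\((RPATH|RUNPATH)\)\s+Library r(?:un)?path: \[(.*)\]")


def runtime_paths(readelf_output):
    """Return a dict mapping 'RPATH'/'RUNPATH' to a list of directories, for tags present."""
    found = {}
    for line in readelf_output.splitlines():
        match = _RPATH_RE.search(line)
        if match:
            tag, value = match.groups()
            found[tag] = value.split(":") if value else []
    return found


def read_runtime_paths(library):
    """Run `readelf -d` on `library` and extract its runtime search paths."""
    result = subprocess.run(["readelf", "-d", library], capture_output=True, text=True, check=True)
    return runtime_paths(result.stdout)
```

An empty dict means the library carries neither tag, which matches the "RPATH info is empty" observation; a value like ['/lib64'] would correspond to the "Set runtime path ... to /lib64" messages quoted earlier.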

Contributor Author

Seems to do good things with --rpath too.

@MaximeVdB
Contributor

One more thing (since you're touching upon rocm-smi and ROCm support):

For ROCm 4.5.0 there is a ROCm-4.5.0-GCCcore-11.2.0 easyconfig which bundles HIP, Clang-AOMP, ROCR-Runtime, etcetera. In the present PR, more components got added to the HIP easyconfig and loading the resulting HIP module will seemingly provide everything that such a ROCm module would provide, except for rocm-smi.

So... is the idea to no longer provide ROCm easyconfigs and that one should use HIP as dependency instead (and rocm-smi if needed)?

@akesandgren
Contributor Author

I do have a ROCm EC for 5.6.0 prepared; the reason I haven't submitted it is that I wasn't sure if anything was still missing from it.
It also currently contains more than it needs to. I'll clean it up and push it.

@akesandgren
Contributor Author

@MaximeVdB See #19591

@hattom
Contributor

hattom commented Jan 17, 2024

I installed Clang-AOMP/5.6.0, which includes a sanity check command on flang in the easyblock.
Then I recently tried to build something with flang, which complains about missing flang1 and flang2 commands.

Can you check if it's just me, or if flang is broken for you too?
Also:
a) maybe someone knows how to install those components (apparently they come from openmp-extras, but I can't find where that lives).
b) could the sanity-check command build a hello-world program w/ flang?

@akesandgren
Contributor Author

The easyblock for Clang-AOMP (clang_aomp.py) does not contain a sanity check for flang.
It is currently not supported.

I'm currently updating the easyblock (which isn't merged yet) to build some missing parts.

@hattom
Contributor

hattom commented Jan 17, 2024

Thanks, then I must be getting confused again.
Sorry for the mixup.

@akesandgren
Contributor Author

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2958
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/250e82c8932a1f32495c16138d108d11 for a full test report.

@akesandgren
Contributor Author

Test report by @akesandgren
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2958
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/9ff8810e7c1541fbdeb627fa765394ac for a full test report.

@hattom
Contributor

hattom commented Jan 17, 2024

As discussed today at the CC, I was looking at the aomp easyblock (where there is a flang --help sanity check).

@boegel boegel changed the title from "{tools}[GCCcore/11.3.0] HIP v5.6.0 w/ amd" to "{tools}[GCCcore/11.3.0] HIP v5.6.0, Clang-AOMP v5.6.0, ROCM-CompilerSupport v5.6.0, rocm-cmake v5.6.0, rocminfo v5.6.0" on Feb 8, 2024
@boegel
Member

boegel commented Feb 8, 2024

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 2958"

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=18277 EB_ARGS="--include-easyblocks-from-pr 2958" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_18277 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12858

Test results coming soon (I hope)...

Details

- notification for comment with ID 1933746643 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Feb 8, 2024

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2958
FAILED
Build succeeded for 4 out of 6 (5 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/db6b9d754af81e1ba97233c8b2d1e95f for a full test report.

edit: doesn't work because no (AMD?) GPU is available:

Sanity check failed: sanity check command rocminfo --help exited with code 1 (output: ROCk module is NOT loaded, possibly no GPU devices

@boegel
Member

boegel commented Sep 24, 2024

Trying to build this on LUMI, hitting the following error:

/project/project_465000844/easybuild/lumi-c/software/binutils/2.38-GCCcore-11.3.0/bin/ld: /opt/rocm/lib/libhsa-runtime64.so.1.12.60003: undefined reference to `std::condition_variable::wait(std::unique_lock<std::mutex>&)@GLIBCXX_3.4.30'
collect2: error: ld returned 1 exit status
make[2]: *** [tools/clang/tools/amdgpu-arch/CMakeFiles/amdgpu-arch.dir/build.make:106: bin/amdgpu-arch] Error 1

@akesandgren @klust @gmarkomanolis Does this happen to ring a bell for you? Is it picking up a library available in the OS it shouldn't be, or is it supposed to pick up libhsa-runtime64.so from there?

Others have hit this too, see E3SM-Project/scorpio#512 (comment)

@akesandgren
Contributor Author

@akesandgren @klust @gmarkomanolis Does this happen to ring a bell for you? Is it picking up a library available in the OS it shouldn't be, or is it supposed to pick up libhsa-runtime64.so from there?

It is not supposed to pick it up from there, at least it wasn't my intention :-)
libhsa-runtime64 is in Clang-AOMP. Which part is it that fails here?

@boegel
Member

boegel commented Sep 24, 2024

@akesandgren @klust @gmarkomanolis Does this happen to ring a bell for you? Is it picking up a library available in the OS it shouldn't be, or is it supposed to pick up libhsa-runtime64.so from there?

It is not supposed to pick it up from there, at least it wasn't my intention :-) libhsa-runtime64 is in Clang-AOMP. Which part is it that fails here?

The linking error happens when building Clang-AOMP-5.6.0-GCCcore-11.3.0.eb.
There's no libhsa*so* (yet) in the build directory at that point.

@boegel
Member

boegel commented Sep 24, 2024

I found one configuration option that may be relevant:

//Build with dlopened libhsa
AMDGPU_ARCH_FORCE_DLOPEN_LIBHSA:BOOL=OFF

Perhaps we need to enable that to dance around this problem?

@boegel
Member

boegel commented Sep 24, 2024

Trying with this modification:

components = [
    ('llvm-project', 'rocm-%s' % _rocm_version, {
        'checksums': ['e922bd492b54d99e56ed88c81e2009ed6472059a180b10cc56ce1f9bd2d7b6ed'],
        'configopts': "-DAMDGPU_ARCH_FORCE_DLOPEN_LIBHSA=ON",
    }),

@gmarkomanolis

I searched a bit:

"I guess since GLIBCXX_3.4.32 supports std::condition_variable::wait, upgrading the library would be the best solution. From AMD side turning off LLVM multithreading enables using older libraries."

ROCm/ROCm#2084

@boegel
Member

boegel commented Sep 24, 2024

I searched a bit:

"I guess since GLIBCXX_3.4.32 supports std::condition_variable::wait, upgrading the library would be the best solution. From AMD side turning off LLVM multithreading enables using older libraries."

ROCm/ROCm#2084

Thanks @gmarkomanolis, that's really helpful!

That issue suggests that trying to build things with GCC 11.3.0 on top of a system-wide ROCm installation that was built with GCC 12.2.0 is something of a lost cause...

Question remains why /opt/rocm/lib/libhsa-runtime64.so.* is being picked up at all though, seems like we should try to avoid that?
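
The linker error above names the symbol version GLIBCXX_3.4.30. One way to see whether a given libstdc++ offers that version is to scan the library file for GLIBCXX_* strings, roughly what `strings libstdc++.so.6 | grep GLIBCXX` does. A minimal sketch under that assumption; the function names are made up, and a raw byte scan cannot tell provided versions from required ones, so it is only meaningful when pointed at libstdc++ itself:

```python
import re

# Version tags look like "GLIBCXX_3.4.30"; names such as
# "GLIBCXX_DEBUG_MESSAGE_LENGTH" are skipped because a digit must follow.
_GLIBCXX_RE = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(data):
    """Return the sorted list of GLIBCXX version tuples mentioned in raw library bytes."""
    versions = {
        tuple(int(part) for part in match.group(1).split(b"."))
        for match in _GLIBCXX_RE.finditer(data)
    }
    return sorted(versions)


def provides(libstdcxx_bytes, required="3.4.30"):
    """True if the GLIBCXX version `required` appears in the given library bytes."""
    wanted = tuple(int(part) for part in required.split("."))
    return wanted in glibcxx_versions(libstdcxx_bytes)
```

Typical use would be provides(open(path, "rb").read()), with path pointing at the libstdc++.so.6 of the GCCcore installation used for the build.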

@Thyre Thyre added the 2022a label Aug 18, 2025
@boegel
Member

boegel commented Oct 13, 2025

No longer relevant since 2022a toolchains are deprecated since foss/2025b was defined, see also https://docs.easybuild.io/policies/toolchains/, so closing...

@boegel boegel closed this Oct 13, 2025
@akesandgren akesandgren deleted the 20230707085921_new_pr_HIP560 branch October 14, 2025 05:22
8 participants