{tools}[GCCcore/11.3.0] HIP v5.6.0, Clang-AOMP v5.6.0, ROCM-CompilerSupport v5.6.0, rocm-cmake v5.6.0, rocminfo v5.6.0 #18277
…rocm-5.6.0_fix_bad_message.patch
|
Test report by @akesandgren |
    'dirs': [],
}
sanity_check_commands = [
    'rocminfo --help',
This fails for me:
$ rocminfo --help
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
branfosj is not member of "video" group, the default DRM access group. Users must be a member of the "video" group or another DRM access group in order for ROCm applications to run successfully.
Not sure if this means this package is unsuitable for an easyconfig or if we should skip this sanity check.
Just as for CUDA, the AMDGPU device needs proper permissions.
We have this:
cat /etc/udev/rules.d/71-amdgpu.rules
# Fix AMD GPU device permissions
#
# This file is managed by puppet, local changes will be overwritten.
#
SUBSYSTEM=="drm", KERNEL=="renderD*", GROUP="render", MODE="0666"
SUBSYSTEM=="kfd", GROUP="render", MODE="0666"
Without it, you would need to explicitly add all users that want to use it to the "video" group.
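The permission problem above can be checked before running the sanity check. The following is a minimal sketch, assuming the only requirement is that the current user can open /dev/kfd read-write (as the rocminfo error message suggests); the function name `device_is_usable` is hypothetical and not part of EasyBuild or ROCm.

```python
import os


def device_is_usable(path="/dev/kfd"):
    """Return True if `path` exists and the current user may open it read-write.

    Hypothetical helper for illustration; it only inspects file permissions
    and does not verify that a working AMD GPU is actually present.
    """
    return os.path.exists(path) and os.access(path, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if device_is_usable():
        print("/dev/kfd is accessible read-write for this user")
    else:
        print("/dev/kfd is missing or not read-write for this user")
```

On a node without the udev rule and without "video" group membership, this would be expected to report the device as not usable, which matches the rocminfo failure shown earlier.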
easybuild/easyconfigs/c/Clang-AOMP/Clang-AOMP-5.6.0-GCCcore-11.3.0.eb
    "-DCLR_BUILD_HIP=ON",
    "-DCLR_BUILD_OCL=OFF",
    "-DOFFLOAD_ARCH_STR='--offload-arch=%s'" % local_default_gfx,
]
This easyconfig also worked pretty well for me, except that here too the resulting libraries, such as libamdhip64.so, turned out to be linked against system libraries (like /lib64/libnuma.so.1, /lib64/libstdc++.so.6, ...). Do you see this as well? From what I could tell, this is due to the stripping of rpath information in the install step. Adding an extra "-DCMAKE_SKIP_RPATH=ON" option here prevents this from happening.
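One way to spot this kind of leak is to scan ldd output for libraries that resolve to system directories. The sketch below assumes the usual "name => path (address)" ldd line format; the function name, the list of system prefixes, and the sample output (including the /apps/software path) are illustrative assumptions, not taken from an actual build.

```python
import re

# Directories treated as "system" locations (an assumption; adjust per site).
SYSTEM_PREFIXES = ("/lib/", "/lib64/", "/usr/lib/", "/usr/lib64/")


def system_linked_libs(ldd_output, prefixes=SYSTEM_PREFIXES):
    """Return (soname, resolved_path) pairs that resolve to a system directory."""
    hits = []
    for line in ldd_output.splitlines():
        match = re.match(r"\s*(\S+)\s+=>\s+(\S+)", line)
        if match and match.group(2).startswith(prefixes):
            hits.append((match.group(1), match.group(2)))
    return hits


# Illustrative ldd output, not copied from a real run.
sample = """
    libnuma.so.1 => /lib64/libnuma.so.1 (0x00007f0000000000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f0000100000)
    libhsa-runtime64.so.1 => /apps/software/ROCR-Runtime/lib/libhsa-runtime64.so.1 (0x00007f0000200000)
"""

for soname, path in system_linked_libs(sample):
    print(f"{soname} resolves to system path {path}")
```

Note that glibc components such as libc.so.6 normally resolve to system paths in a standard EasyBuild setup, so those hits would need to be filtered out in practice.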
I'll take a look at both of these as soon as I find some time. I think there is something called Christmas coming up fairly soon...
I see no trace of this once the module is loaded.
Did you run the ldd without the module loaded?
CMAKE_SKIP_RPATH is only set to ON in the cmakemake easyblock if CMake < 3.5.0 and build_option('rpath') is set. And for this one CMake is 3.23...
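The condition described in this comment can be paraphrased as a small predicate. This is only a sketch of the rule as stated here (CMake older than 3.5.0 and the rpath build option enabled), not the actual easyblock code; `skip_rpath_forced` and `version_tuple` are hypothetical names.

```python
def version_tuple(version, width=3):
    """Turn '3.23' into (3, 23, 0) so versions of different lengths compare sanely."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def skip_rpath_forced(cmake_version, rpath_enabled):
    """Paraphrase of the rule in the comment above: old CMake AND rpath enabled."""
    return rpath_enabled and version_tuple(cmake_version) < (3, 5, 0)


# With CMake 3.23, the condition does not hold even when rpath is enabled.
print(skip_rpath_forced("3.23", True))
```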
I was running ldd after loading the module. After having a closer look I also managed to resolve it via the following patch (so instead of introducing -DCMAKE_SKIP_RPATH=ON in the easyconfig):
--- clr-rocm-5.6.0/hipamd/CMakeLists.txt.orig 2024-01-09 12:13:38.465256148 +0100
+++ clr-rocm-5.6.0/hipamd/CMakeLists.txt 2024-01-09 12:18:05.463862147 +0100
@@ -41,7 +41,7 @@
# required to add the right link to libhsa-runtime in install/lib path
# CMAKE_PREFIX_PATH is used as rpath to search for libs outside HIP
-set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}")
+set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}")
set(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
#############################
To me, this set(CMAKE_INSTALL_RPATH "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}") line looked strange because EasyBuild sets ${CMAKE_INSTALL_RPATH} equal to a list of semicolon-separated software directories. So in this case appending something like "/lib" to it and using that as CMAKE_INSTALL_RPATH does not seem quite right. It appears that this eventually caused the following kind of runtime path modifications in the install step:
...
-- Installing: <snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5.6.31061
-- Installing: <snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5
-- Set runtime path of "<snip>/2022a/software/HIP/5.6.0-GCCcore-11.3.0/lib64/libhiprtc-builtins.so.5.6.31061" to "/lib64"
...
which then led to the installed libraries linking to system libraries.
At the moment I don't really have an idea why it is not occurring on your side. I suppose that you don't see Set runtime path ... messages in the EasyBuild logfile? Could you perhaps share that logfile?
(disclaimer: my CMake knowledge is rather limited)
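To illustrate why appending a directory to a semicolon-separated list looks suspicious, the sketch below mimics CMake's quoted string expansion in Python. The function name and the /sw/... paths are hypothetical; the empty-prefix case reproduces the "/lib64" value seen in the install log, but whether an empty CMAKE_PREFIX_PATH is what actually happened in that build is an assumption.

```python
def expand_install_rpath(prefix_path, libdir="lib64"):
    """Mimic "${CMAKE_PREFIX_PATH}/${CMAKE_INSTALL_LIBDIR}" and split it as a CMake list.

    CMake stores lists as semicolon-separated strings, so the suffix ends up
    attached only to the last element of the list.
    """
    expanded = f"{prefix_path}/{libdir}"
    return expanded.split(";")


# A list of two (hypothetical) software directories: only the last one gets "/lib64".
print(expand_install_rpath("/sw/A;/sw/B"))

# An empty prefix path collapses to a bare system directory.
print(expand_install_rpath(""))
```

Either way, the resulting RPATH entries do not point at the library directories of all dependencies, which is consistent with the libraries falling back to system copies.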
I do indeed see the same problem when I look at the files correctly, and the "Set runtime path" message. I'll try the patch and will check again...
So, with that patch there are no longer any "Set runtime path" messages, and the RPATH info in at least libamdhip64.so is empty.
Now I'll try to rebuild it with --rpath set...
Seems to do good things with --rpath too.
|
One more thing (since you're touching upon rocm-smi and ROCm support): For ROCm 4.5.0 there is a ROCm-4.5.0-GCCcore-11.2.0 easyconfig which bundles HIP, Clang-AOMP, ROCR-Runtime, etcetera. In the present PR, more components got added to the So... is the idea to no longer provide |
|
I do have a ROCm EC for 5.6.0 prepared, the reason I haven't submitted is that I wasn't sure if anything was still missing from it. |
|
@MaximeVdB See #19591 |
|
I installed Clang-AOMP/5.6.0, which includes a sanity check command on flang in the easyblock. Can you check if it's just me, or if flang is broken for you too? |
|
The easyblock for Clang-AOMP (clang_aomp.py) does not contain a sanity check for flang. I'm currently updating the easyblock (which isn't merged yet) to build some missing parts. |
|
Thanks, then I must be getting confused again. |
|
Test report by @akesandgren |
|
Test report by @akesandgren |
|
As discussed today at the CC, I was looking at the |
|
@boegelbot please test @ generoso |
|
@boegel: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 1933746643 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot edit: doesn't work because no (AMD?) GPU is available: |
|
Trying to build this on LUMI, hitting the following error: @akesandgren @klust @gmarkomanolis Does this happen to ring a bell for you? |
Others have hit this too, see E3SM-Project/scorpio#512 (comment) |
It is not supposed to pick it up from there, at least it wasn't my intention :-) |
The linking error happens when building |
|
I found one configuration option that may be relevant: Perhaps we need to enable that to dance around this problem? |
|
Trying with this modification:
components = [
    ('llvm-project', 'rocm-%s' % _rocm_version, {
        'checksums': ['e922bd492b54d99e56ed88c81e2009ed6472059a180b10cc56ce1f9bd2d7b6ed'],
        'configopts': "-DAMDGPU_ARCH_FORCE_DLOPEN_LIBHSA=ON",
    }),
|
I searched a bit: "I guess since GLIBCXX_3.4.32 supports std::condition_variable::wait, upgrading the library would be the best solution. From AMD side turning off LLVM multithreading enables using older libraries." |
Thanks @gmarkomanolis, that's really helpful! That issue tells that trying to build stuff with GCC 11.3.0 on top of a system-wide ROCm installation where things were built with GCC 12.2.0 is sort of a lost cause... Question remains why |
|
No longer relevant since |
(created using eb --new-pr)
This is probably not enough for fully working ROCm support, but it's at least part of it.
Depends on:
$EBROOTGCC and $EBROOTGCCCORE to specify -DGCC_INSTALL_PREFIX easybuild-easyblocks#2958
rocm-smi is in: