As has been discussed multiple times on Zoom and in chat, we would like to merge fosscuda into foss and just use versionsuffixes for the CUDA variants of the (relatively few) easyconfigs that do have CUDA bindings.
One thing holding us back has been the CUDA support in MPI: since MPI is part of the toolchain definition itself, we can't simply move everything into foss. Now with UCX, it might be possible to get the best of both worlds (I'm going to just dismiss the legacy CUDA stuff in OpenMPI and focus on UCX).
We want something that
- works regardless of whether RPATH is used or not
- can be opted into after foss (without CUDA) is already in place
- supports all the RDMA goodies we have today in fosscuda.
Can it be done? Maybe; UCX has an environment variable for all the plugins it uses (ucx_info -f lists all variables):
We could introduce a UCX package (UCX-CUDA maybe?) which shadows the non-CUDA UCX, and start setting the environment variable
UCX_MODULE_DIR='%(installdir)s/lib/ucx'
and, well, that should be it?
Example UCX-CUDA easyconfig as I envision it:
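To check which plugin directory UCX would actually pick up once such a module is loaded, one could parse the output of `ucx_info -f`. The following is only a rough sketch: it assumes the output consists of `KEY=VALUE` lines with `#` comment lines, the helper names `parse_ucx_config` and `current_module_dir` are made up for illustration, and the sample text and its path are hypothetical, not taken from a real run.

```python
import subprocess


def parse_ucx_config(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments.

    Assumption: this matches the layout printed by `ucx_info -f`.
    """
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, sep, value = line.partition('=')
        if sep:
            config[key.strip()] = value.strip()
    return config


def current_module_dir():
    """Return the UCX_MODULE_DIR reported by ucx_info (needs ucx_info on PATH)."""
    result = subprocess.run(['ucx_info', '-f'], check=True,
                            capture_output=True, text=True)
    return parse_ucx_config(result.stdout).get('UCX_MODULE_DIR')


# Hypothetical sample output, for illustration only.
sample = """
#
# UCX global configuration
#
UCX_LOG_LEVEL=warn
UCX_MODULE_DIR=/apps/UCX-CUDA/1.9.0/lib/ucx
"""
cfg = parse_ucx_config(sample)
print(cfg['UCX_MODULE_DIR'])
```

With the UCX-CUDA module loaded, `current_module_dir()` would be expected to point into the UCX-CUDA installation rather than the plain UCX one, if the shadowing works as hoped.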
easyblock = 'ConfigureMake'
name = 'UCX-CUDA'
version = '1.9.0'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion
homepage = 'http://www.openucx.org/'
description = """Unified Communication X
An open-source production grade communication framework for data centric
and high-performance applications
"""
toolchain = {'name': 'GCCcore', 'version': '10.2.0'}
toolchainopts = {'pic': True}
source_urls = ['https://github.com/openucx/ucx/releases/download/v%(version)s']
sources = ['%(namelower)s-%(version)s.tar.gz']
checksums = ['a7a2c8841dc0d5444088a4373dc9b9cc68dbffcd917c1eba92ca8ed8e5e635fb']
builddependencies = [
    ('binutils', '2.35'),
    ('Autotools', '20200321'),
    ('pkg-config', '0.29.2'),
]
osdependencies = [OS_PKG_IBVERBS_DEV]
dependencies = [
    ('UCX', version),
    ('numactl', '2.0.13'),
    ('CUDAcore', local_cudaversion, '', True),
    ('GDRCopy', '2.1', versionsuffix),
]
configure_cmd = "contrib/configure-release"
configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs '
configopts += '--without-java --disable-doxygen-doc '
configopts += '--with-cuda=$EBROOTCUDACORE --with-gdrcopy=$EBROOTGDRCOPY '
prebuildopts = 'unset CUDA_CFLAGS && unset LIBS && '
buildopts = 'V=1'
# Not a PATH since we want to replace it, not append to it
modextravars = {
    'UCX_MODULE_DIR': '%(installdir)s/lib/ucx',
}
sanity_check_paths = {
    'files': ['bin/ucx_info', 'bin/ucx_perftest', 'bin/ucx_read_profile'],
    'dirs': ['include', 'lib', 'share']
}
sanity_check_commands = ["ucx_info -d"]
moduleclass = 'lib'
We'd then just let a TensorFlow-2.4.1-foss-2020b-CUDA-11.1.1.eb have a dependency on UCX-CUDA (at least indirectly), which is probably the ugliest thing about this approach.
This would remove the need for gcccuda, gompic and fosscuda; the easyconfigs that use them today would just use versionsuffixes instead (and optionally depend on UCX-CUDA if they have some MPI parts).
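As a rough illustration of the dependent side, a CUDA variant on top of plain foss might look something like the fragment below. This is a sketch only: the dependency list is deliberately incomplete, the direct dependency on UCX-CUDA is an assumption (it could equally come in indirectly), and the final `easyconfig_filename` line is not part of an easyconfig; it is only there to show how the name mentioned above is composed.

```python
name = 'TensorFlow'
version = '2.4.1'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion

# Plain foss, no fosscuda: the CUDA flavour lives in the versionsuffix only
toolchain = {'name': 'foss', 'version': '2020b'}

# Incomplete, hypothetical dependency list
dependencies = [
    ('CUDAcore', local_cudaversion, '', True),
    # Shadows the plugin directory of the non-CUDA UCX pulled in via foss
    ('UCX-CUDA', '1.9.0', versionsuffix),
]

# Illustration only, not an easyconfig parameter
easyconfig_filename = '%s-%s-%s-%s%s.eb' % (
    name, version, toolchain['name'], toolchain['version'], versionsuffix)
print(easyconfig_filename)
```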
@bartoldeman Is this at all close to the approach you envisioned?