Merging CUDA and non CUDA toolchains into one #12484

@Micket


As has been discussed multiple times on Zoom and in chat, we'd like to merge fosscuda into foss, and just use versionsuffixes for the CUDA variants of the (relatively few) easyconfigs that do have CUDA bindings.

One thing holding us back has been the CUDA support in MPI: since MPI is part of the toolchain definition itself, it has prevented us from moving into foss. But now with UCX, it might be possible to get the best of both worlds (I'm going to just dismiss the legacy CUDA stuff in OpenMPI and focus on UCX).

We want something that

  1. works regardless of whether RPATH is used or not
  2. can be opted into after foss (without CUDA) is already in place
  3. supports all the RDMA goodies we have today in fosscuda.

Can it be done? Maybe; UCX has an environment variable for the directory it loads its plugins from (ucx_info -f lists all variables).
We could introduce a UCX package (UCX-CUDA maybe?) which shadows the non-CUDA UCX, and have it set that environment variable:

UCX_MODULE_DIR='%(installdir)s/lib/ucx'

and, well, that should be it?
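
To check which plugins a loaded environment would actually pick up, something like the helper below could be used. It's only a sketch: `cuda_plugins` is a hypothetical name (not part of UCX or EasyBuild), and it assumes the CUDA-enabled plugins are shared objects with `cuda` somewhere in their file name.

```python
import os


def cuda_plugins(environ=None):
    """Return the CUDA-looking plugin files found in $UCX_MODULE_DIR, sorted.

    Assumption: CUDA-enabled plugins are shared objects whose file name
    contains 'cuda'. Returns an empty list if the variable is unset or the
    directory does not exist.
    """
    environ = os.environ if environ is None else environ
    module_dir = environ.get('UCX_MODULE_DIR')
    if not module_dir or not os.path.isdir(module_dir):
        return []
    return sorted(
        name for name in os.listdir(module_dir)
        if 'cuda' in name.lower() and '.so' in name
    )
```

With only the plain UCX module loaded (variable unset), `cuda_plugins()` returns an empty list; after loading the proposed UCX-CUDA module, it should list whatever CUDA plugins ended up in its lib/ucx.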

Example UCX-CUDA easyconfig as I envision it:

easyblock = 'ConfigureMake'

name = 'UCX-CUDA'
version = '1.9.0'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion

homepage = 'http://www.openucx.org/'
description = """Unified Communication X
An open-source production grade communication framework for data centric
and high-performance applications
"""

toolchain = {'name': 'GCCcore', 'version': '10.2.0'}
toolchainopts = {'pic': True}

source_urls = ['https://github.com/openucx/ucx/releases/download/v%(version)s']
# %(namelower)s would expand to 'ucx-cuda' here, so spell out the upstream tarball name
sources = ['ucx-%(version)s.tar.gz']
checksums = ['a7a2c8841dc0d5444088a4373dc9b9cc68dbffcd917c1eba92ca8ed8e5e635fb']

builddependencies = [
    ('binutils', '2.35'),
    ('Autotools', '20200321'),
    ('pkg-config', '0.29.2'),
]

osdependencies = [OS_PKG_IBVERBS_DEV]

dependencies = [
    ('UCX', version),
    ('numactl', '2.0.13'),
    ('CUDAcore', local_cudaversion, '', True),
    ('GDRCopy', '2.1', versionsuffix),
]

configure_cmd = "contrib/configure-release"
configopts = '--enable-optimizations --enable-cma --enable-mt --with-verbs '
configopts += '--without-java --disable-doxygen-doc '
configopts += '--with-cuda=$EBROOTCUDACORE --with-gdrcopy=$EBROOTGDRCOPY '

prebuildopts = 'unset CUDA_CFLAGS && unset LIBS && '
buildopts = 'V=1'

# Not a PATH since we want to replace it, not append to it
modextravars = {
    'UCX_MODULE_DIR': '%(installdir)s/lib/ucx',
}

sanity_check_paths = {
    'files': ['bin/ucx_info', 'bin/ucx_perftest', 'bin/ucx_read_profile'],
    'dirs': ['include', 'lib', 'share']
}

sanity_check_commands = ["ucx_info -d"]

moduleclass = 'lib'

We'd then just let a TensorFlow-2.4.1-foss-2020b-CUDA-11.1.1.eb have a dependency on UCX-CUDA (at least indirectly), which is probably the ugliest thing about this approach.

This would remove the need for gcccuda, gompic, and fosscuda; the CUDA easyconfigs would all just use versionsuffixes instead (and optionally depend on UCX-CUDA if they have some MPI parts).
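
For illustration, the relevant part of such a TensorFlow easyconfig might look roughly like this. It's a fragment under assumptions, not a working easyconfig: the versions are simply carried over from the file name and the UCX-CUDA example above, and the real dependency list would be much longer.

```python
# Hypothetical fragment of TensorFlow-2.4.1-foss-2020b-CUDA-11.1.1.eb
name = 'TensorFlow'
version = '2.4.1'
local_cudaversion = '11.1.1'
versionsuffix = '-CUDA-%s' % local_cudaversion

# Plain foss, no fosscuda needed
toolchain = {'name': 'foss', 'version': '2020b'}

dependencies = [
    # CUDA comes in as an ordinary dependency instead of via the toolchain
    ('CUDAcore', local_cudaversion, '', True),
    # Shadows the non-CUDA UCX from the toolchain by setting UCX_MODULE_DIR
    ('UCX-CUDA', '1.9.0', versionsuffix),
    # ... remaining dependencies omitted in this sketch
]
```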

@bartoldeman Is this at all close to the approach you envisioned?
