Description
The error is:
RuntimeError: Encountered unknown error while testing nvcc:
/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-p9/software/GCCcore/8.3.0/include/c++/8.3.0/type_traits(335): error: identifier "__ieee128" is undefined
/usr/include/bits/floatn.h(79): error: identifier "__ieee128" is undefined
/usr/include/bits/floatn.h(82): error: invalid argument to attribute "__mode__"
We've looked at this for TensorFlow 2.3.1 in easybuilders/easybuild-easyblocks#2251 and #11859 - the debugging there leads to easybuilders/easybuild-easyblocks#2251 (comment)
Edit: FTR this is a GLIBC 2.26 issue: https://forums.developer.nvidia.com/t/request-add-nvcc-compatibility-with-glibc-2-26/53306
However, this is more widespread than just TensorFlow. It affects any build where NVCC passes flags back to GCC on POWER on RedHat 8 with CUDA < 11. (I expect it will impact other OSes as well.)
I have only done limited testing, so other software is likely affected too, but I have seen the error with:
- TensorFlow
- PyTorch
- CuPy
- magma
- torchvision
The solution is to get NVCC to pass the -mno-float128 flag to GCC (and also -std=c++11 if that is not already there). How complicated that is depends on where the flag has to be added in each package's build.
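As a rough illustration of that workaround, here is a minimal Python sketch that appends the flags to an nvcc command line. It relies on nvcc's `-Xcompiler` option, which forwards an argument to the host compiler. The helper name `add_float128_workaround` and the file names are hypothetical, and each package (TensorFlow, PyTorch, CuPy, ...) keeps its nvcc flags in a different place, so this only shows the shape of the change, not where to make it.

```python
import shlex


def add_float128_workaround(nvcc_args):
    """Return a copy of nvcc_args with the workaround flags added if missing.

    Hypothetical helper for illustration only; it is not part of EasyBuild.
    """
    args = list(nvcc_args)

    # -Xcompiler tells nvcc to forward the following argument to the host
    # compiler (GCC), which is where -mno-float128 has to end up.
    if not any("-mno-float128" in a for a in args):
        args += ["-Xcompiler", "-mno-float128"]

    # Only add a C++ standard if the caller has not already chosen one
    # (covers the -std=..., --std=... and "-std <value>" spellings).
    if not any(a.startswith(("-std", "--std")) for a in args):
        args.append("-std=c++11")

    return args


# Example with a hypothetical source file name.
cmd = add_float128_workaround(["nvcc", "-c", "kernel.cu", "-o", "kernel.o"])
print(" ".join(shlex.quote(a) for a in cmd))
```

The helper leaves an existing `-std=` choice alone and does not add the flags twice, so it is safe to apply to a command line that has already been patched.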
The alternative is to build against newer toolchains - 2020a and later, where CUDA 11 is used.