"__ieee128" is undefined on POWER with RedHat 8 and CUDA < 11 #11913

@branfosj


The error is:

  RuntimeError: Encountered unknown error while testing nvcc:
  /rds/bear-apps/devel/eb-sjb-up/EL8/EL8-p9/software/GCCcore/8.3.0/include/c++/8.3.0/type_traits(335): error: identifier "__ieee128" is undefined
  /usr/include/bits/floatn.h(79): error: identifier "__ieee128" is undefined
  /usr/include/bits/floatn.h(82): error: invalid argument to attribute "__mode__"

We've looked at this for TensorFlow 2.3.1 in easybuilders/easybuild-easyblocks#2251 and #11859 - the debugging there leads to easybuilders/easybuild-easyblocks#2251 (comment)

Edit: FTR this is a GLIBC 2.26 issue: https://forums.developer.nvidia.com/t/request-add-nvcc-compatibility-with-glibc-2-26/53306

However, this is more widespread than just TensorFlow. It affects any build where NVCC passes flags back to GCC on POWER with RedHat 8 and CUDA < 11. (I expect it will affect other OSes as well.)

In limited testing I've seen the error with the following (other software is likely affected too):

  • TensorFlow
  • PyTorch
  • CuPy
  • magma
  • torchvision

The solution is to get NVCC to pass the -mno-float128 flag to GCC (and also -std=c++11 if that is not already set). How complicated that is depends on where in each package's build the flag has to be added.
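As a rough illustration of the flag handling described above, here is a minimal Python sketch. The helper names `add_float128_workaround` and `nvcc_command` are hypothetical and not part of EasyBuild or any easyblock. It assumes the flag is forwarded to the host compiler with nvcc's `-Xcompiler` option and that `-std=c++11` is only wanted when no C++ standard has been chosen yet. Where these flags actually need to be injected differs per package.

```python
import shlex


def add_float128_workaround(nvcc_args):
    """Return a copy of nvcc_args with the workaround flags appended if missing.

    Hypothetical helper for illustration only.
    """
    args = list(nvcc_args)
    # -mno-float128 is a GCC option on POWER; nvcc forwards it to the host
    # compiler when it is passed via -Xcompiler.
    if not any("-mno-float128" in a for a in args):
        args += ["-Xcompiler", "-mno-float128"]
    # Only add a C++ standard if the caller has not already selected one.
    has_std = any(
        a.startswith(("-std=", "--std=")) or a in ("-std", "--std") for a in args
    )
    if not has_std:
        args.append("-std=c++11")
    return args


def nvcc_command(source, nvcc_args=()):
    """Build a shell-quoted nvcc compile command for a single source file."""
    parts = ["nvcc", *add_float128_workaround(nvcc_args), "-c", source]
    return " ".join(shlex.quote(p) for p in parts)
```

For example, `nvcc_command("test.cu")` yields `nvcc -Xcompiler -mno-float128 -std=c++11 -c test.cu`, while an existing `-std=c++14` in the arguments is left untouched.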

The alternative is to build against newer toolchains - 2020a and later, where CUDA 11 is used.
