Skip to content

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cparrott73 opened this issue Jul 25, 2023 · 24 comments · Fixed by #4164
Closed

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

cparrott73 opened this issue Jul 25, 2023 · 24 comments · Fixed by #4164

Comments

@cparrott73
Copy link

I have a preliminary patch to a bunch of the SKYLAKEX (AVX512) kernels that enables them to be compiled with nvc, the NVIDIA HPC C compiler. Without this patch, OpenBLAS runs about 2x as slow on AVX512 systems when compiled with nvc versus gcc or clang.

Unfortunately, I do not know offhand the exact release of nvc in which the AVX512 intrinsics were enabled, so if you want to further refine this patch to guard a specific nvc version or later, I'll have to do some more investigation. This support has been in our compilers for a while, at least a year or two.

I can follow up with further info if needed.

nvc.patch

@martin-frbg
Copy link
Collaborator

Thanks. Based on an older OpenMPI ticket I found, ths might be from release 22.2 onwards - unfortunately the official release notes are too brief to allow me to verify this easily

@martin-frbg
Copy link
Collaborator

22.3 it is, 22.2 still stumbled. Will apply the updated patch later today.

@cparrott73
Copy link
Author

cparrott73 commented Jul 26, 2023

Terrific! Thank you!

FYI - I have found bugs in our current release when compiling the CASUM and ZASUM kernels for SKYLAKEX. I have an open bug report on this internally. As soon as I can root cause the issue, I will try to get our developers to fix it for the next release. These will show up as wrong answers in the test suite when run on a system that supports AVX512.

For our internal testing, I have temporarily disabled these kernels in the KERNEL.SKYLAKEX file.

@martin-frbg
Copy link
Collaborator

Oops - are they related to the microkernels that your patch enables (as an aside, I notice the casum one is not modified by it) ? And can they be worked around by reducing the optimization level with a suitable pragma ? Certainly do not want to produce wrong answers here, whether in the testsuite or in user code...

@cparrott73
Copy link
Author

I'm not sure yet. Unfortunately, I found this a few weeks ago, but I have been busy with other priorities and have not gotten back to looking at it again until this week. I'll run some experiments and let you know what I find.

Don't change anything on your end just yet until I have a better handle on this.

@cparrott73
Copy link
Author

Here is the symptom:

Test of subprogram number 7 CBLAS_DZASUM
FAIL

CASE N INCX INCY MODE I COMP(I) TRUE(I) DIFFERENCE SIZE(I)

7  1    1 9999 9999  1                      0.00000000D+00                      0.70000000D+00 -0.7000D+00  0.7000D+00
7  2    1 9999 9999  1                      0.00000000D+00                      0.10000000D+01 -0.1000D+01  0.1000D+01
7  3    1 9999 9999  1                      0.00000000D+00                      0.13000000D+01 -0.1300D+01  0.1300D+01
7  4    1 9999 9999  1                      0.00000000D+00                      0.17000000D+01 -0.1700D+01  0.1700D+01

@cparrott73
Copy link
Author

Just tried lowering the opt level to -O1 for these two kernels - did not help.

I'll dive in and see what's going on here.

@martin-frbg
Copy link
Collaborator

Building everything with -O0 appeared to work around the problem in my tests, however adding #pragma opt 0
to casum.c and zasum.c instead did not. (As an aside, the online docs for 23.5 only ever mention that some actions of the compiler can be influenced by pragmas, but do not specify which pragmas are recognized. I finally found the "#pragma opt 0" suggested in a forum reply (where it was not entirely clear if it worked as intended).
Of course there is also the possibility that it may be test cases that are miscompiled rather than the kernels (or I may have overlooked the failure messages in the global O0 build)

@cparrott73
Copy link
Author

@martin-frbg - thanks for the feedback on the docs, I'll pass that along to our docs person.

I've traced the problem to our optimizer miscompiling the casum_microk_skylake-2.c and zasum_microk_skylake-2.c files. We are not loading the correct values into the XMM registers in the case where n = 2. (And probably the others, but that is the case I'm focused on at the moment.)

I'm writing up my findings in my internal bug report on this issue. I'm hoping we can get this fixed for the 23.9 release, which is due out in September. But I'll let you know once I confirm that.

@martin-frbg
Copy link
Collaborator

Hmm, I'm currently playing with this again, and just disabling the casum microkernel did not appear to fix the problem for me. I'll see if I forgot to revert some of the previous changes..

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 28, 2023

So the issue appears to be that nvc actually defines __GNUC__ so that the compiler version guards were already allowing the microkernels to be built even without the proposed patch. What is needed for now is just the opposite - disabling (at least) the casum and zasum microkernels when the compiler also identifies itself as __NVCOMPILER

@cparrott73
Copy link
Author

Oh interesting. Wondering if we recently removed this in our development branch? I'll have to do some further investigation.

Sorry for the trouble.

@cparrott73
Copy link
Author

cparrott73 commented Jul 28, 2023

So I think I understand what's going on here now.

nvc defines __GNUC__ to be the equivalent of the underlying GCC version found on the system.

I have a simple program that dumps out the definition of __GNUC__ when it was compiled:

#include <stdio.h>

int main(void)
{
  printf("__GNUC__ = %d\n", __GNUC__);
  return 0;
}

On an Ubuntu 18.04 system, gcc-7.4.0 is the system default. So this program dumps __GNUC__ == 7 when compiled with nvc:

cparrott@dev-sky5 ~ $ ./gnuc 
__GNUC__ = 7
cparrott@dev-sky5 ~ $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04) 

However, our build system is CentOS 7, for maximum backward-compatibility. gcc-4.8.5 is the system default GCC on this distribution:

cparrott@hsw7 ~ $ ./gnuc
__GNUC__ = 4
cparrott@hsw7 ~ $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) 

However, even if I bring a newer GCC into the $PATH, nvc still refers to the system default version:

cparrott@hsw7 ~ $ module load gcc/13.1.0
cparrott@hsw7 ~ $ nvc -o gnuc gnuc.c
cparrott@hsw7 ~ $ ./gnuc
__GNUC__ = 4
cparrott@hsw7 ~ $ gcc -v
Reading specs from /home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64/lib/gcc/x86_64-pc-linux-gnu/13.1.0/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64/libexec/gcc/x86_64-pc-linux-gnu/13.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-13.1.0/configure --prefix=/home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64 --enable-languages=c,c++,fortran,lto,objc,obj-c++,d,ada,go,m2 --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --disable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.1.0 (GCC) 

This explains why the kernels were not enabled on our build system.

So I'm thinking testing on __GNUC__ for version number may not be particularly useful for nvc here. Thoughts?

@cparrott73
Copy link
Author

Sorry for the bad formatting above - can't figure out how to make github make it look right here.

@martin-frbg
Copy link
Collaborator

You can also use nvc -dM -E with some random C source file like OpenBLAS' ctest.c to see which macros it defines. Indeed checking __GNUC__ only seems to make sense when one can be sure that the compiler actually is gcc, or at least tries to be bug-compatible with the version it masquerades as. The current ifdefs may be a bit too naive in this regard, but they date from the early days of AVX512 support where the important question was if the compiler would cope with these new instructions at all.
(use backticks to achieve somewhat sane formatting, or three backticks in a row for a block of verbatim text)

@cparrott73
Copy link
Author

Thanks - edited the above comment to make it more readable.

@martin-frbg
Copy link
Collaborator

Will have to check if e.g. the Intel compiler defines __GNUC__ as well (wouldn't surprise me now if it did...) and try to come up with a new ifdef that does not take ten lines to cater to every compiler under the sun...

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 28, 2023

so current Intel defines __GNUC__ as well but sets it to 4, meaning they rather than you would need extra help to access the more performant code.... what a mess :(
EDIT: Intel additionally defines __clang__ so is in the clear.

@martin-frbg
Copy link
Collaborator

Interestingly, a similar miscompilation of the casum/zasum microkernels appears to be present in the current release candidate of LLVM17 as well.

@bartoldeman
Copy link
Contributor

It seems NVHPC 23.09 still has this issue, the lapack testcase has a huge number of failures unless I replace 2309 by e.g. 2311 or some other higher number.

@martin-frbg
Copy link
Collaborator

Hmm that's bad. 23.11 just came out, I'll test ASAP

@bartoldeman
Copy link
Contributor

I just tested 23.11 and it still has the exact same issue.

 Test of subprogram number  7         CBLAS_DZASUM   
                                       FAIL

 CASE  N INCX INCY MODE  I                             COMP(I)                             TRUE(I)  DIFFERENCE     SIZE(I)

    7  1    1 9999 9999  1                      0.00000000D+00                      0.70000000D+00 -0.7000D+00  0.7000D+00
    7  2    1 9999 9999  1                      0.00000000D+00                      0.10000000D+01 -0.1000D+01  0.1000D+01
    7  3    1 9999 9999  1                      0.00000000D+00                      0.13000000D+01 -0.1300D+01  0.1300D+01
    7  4    1 9999 9999  1                      0.00000000D+00                      0.17000000D+01 -0.1700D+01  0.1700D+01

@martin-frbg
Copy link
Collaborator

Me too but dinnertime so not allowed too push a PR right away :)

@bartoldeman
Copy link
Contributor

In the end I think the compiler (and icx as well, also LLVM-based) simply optimized out the code since it was dodgy:
#4330

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants