Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

cparrott73 · 2023-07-25T18:39:38Z

I have a preliminary patch to a bunch of the SKYLAKEX (AVX512) kernels that enables them to be compiled with nvc, the NVIDIA HPC C compiler. Without this patch, OpenBLAS runs about 2x as slow on AVX512 systems when compiled with nvc versus gcc or clang.

Unfortunately, I do not know offhand the exact release of nvc in which the AVX512 intrinsics were enabled, so if you want to further refine this patch to guard a specific nvc version or later, I'll have to do some more investigation. This support has been in our compilers for a while, at least a year or two.

I can follow up with further info if needed.

nvc.patch

martin-frbg · 2023-07-25T18:57:46Z

Thanks. Based on an older OpenMPI ticket I found, ths might be from release 22.2 onwards - unfortunately the official release notes are too brief to allow me to verify this easily

martin-frbg · 2023-07-26T14:52:00Z

22.3 it is, 22.2 still stumbled. Will apply the updated patch later today.

cparrott73 · 2023-07-26T14:54:37Z

Terrific! Thank you!

FYI - I have found bugs in our current release when compiling the CASUM and ZASUM kernels for SKYLAKEX. I have an open bug report on this internally. As soon as I can root cause the issue, I will try to get our developers to fix it for the next release. These will show up as wrong answers in the test suite when run on a system that supports AVX512.

For our internal testing, I have temporarily disabled these kernels in the KERNEL.SKYLAKEX file.

martin-frbg · 2023-07-26T15:04:59Z

Oops - are they related to the microkernels that your patch enables (as an aside, I notice the casum one is not modified by it) ? And can they be worked around by reducing the optimization level with a suitable pragma ? Certainly do not want to produce wrong answers here, whether in the testsuite or in user code...

cparrott73 · 2023-07-26T15:09:42Z

I'm not sure yet. Unfortunately, I found this a few weeks ago, but I have been busy with other priorities and have not gotten back to looking at it again until this week. I'll run some experiments and let you know what I find.

Don't change anything on your end just yet until I have a better handle on this.

cparrott73 · 2023-07-26T15:10:43Z

Here is the symptom:

Test of subprogram number 7 CBLAS_DZASUM
FAIL

CASE N INCX INCY MODE I COMP(I) TRUE(I) DIFFERENCE SIZE(I)

7  1    1 9999 9999  1                      0.00000000D+00                      0.70000000D+00 -0.7000D+00  0.7000D+00
7  2    1 9999 9999  1                      0.00000000D+00                      0.10000000D+01 -0.1000D+01  0.1000D+01
7  3    1 9999 9999  1                      0.00000000D+00                      0.13000000D+01 -0.1300D+01  0.1300D+01
7  4    1 9999 9999  1                      0.00000000D+00                      0.17000000D+01 -0.1700D+01  0.1700D+01

cparrott73 · 2023-07-26T15:42:03Z

Just tried lowering the opt level to -O1 for these two kernels - did not help.

I'll dive in and see what's going on here.

martin-frbg · 2023-07-27T21:36:31Z

Building everything with -O0 appeared to work around the problem in my tests, however adding #pragma opt 0
to casum.c and zasum.c instead did not. (As an aside, the online docs for 23.5 only ever mention that some actions of the compiler can be influenced by pragmas, but do not specify which pragmas are recognized. I finally found the "#pragma opt 0" suggested in a forum reply (where it was not entirely clear if it worked as intended).
Of course there is also the possibility that it may be test cases that are miscompiled rather than the kernels (or I may have overlooked the failure messages in the global O0 build)

cparrott73 · 2023-07-28T16:12:18Z

@martin-frbg - thanks for the feedback on the docs, I'll pass that along to our docs person.

I've traced the problem to our optimizer miscompiling the casum_microk_skylake-2.c and zasum_microk_skylake-2.c files. We are not loading the correct values into the XMM registers in the case where n = 2. (And probably the others, but that is the case I'm focused on at the moment.)

I'm writing up my findings in my internal bug report on this issue. I'm hoping we can get this fixed for the 23.9 release, which is due out in September. But I'll let you know once I confirm that.

martin-frbg · 2023-07-28T16:34:14Z

Hmm, I'm currently playing with this again, and just disabling the casum microkernel did not appear to fix the problem for me. I'll see if I forgot to revert some of the previous changes..

martin-frbg · 2023-07-28T18:39:23Z

So the issue appears to be that nvc actually defines __GNUC__ so that the compiler version guards were already allowing the microkernels to be built even without the proposed patch. What is needed for now is just the opposite - disabling (at least) the casum and zasum microkernels when the compiler also identifies itself as __NVCOMPILER

cparrott73 · 2023-07-28T19:51:44Z

Oh interesting. Wondering if we recently removed this in our development branch? I'll have to do some further investigation.

Sorry for the trouble.

cparrott73 · 2023-07-28T20:02:46Z

So I think I understand what's going on here now.

nvc defines __GNUC__ to be the equivalent of the underlying GCC version found on the system.

I have a simple program that dumps out the definition of __GNUC__ when it was compiled:

#include <stdio.h>

int main(void)
{
  printf("__GNUC__ = %d\n", __GNUC__);
  return 0;
}

On an Ubuntu 18.04 system, gcc-7.4.0 is the system default. So this program dumps __GNUC__ == 7 when compiled with nvc:

cparrott@dev-sky5 ~ $ ./gnuc 
__GNUC__ = 7
cparrott@dev-sky5 ~ $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04)

However, our build system is CentOS 7, for maximum backward-compatibility. gcc-4.8.5 is the system default GCC on this distribution:

cparrott@hsw7 ~ $ ./gnuc
__GNUC__ = 4
cparrott@hsw7 ~ $ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)

However, even if I bring a newer GCC into the $PATH, nvc still refers to the system default version:

cparrott@hsw7 ~ $ module load gcc/13.1.0
cparrott@hsw7 ~ $ nvc -o gnuc gnuc.c
cparrott@hsw7 ~ $ ./gnuc
__GNUC__ = 4
cparrott@hsw7 ~ $ gcc -v
Reading specs from /home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64/lib/gcc/x86_64-pc-linux-gnu/13.1.0/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64/libexec/gcc/x86_64-pc-linux-gnu/13.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-13.1.0/configure --prefix=/home/sw/thirdparty/gcc/gcc-13.1.0/Linux_x86_64 --enable-languages=c,c++,fortran,lto,objc,obj-c++,d,ada,go,m2 --enable-shared --enable-threads=posix --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --disable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --disable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp --enable-cet=auto --host=x86_64-pc-linux-gnu --target=x86_64-pc-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 13.1.0 (GCC)

This explains why the kernels were not enabled on our build system.

So I'm thinking testing on __GNUC__ for version number may not be particularly useful for nvc here. Thoughts?

cparrott73 · 2023-07-28T20:03:20Z

Sorry for the bad formatting above - can't figure out how to make github make it look right here.

martin-frbg · 2023-07-28T20:14:20Z

You can also use nvc -dM -E with some random C source file like OpenBLAS' ctest.c to see which macros it defines. Indeed checking __GNUC__ only seems to make sense when one can be sure that the compiler actually is gcc, or at least tries to be bug-compatible with the version it masquerades as. The current ifdefs may be a bit too naive in this regard, but they date from the early days of AVX512 support where the important question was if the compiler would cope with these new instructions at all.
(use backticks to achieve somewhat sane formatting, or three backticks in a row for a block of verbatim text)

cparrott73 · 2023-07-28T20:16:28Z

Thanks - edited the above comment to make it more readable.

martin-frbg · 2023-07-28T20:17:36Z

Will have to check if e.g. the Intel compiler defines __GNUC__ as well (wouldn't surprise me now if it did...) and try to come up with a new ifdef that does not take ten lines to cater to every compiler under the sun...

martin-frbg · 2023-07-28T21:10:29Z

so current Intel defines __GNUC__ as well but sets it to 4, meaning they rather than you would need extra help to access the more performant code.... what a mess :(
EDIT: Intel additionally defines __clang__ so is in the clear.

martin-frbg · 2023-08-23T13:55:46Z

Interestingly, a similar miscompilation of the casum/zasum microkernels appears to be present in the current release candidate of LLVM17 as well.

bartoldeman · 2023-11-17T19:11:50Z

It seems NVHPC 23.09 still has this issue, the lapack testcase has a huge number of failures unless I replace 2309 by e.g. 2311 or some other higher number.

martin-frbg · 2023-11-18T10:39:28Z

Hmm that's bad. 23.11 just came out, I'll test ASAP

bartoldeman · 2023-11-18T18:48:37Z

I just tested 23.11 and it still has the exact same issue.

 Test of subprogram number  7         CBLAS_DZASUM   
                                       FAIL

 CASE  N INCX INCY MODE  I                             COMP(I)                             TRUE(I)  DIFFERENCE     SIZE(I)

    7  1    1 9999 9999  1                      0.00000000D+00                      0.70000000D+00 -0.7000D+00  0.7000D+00
    7  2    1 9999 9999  1                      0.00000000D+00                      0.10000000D+01 -0.1000D+01  0.1000D+01
    7  3    1 9999 9999  1                      0.00000000D+00                      0.13000000D+01 -0.1300D+01  0.1300D+01
    7  4    1 9999 9999  1                      0.00000000D+00                      0.17000000D+01 -0.1700D+01  0.1700D+01

martin-frbg · 2023-11-18T21:00:28Z

Me too but dinnertime so not allowed too push a PR right away :)

bartoldeman · 2023-11-19T21:34:58Z

In the end I think the compiler (and icx as well, also LLVM-based) simply optimized out the code since it was dodgy:
#4330

martin-frbg mentioned this issue Jul 26, 2023

Enable use of AVX512 microkernels with NVIDIA HPC from version 22.3 #4164

Merged

martin-frbg closed this as completed in #4164 Jul 29, 2023

This was referenced Aug 24, 2023

[flang][hlfir] No regressions on BLAS/LAPACK llvm/llvm-project#64750

Closed

Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM #4204

Merged

martin-frbg mentioned this issue Nov 18, 2023

Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC #4327

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

cparrott73 commented Jul 25, 2023

martin-frbg commented Jul 25, 2023

martin-frbg commented Jul 26, 2023

cparrott73 commented Jul 26, 2023 •

edited

Loading

martin-frbg commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

martin-frbg commented Jul 27, 2023

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

martin-frbg commented Jul 28, 2023 •

edited

Loading

cparrott73 commented Jul 28, 2023

cparrott73 commented Jul 28, 2023 •

edited

Loading

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

martin-frbg commented Jul 28, 2023 •

edited

Loading

martin-frbg commented Aug 23, 2023

bartoldeman commented Nov 17, 2023

martin-frbg commented Nov 18, 2023

bartoldeman commented Nov 18, 2023

martin-frbg commented Nov 18, 2023

bartoldeman commented Nov 19, 2023

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

Enable AVX512 kernels when compiling with nvc (NVIDIA HPC C compiler) #4162

Comments

cparrott73 commented Jul 25, 2023

martin-frbg commented Jul 25, 2023

martin-frbg commented Jul 26, 2023

cparrott73 commented Jul 26, 2023 • edited Loading

martin-frbg commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

cparrott73 commented Jul 26, 2023

martin-frbg commented Jul 27, 2023

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

martin-frbg commented Jul 28, 2023 • edited Loading

cparrott73 commented Jul 28, 2023

cparrott73 commented Jul 28, 2023 • edited Loading

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

cparrott73 commented Jul 28, 2023

martin-frbg commented Jul 28, 2023

martin-frbg commented Jul 28, 2023 • edited Loading

martin-frbg commented Aug 23, 2023

bartoldeman commented Nov 17, 2023

martin-frbg commented Nov 18, 2023

bartoldeman commented Nov 18, 2023

martin-frbg commented Nov 18, 2023

bartoldeman commented Nov 19, 2023

cparrott73 commented Jul 26, 2023 •

edited

Loading

martin-frbg commented Jul 28, 2023 •

edited

Loading

cparrott73 commented Jul 28, 2023 •

edited

Loading

martin-frbg commented Jul 28, 2023 •

edited

Loading