fix detection of AVX512 support in FFTW easyblock (DON'T MERGE!) #1416

bartoldeman · 2018-05-04T15:30:44Z

avx512 support means having at least avx512f and avx512cd in cpuinfo
plus a bunch of others depending on whether it's Knights Landing or
Skylake, but not "avx512" by itself.

boegel · 2018-05-06T08:32:57Z

easybuild/easyblocks/f/fftw.py

            if 'avx1.0' in avail_cpu_features:
                avail_cpu_features.append('avx')
+            # avx512 availability is indicated via avx512f and other features starting with avx512
+            if 'avx512f' in avail_cpu_features:


@bartoldeman Maybe this is a better idea?

    if any(feat.startswith('avx512') for feat in avail_cpu_features):
        avail_cpu_features.append('avx512')

No, avx512f is correct. I will clarify with a comment tomorrow.

the reason is here:
https://github.com/FFTW/fftw3/blob/master/simd-support/simd-avx512.h#L47
The avx512 support in FFTW uses the common subset of Skylake X avx512 and Knights Landing avx512, which is avx512f (foundation, lots of instructions) and avx512cd (conflict detection, just three instructions), but of those two only avx512f is used.

@bartoldeman OK, thanks for clarifying. Maybe we should include a comment in the code itself to clarify this?
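
For reference, the detection logic discussed in this review thread can be sketched as follows. This is a minimal standalone illustration, not the easyblock's actual code: the helper name `add_feature_aliases` and the sample flag lists are hypothetical; only `avx1.0`, `avx`, `avx512f` and the appended `avx512` alias are taken from the diff and comments above.

```python
def add_feature_aliases(cpu_features):
    """Return a copy of cpu_features with 'avx' and 'avx512' aliases added where implied.

    Hypothetical helper for illustration; not part of the FFTW easyblock.
    """
    features = list(cpu_features)
    # 'avx1.0' is treated as equivalent to 'avx' (as in the diff context above)
    if 'avx1.0' in features and 'avx' not in features:
        features.append('avx')
    # FFTW's AVX-512 code only relies on the foundation subset, so test 'avx512f' only
    if 'avx512f' in features and 'avx512' not in features:
        features.append('avx512')
    return features


# Illustrative flag lists (assumed, not copied from a real /proc/cpuinfo)
skylake_like = ['sse2', 'avx', 'avx2', 'avx512f', 'avx512cd', 'avx512bw']
haswell_like = ['sse2', 'avx', 'avx2']

print('avx512' in add_feature_aliases(skylake_like))  # True
print('avx512' in add_feature_aliases(haswell_like))  # False
```

Checking for `avx512f` rather than any flag starting with `avx512` keeps the test tied to the subset FFTW is said to use in the comment above.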

boegel · 2018-05-07T20:57:43Z

@bartoldeman The proposed fix works as expected, i.e. --enable-avx512 is now indeed added as a configure option too, but a quick single-core FFTW benchmark makes me reluctant to actually merge this:

FFTW installation on Intel Xeon Gold 6140 without using --enable-avx512:

$ perf stat ./simple_example
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example':

      26465.270071      task-clock (msec)         #    0.997 CPUs utilized
               135      context-switches          #    0.005 K/sec
                 0      cpu-migrations            #    0.000 K/sec
            15,181      page-faults               #    0.574 K/sec
    56,280,323,033      cycles                    #    2.127 GHz
    40,212,375,087      instructions              #    0.71  insn per cycle
     1,843,480,268      branches                  #   69.657 M/sec
         3,710,660      branch-misses             #    0.20% of all branches

      26.543430038 seconds time elapsed

vs an FFTW installation with --enable-avx512:

$ perf stat ./simple_example
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example':

      46858.725575      task-clock (msec)         #    0.997 CPUs utilized
                66      context-switches          #    0.001 K/sec
                 2      cpu-migrations            #    0.000 K/sec
            15,959      page-faults               #    0.341 K/sec
   120,945,793,045      cycles                    #    2.581 GHz
    15,847,217,064      instructions              #    0.13  insn per cycle
       786,057,237      branches                  #   16.775 M/sec
           743,027      branch-misses             #    0.09% of all branches

      46.990926684 seconds time elapsed

The perf results seem to confirm that AVX-512 is well used (only 15B instructions vs 40B for the same workload), but 47s vs 26.5s, that's a lot slower... What gives, any idea?
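
To make the comparison concrete, the ratios can be recomputed from the counters in the two `perf stat` outputs above. The snippet below is only a small arithmetic sketch; the dictionary layout and variable names are made up for the example, while the numbers are copied from the runs on the Xeon Gold 6140.

```python
# Counter values copied from the two perf stat runs above (Xeon Gold 6140)
runs = {
    'avx2': {'cycles': 56_280_323_033, 'instructions': 40_212_375_087,
             'elapsed_s': 26.543430038},
    'avx512': {'cycles': 120_945_793_045, 'instructions': 15_847_217_064,
               'elapsed_s': 46.990926684},
}


def ipc(run):
    """Instructions retired per cycle for one run."""
    return run['instructions'] / run['cycles']


# How many times fewer instructions the AVX-512 build executes
instr_ratio = runs['avx2']['instructions'] / runs['avx512']['instructions']
# How many times more cycles the AVX-512 build needs
cycle_ratio = runs['avx512']['cycles'] / runs['avx2']['cycles']
# Wall-clock slowdown of the AVX-512 build
slowdown = runs['avx512']['elapsed_s'] / runs['avx2']['elapsed_s']

for name, run in runs.items():
    print(f"{name}: {ipc(run):.2f} instructions per cycle")
print(f"instructions: {instr_ratio:.2f}x fewer with avx512")
print(f"cycles: {cycle_ratio:.2f}x more with avx512")
print(f"elapsed time: {slowdown:.2f}x longer with avx512")
```

In other words, the drop in instructions per cycle outweighs the reduction in instruction count, which is why the run with fewer instructions still takes longer.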

Benchmark used is this:

curl -OL http://micro.stanford.edu/mediawiki/images/a/a9/Simple_example.tar
tar xfv Simple_example.tar
cd simple_example
sed -i'' 's/\(N[01] =\) [0-9]*/\1 16384/g' simple_example.c
gcc -O2 -march=native simple_example.c -lfftw3 -lm -o simple_example

boegel · 2018-05-08T11:14:11Z

So, I spent some time figuring out the instruction mix with the AVX2 build of FFTW compared to the AVX512 build, see attached figure (font is small, but there are a lot of different instructions :)).

In total, there were 40B instructions with AVX2, 15B with AVX-512, but the AVX-512 run is a lot slower...

Only looking at instructions that occur frequently enough:

ADD: 3.06B or ~8% (AVX2) vs 1.17B or ~8% (AVX-512)
CMP: 1.71B (~4%) vs 667M (~4%)
JNZ: 1.71B (~4%) vs 548M (~3.5%)
KMOVB: ~0 vs 1.22B (~8%)
LEA: 66M (~0%) vs 1.22B (~8%)
MOV: 2.73B (~7%) vs 1.39B (~9%)
VADDPD: 3.52B (~9%) vs 870M (~6%)
VFMADD132PD: 687M (~2%) vs 130M (~1%)
VFMADDSUB231PD: 251M vs (none)
VFMSUBADD132PD: 788M (~2%) vs 369M (~2.5%)
VFMSUBADD231PD: 738M (~2%) vs 266M (~2%)
VMOVAPD: 4.38B (~11%) vs 367M (~2.5%)
VMOVAPS: 3.43B (~8.5%) vs ~0
VMOVDDUP: (none) vs 411M (~3%)
VMOVSD: 4.29B (~11%) vs 1.07B (~7%)
VMOVUPD: (none) vs 545M (~3.5%)
VMULPD: 2.19B (~5.4%) vs 444M (~3%)
VPERMILPD: 3.95B (~10%) vs (none)
VSHUFPD: (none) vs 635M (~4%)
VSUBPD: 3.52B (~9%) vs 870M (~6%)
VUNPCKHPD: (none) vs 411M (~3%)
VXORPD: 738M (~2%) vs (none)

(accounted for: ~94.5% of AVX2 run; ~94% of AVX-512 run)

It seems like data movement is done very differently (see LEA, VMOV*)...
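
As a rough check of that impression, the counts listed above can be grouped and expressed as shares of the total. This is a sketch under stated assumptions: the grouping into "VMOV*" versus "LEA + KMOVB" is a choice made for the example, entries given as "~0" or "(none)" are entered as 0.0, and the totals are the rounded 40B and 15B figures quoted above.

```python
# Counts in billions of instructions, copied from the list above:
# (AVX2 build, AVX-512 build); '~0' and '(none)' are entered as 0.0
mix = {
    'VMOVAPD': (4.38, 0.367),
    'VMOVAPS': (3.43, 0.0),
    'VMOVSD': (4.29, 1.07),
    'VMOVUPD': (0.0, 0.545),
    'VMOVDDUP': (0.0, 0.411),
    'LEA': (0.066, 1.22),
    'KMOVB': (0.0, 1.22),
}
totals = (40.0, 15.0)  # rounded totals quoted above, in billions


def share(names, column):
    """Percentage of all instructions in a build (0 = AVX2, 1 = AVX-512)."""
    return 100.0 * sum(mix[name][column] for name in names) / totals[column]


vmov = [name for name in mix if name.startswith('VMOV')]
other = ['LEA', 'KMOVB']

for label, column in (('AVX2', 0), ('AVX-512', 1)):
    print(f"{label}: VMOV* {share(vmov, column):.1f}%, "
          f"LEA + KMOVB {share(other, column):.1f}%")
```

With these inputs, the VMOV* share roughly halves in the AVX-512 build while LEA and KMOVB go from negligible to a comparable share of their own.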

bartoldeman · 2018-05-08T12:18:47Z

Interesting... I am running the same test now to compare. Then we should open a ticket here right?
https://github.com/FFTW/fftw3/issues

boegel · 2018-05-08T12:31:25Z

@bartoldeman If you can confirm the same pattern, yes...

bartoldeman · 2018-05-08T15:43:21Z

Same pattern, but will need to look at the clocks:

[oldeman@cdr1001 ~]$ perf stat ./simple_example_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx512':

      77045.742681      task-clock (msec)         #    0.995 CPUs utilized          
               188      context-switches          #    0.002 K/sec                  
                 6      cpu-migrations            #    0.000 K/sec                  
             26944      page-faults               #    0.350 K/sec                  
       76925956326      cycles                    #    0.998 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       16116749458      instructions              #    0.21  insns per cycle        
         844858149      branches                  #   10.966 M/sec                  
           1697188      branch-misses             #    0.20% of all branches        

      77.422852101 seconds time elapsed

[oldeman@cdr1001 ~]$ perf stat ./simple_example_avx2  
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx2':

      43940.120966      task-clock (msec)         #    0.990 CPUs utilized          
               223      context-switches          #    0.005 K/sec                  
                 5      cpu-migrations            #    0.000 K/sec                  
             13877      page-faults               #    0.316 K/sec                  
       43939266396      cycles                    #    1.000 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       37161600195      instructions              #    0.85  insns per cycle        
        1440420671      branches                  #   32.781 M/sec                  
           4312889      branch-misses             #    0.30% of all branches        

      44.378332238 seconds time elapsed

bartoldeman · 2018-05-08T15:43:59Z

Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

boegel · 2018-05-08T15:49:28Z

I'm seeing the same problem with both gompi/2018a and intel/2018a.

Since this is a single-core benchmark, it can't be explained by (significantly) lower (turbo) clock speed when running AVX-512 workloads.

bartoldeman · 2018-05-08T16:02:32Z

On Niagara with Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz I get
avx2: 18.6s user time
avx512: 35.1s user time
this is also consistent.

bartoldeman · 2018-05-08T16:06:36Z

going by perf numbers the CPU freq was even higher for avx512 than for avx2 for you!
For me on Cedar it only reports 1.0GHz which is low (I need to ask the admins there if Turbo is disabled on that node though)
Niagara doesn't have perf installed yet (should be there tomorrow), but it's consistent.
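
The frequencies referred to here are the GHz values that `perf stat` prints next to the cycle count, which can be reproduced by dividing cycles by task-clock time. A minimal sketch, with a made-up function name and run labels, using the counters reported earlier in this thread:

```python
def effective_ghz(cycles, task_clock_msec):
    """Average clock in GHz: cycles divided by on-CPU time in nanoseconds."""
    return cycles / (task_clock_msec * 1e6)


# (cycles, task-clock in msec), copied from the perf stat outputs above
runs = {
    'Gold 6140, avx2': (56_280_323_033, 26465.270071),
    'Gold 6140, avx512': (120_945_793_045, 46858.725575),
    'cdr1001, avx2': (43_939_266_396, 43940.120966),
    'cdr1001, avx512': (76_925_956_326, 77045.742681),
}

for label, (cycles, msec) in runs.items():
    print(f"{label}: {effective_ghz(cycles, msec):.3f} GHz")
```

On the Gold 6140 the AVX-512 run shows the higher average clock of the two, so a lower clock does not account for the slowdown there.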

bartoldeman · 2018-05-08T16:59:22Z

ok, it seems that the cdr1001 node is behaving a bit oddly. I just tried on a different Cedar node:

[oldeman@cdr1132 ~]$ perf stat ./simple_example_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx512':

      33778.465236      task-clock (msec)         #    0.985 CPUs utilized          
               193      context-switches          #    0.006 K/sec                  
                 6      cpu-migrations            #    0.000 K/sec                  
            14,712      page-faults               #    0.436 K/sec                  
   120,165,603,855      cycles                    #    3.557 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    15,863,091,988      instructions              #    0.13  insns per cycle        
       800,490,575      branches                  #   23.698 M/sec                  
         1,155,973      branch-misses             #    0.14% of all branches        

      34.294724263 seconds time elapsed

[oldeman@cdr1132 ~]$ perf stat ./simple_example_avx2
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx2':

      21705.792261      task-clock (msec)         #    0.984 CPUs utilized          
               156      context-switches          #    0.007 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
            13,265      page-faults               #    0.611 K/sec                  
    79,172,281,876      cycles                    #    3.648 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    37,093,565,121      instructions              #    0.47  insns per cycle        
     1,428,803,107      branches                  #   65.826 M/sec                  
         3,999,329      branch-misses             #    0.28% of all branches        

      22.053044930 seconds time elapsed

this is much more consistent

bartoldeman · 2018-05-08T17:04:25Z

I'll open the FFTW issue. Figure it's dinner time in Belgium.

damianam · 2018-05-09T08:10:59Z

Guys, did you try that on KNL? I am very surprised by those results. Maybe for some reason there are lots of stalls in the pipeline and the reorder buffers can't compensate for them? I don't know, it looks odd.

boegel · 2018-05-09T08:13:10Z

@damianam Based on the feedback we got from the FFTW developers, I'm not too surprised... See FFTW/fftw3#143

damianam · 2018-05-09T08:35:22Z

I am not sure what to take from that. I mean, MIC AKA Xeon Phi 1st Gen AKA KNC doesn't have AVX512... Larrabee didn't have AVX512 either... They had a different instruction set with registers that were 512 bits wide, but it wasn't AVX512. The first processor to have AVX512 was KNL, and then the Xeon Skylakes (desktop Skylakes don't have AVX512, I think).

boegel · 2018-05-09T08:40:34Z

My main takeaway is this statement: We have never tested FFTW on Skylake.

So I'm not too surprised by performance issues they were not aware of yet...

damianam · 2018-05-09T10:21:45Z

My takeaway is a bit different: We have never tested our AVX512 implementation

boegel · 2018-09-15T09:46:08Z

@bartoldeman Is there any point in keeping this open?

migueldiascosta · 2018-09-18T07:28:39Z

tested on KNL, avx512 is more than 2x slower than avx2:

$ perf stat ./simple_example_gompi_avx2
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_gompi_avx2':

      71518.233388      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3237      page-faults:u             #    0.045 K/sec                  
      102588721887      cycles:u                  #    1.434 GHz                      (50.00%)
       36636251856      instructions:u            #    0.36  insn per cycle           (75.00%)
        1327766226      branches:u                #   18.565 M/sec                    (75.00%)
          39760490      branch-misses:u           #    2.99% of all branches          (75.00%)

      71.520449978 seconds time elapsed

$ perf stat ./simple_example_gompi_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_gompi_avx512':

     166277.523211      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3476      page-faults:u             #    0.021 K/sec                  
      217271335050      cycles:u                  #    1.307 GHz                      (50.00%)
       15635159680      instructions:u            #    0.07  insn per cycle           (75.00%)
         757935276      branches:u                #    4.558 M/sec                    (75.00%)
          12691849      branch-misses:u           #    1.67% of all branches          (75.00%)

     166.312421066 seconds time elapsed

$ perf stat ./simple_example_intel_avx2
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_intel_avx2':

      72619.106133      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3278      page-faults:u             #    0.045 K/sec                  
      104035356929      cycles:u                  #    1.433 GHz                      (50.00%)
       38496169095      instructions:u            #    0.37  insn per cycle           (75.00%)
         886449175      branches:u                #   12.207 M/sec                    (75.00%)
          39897386      branch-misses:u           #    4.50% of all branches          (75.00%)

      72.657863387 seconds time elapsed

$ perf stat ./simple_example_intel_avx512
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_intel_avx512':

     163560.255948      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3527      page-faults:u             #    0.022 K/sec                  
      214017920447      cycles:u                  #    1.308 GHz                      (50.00%)
       18342793887      instructions:u            #    0.09  insn per cycle           (75.00%)
         321445116      branches:u                #    1.965 M/sec                    (75.00%)
          12544678      branch-misses:u           #    3.90% of all branches          (75.00%)

     163.597145987 seconds time elapsed

bartoldeman · 2018-09-18T12:32:04Z

we could keep it open until FFTW has better avx512 support, or close it and reopen for tidiness?
Might be worth testing with other block sizes too (figuring out what e.g. Gromacs uses, as for Gromacs FFTW traditionally outperforms Intel FFT), since in the FFTW report Matteo Frigo said a 16kx16k size is probably suboptimal to begin with?

BTW, I did a quick test with the same benchmark (simple_example) using Intel FFT a while ago and it is fast (much faster than FFTW for avx512 for sure)

boegel · 2018-09-18T12:51:35Z

@bartoldeman Let's keep it open under the 3.x milestone then.

migueldiascosta · 2018-09-18T13:18:59Z

@bartoldeman indeed - on the same KNL, using MKL's FFTW wrappers,

$ MKL_ENABLE_INSTRUCTIONS=AVX2 perf stat ./simple_example_mkl 
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_mkl':

      27540.512108      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3647      page-faults:u             #    0.132 K/sec                  
       35772578670      cycles:u                  #    1.299 GHz                      (41.49%)
       20480451190      instructions:u            #    0.57  insn per cycle           (61.96%)
         474980056      branches:u                #   17.247 M/sec                    (61.98%)
           6685421      branch-misses:u           #    1.41% of all branches          (61.71%)

      27.544029721 seconds time elapsed

$ MKL_ENABLE_INSTRUCTIONS=AVX512 perf stat ./simple_example_mkl 
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_mkl':

      16865.217988      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              2996      page-faults:u             #    0.178 K/sec                  
       22382377544      cycles:u                  #    1.327 GHz                      (50.01%)
       13173523762      instructions:u            #    0.59  insn per cycle           (75.02%)
         382859560      branches:u                #   22.701 M/sec                    (74.99%)
           4089768      branch-misses:u           #    1.07% of all branches          (75.00%)

      16.894555721 seconds time elapsed

FFTW: check for avx512f for avx512 support in /proc/cpuinfo flags.

0dfd85b

boegel requested changes May 6, 2018

boegel added the bug fix label May 6, 2018

boegel added this to the 3.6.1 milestone May 6, 2018

boegel changed the title ~~FFTW: check for avx512f for avx512 support in /proc/cpuinfo flags.~~ fix detection of AVX512 support in FFTW easyblock May 6, 2018

Clarify avx512f check via comment.

d0ae1b6

boegel approved these changes May 7, 2018

bartoldeman mentioned this pull request May 8, 2018

Performance issues for avx512 on Skylake X nodes. FFTW/fftw3#143

Open

boegel changed the title ~~fix detection of AVX512 support in FFTW easyblock~~ fix detection of AVX512 support in FFTW easyblock (DON'T MERGE!) May 9, 2018

boegel modified the milestones: 3.6.1, next release May 22, 2018

boegel modified the milestones: 3.6.2, next release Jul 5, 2018

boegel modified the milestones: 3.7.0, 3.x Sep 18, 2018

boegel modified the milestones: 3.x, 4.x Feb 20, 2020

boegel mentioned this pull request Mar 17, 2022

fftw.spec: fix typo & enable sse2/avx/avx2 openhpc/ohpc#1410

Merged
