@bartoldeman
Contributor

avx512 support means having at least avx512f and avx512cd in cpuinfo
plus a bunch of others depending on whether it's Knights Landing or
Skylake, but not "avx512" by itself.

if 'avx1.0' in avail_cpu_features:
    avail_cpu_features.append('avx')
# avx512 availability is indicated via avx512f and other features starting with avx512
if 'avx512f' in avail_cpu_features:
Member

@bartoldeman Maybe this is a better idea?

if any(feat.startswith('avx512') for feat in avail_cpu_features):
    avail_cpu_features.append('avx512')

Contributor Author

No, avx512f is correct. I will clarify with a comment tomorrow.

Contributor Author

the reason is here:
https://github.com/FFTW/fftw3/blob/master/simd-support/simd-avx512.h#L47
The avx512 support in FFTW uses the common subset of Skylake X avx512 and Knights Landing avx512, which is avx512f (foundation, lots of instructions) and avx512cd (conflict detection, just three instructions), but of those two only avx512f is used.
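
A minimal sketch of the check described here, assuming the feature list comes from a `flags` line in `/proc/cpuinfo`-style text. The helper names `parse_cpu_flags` and `has_avx512` and the sample flags line are hypothetical; this is not the easyblock's actual code.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(':')
        if key.strip() == 'flags':
            return value.split()
    return []


def has_avx512(features):
    """FFTW's AVX-512 code only uses the foundation subset, so only avx512f is required."""
    return 'avx512f' in features


# Hypothetical, shortened flags line (a real one lists many more features)
sample = "processor : 0\nflags : fpu sse2 avx avx2 avx512f avx512dq avx512cd avx512bw avx512vl\n"

features = parse_cpu_flags(sample)
if has_avx512(features):
    features.append('avx512')
print('avx512' in features)
```

Compared with the `startswith('avx512')` suggestion, this variant would not report AVX-512 support for a feature list that contains some `avx512*` flag but lacks `avx512f`.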

Member

@bartoldeman OK, thanks for clarifying. Maybe we should include a comment in the code itself to clarify this?

@boegel added the bug fix label May 6, 2018
@boegel added this to the 3.6.1 milestone May 6, 2018
@boegel changed the title from "FFTW: check for avx512f for avx512 support in /proc/cpuinfo flags." to "fix detection of AVX512 support in FFTW easyblock" May 6, 2018
@boegel
Member

boegel commented May 7, 2018

@bartoldeman The proposed fix works as expected, i.e. --enable-avx512 is now indeed added as a configure option too, but a quick single-core FFTW benchmark makes me reluctant to actually merge this:

FFTW installation on Intel Xeon Gold 6140 without using --enable-avx512:

$ perf stat ./simple_example
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example':

      26465.270071      task-clock (msec)         #    0.997 CPUs utilized
               135      context-switches          #    0.005 K/sec
                 0      cpu-migrations            #    0.000 K/sec
            15,181      page-faults               #    0.574 K/sec
    56,280,323,033      cycles                    #    2.127 GHz
    40,212,375,087      instructions              #    0.71  insn per cycle
     1,843,480,268      branches                  #   69.657 M/sec
         3,710,660      branch-misses             #    0.20% of all branches

      26.543430038 seconds time elapsed

vs an FFTW installation with --enable-avx512:

$ perf stat ./simple_example
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example':

      46858.725575      task-clock (msec)         #    0.997 CPUs utilized
                66      context-switches          #    0.001 K/sec
                 2      cpu-migrations            #    0.000 K/sec
            15,959      page-faults               #    0.341 K/sec
   120,945,793,045      cycles                    #    2.581 GHz
    15,847,217,064      instructions              #    0.13  insn per cycle
       786,057,237      branches                  #   16.775 M/sec
           743,027      branch-misses             #    0.09% of all branches

      46.990926684 seconds time elapsed

The perf results seem to confirm that AVX-512 is well used (only 15B instructions vs 40B for the same workload), but 47s vs 26.5s, that's a lot slower... What gives, any idea?
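
To put numbers on that comparison, here is a small sketch that recomputes the ratios from the counters pasted above. The values are copied from the two `perf stat` runs; the dictionary layout and the `ipc` helper are only illustrative.

```python
# Counters copied from the two perf stat runs above (Xeon Gold 6140)
runs = {
    'avx2': {'cycles': 56_280_323_033, 'instructions': 40_212_375_087, 'elapsed_s': 26.543430038},
    'avx512': {'cycles': 120_945_793_045, 'instructions': 15_847_217_064, 'elapsed_s': 46.990926684},
}


def ipc(run):
    """Instructions retired per cycle."""
    return run['instructions'] / run['cycles']


# How much longer the AVX-512 build takes, and how many fewer instructions it executes
slowdown = runs['avx512']['elapsed_s'] / runs['avx2']['elapsed_s']
instr_ratio = runs['avx2']['instructions'] / runs['avx512']['instructions']
cycle_ratio = runs['avx512']['cycles'] / runs['avx2']['cycles']

for name, run in runs.items():
    print(f"{name}: {ipc(run):.2f} insn per cycle")
print(f"elapsed time ratio (avx512/avx2): {slowdown:.2f}")
print(f"instruction ratio (avx2/avx512):  {instr_ratio:.2f}")
print(f"cycle ratio (avx512/avx2):        {cycle_ratio:.2f}")
```

The AVX-512 build retires far fewer instructions, but each one costs many more cycles on average, which is where the time goes.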

Benchmark used is this:

curl -OL http://micro.stanford.edu/mediawiki/images/a/a9/Simple_example.tar
tar xfv Simple_example.tar
cd simple_example
sed -i'' 's/\(N[01] =\) [0-9]*/\1 16384/g' simple_example.c
gcc -O2 -march=native simple_example.c -lfftw3 -lm -o simple_example

@boegel
Member

boegel commented May 8, 2018

[Figure: instruction mix of the AVX2 and AVX-512 builds of FFTW]

So, I spent some time figuring out the instruction mix with the AVX2 build of FFTW compared to the AVX512 build, see attached figure (font is small, but there are a lot of different instructions :)).

In total, there were 40B instructions with AVX2, 15B with AVX-512, but the AVX-512 run is a lot slower...

Only looking at instructions that occur frequently enough:

  • ADD: 3.06B or ~8% (AVX2) vs 1.17B or ~8% (AVX-512)
  • CMP: 1.71B (~4%) vs 667M (~4%)
  • JNZ: 1.71B (~4%) vs 548M (~3.5%)
  • KMOVB: ~0 vs 1.22B (~8%)
  • LEA: 66M (~0%) vs 1.22B (~8%)
  • MOV: 2.73B (~7%) vs 1.39B (~9%)
  • VADDPD: 3.52B (~9%) vs 870M (~6%)
  • VFMADD132PD: 687M (~2%) vs 130M (~1%)
  • VFMADDSUB231PD: 251M vs (none)
  • VFMSUBADD132PD: 788M (~2%) vs 369M (~2.5%)
  • VFMSUBADD231PD: 738M (~2%) vs 266M (~2%)
  • VMOVAPD: 4.38B (~11%) vs 367M (~2.5%)
  • VMOVAPS: 3.43B (~8.5%) vs ~0
  • VMOVDDUP: (none) vs 411M (~3%)
  • VMOVSD: 4.29B (~11%) vs 1.07B (~7%)
  • VMOVUPD: (none) vs 545M (~3.5%)
  • VMULPD: 2.19B (~5.4%) vs 444M (~3%)
  • VPERMILPD: 3.95B (~10%) vs (none)
  • VSHUFPD: (none) vs 635M (~4%)
  • VSUBPD: 3.52B (~9%) vs 870M (~6%)
  • VUNPCKHPD: (none) vs 411M (~3%)
  • VXORPD: 738M (~2%) vs (none)

(accounted for: ~94.5% of AVX2 run; ~94% of AVX-512 run)

It seems like data movement is done very differently (see LEA, VMOV*)...
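
As a rough illustration of how the percentages in the list are derived, the sketch below divides a few of the data-movement counts by the total instruction count of each run. The counts are the rounded values from the list ("~0" and "(none)" entered as 0) and the totals are rounded from the earlier `perf stat` output, so the results are approximate.

```python
# Total instructions per run, in billions (rounded from the perf stat output)
TOTAL = {'avx2': 40.2, 'avx512': 15.8}

# Rounded counts in billions, taken from the list above
counts = {
    'KMOVB': {'avx2': 0.0, 'avx512': 1.22},
    'LEA': {'avx2': 0.066, 'avx512': 1.22},
    'VMOVAPD': {'avx2': 4.38, 'avx512': 0.367},
    'VMOVAPS': {'avx2': 3.43, 'avx512': 0.0},
    'VMOVSD': {'avx2': 4.29, 'avx512': 1.07},
    'VMOVUPD': {'avx2': 0.0, 'avx512': 0.545},
}


def share(mnemonic, build):
    """Percentage of all executed instructions that this mnemonic accounts for."""
    return 100.0 * counts[mnemonic][build] / TOTAL[build]


for mnemonic in sorted(counts):
    print(f"{mnemonic:8s} {share(mnemonic, 'avx2'):5.1f}%  {share(mnemonic, 'avx512'):5.1f}%")
```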

@bartoldeman
Contributor Author

Interesting... I am running the same test now to compare. Then we should open a ticket here right?
https://github.com/FFTW/fftw3/issues

@boegel
Member

boegel commented May 8, 2018

@bartoldeman If you can confirm the same pattern, yes...

@bartoldeman
Contributor Author

Same pattern, but will need to look at the clocks:

[oldeman@cdr1001 ~]$ perf stat ./simple_example_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx512':

      77045.742681      task-clock (msec)         #    0.995 CPUs utilized          
               188      context-switches          #    0.002 K/sec                  
                 6      cpu-migrations            #    0.000 K/sec                  
             26944      page-faults               #    0.350 K/sec                  
       76925956326      cycles                    #    0.998 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       16116749458      instructions              #    0.21  insns per cycle        
         844858149      branches                  #   10.966 M/sec                  
           1697188      branch-misses             #    0.20% of all branches        

      77.422852101 seconds time elapsed

[oldeman@cdr1001 ~]$ perf stat ./simple_example_avx2  
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx2':

      43940.120966      task-clock (msec)         #    0.990 CPUs utilized          
               223      context-switches          #    0.005 K/sec                  
                 5      cpu-migrations            #    0.000 K/sec                  
             13877      page-faults               #    0.316 K/sec                  
       43939266396      cycles                    #    1.000 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
       37161600195      instructions              #    0.85  insns per cycle        
        1440420671      branches                  #   32.781 M/sec                  
           4312889      branch-misses             #    0.30% of all branches        

      44.378332238 seconds time elapsed

@bartoldeman
Contributor Author

Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz

@boegel
Member

boegel commented May 8, 2018

I'm seeing the same problem with both gompi/2018a and intel/2018a.

Since this is a single-core benchmark, it can't be explained by (significantly) lower (turbo) clock speed when running AVX-512 workloads.

@bartoldeman
Contributor Author

On Niagara with Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz I get
avx2: 18.6s user time
avx512: 35.1s user time
this is also consistent.

@bartoldeman
Contributor Author

going by perf numbers the CPU freq was even higher for avx512 than for avx2 for you!
For me on Cedar it only reports 1.0GHz which is low (I need to ask the admins there if Turbo is disabled on that node though)
Niagara doesn't have perf installed yet (should be there tomorrow), but it's consistent.
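
The clock figure that `perf stat` prints next to `cycles` is just cycles divided by on-CPU time, so it can be recomputed from any of the pasted outputs. Below is a hedged sketch of a small parser for that; `parse_perf_stat` and `effective_ghz` are hypothetical helper names, and the regular expression only covers the output layout seen in this thread.

```python
import re

# Matches counter lines such as "  56,280,323,033      cycles    #  2.127 GHz"
COUNTER_LINE = re.compile(r'^\s*([\d,.]+)\s+([\w:-]+)')


def parse_perf_stat(text):
    """Extract counter values from perf stat output into a dict keyed by event name."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            value, event = match.groups()
            # Drop thousands separators and event modifiers such as ":u"
            counters[event.split(':')[0]] = float(value.replace(',', ''))
    return counters


def effective_ghz(counters):
    """Average clock in GHz: cycles divided by on-CPU time (task-clock is in msec)."""
    return counters['cycles'] / (counters['task-clock'] * 1e6)


# Lines copied from the cdr1001 AVX-512 run
sample = """
      77045.742681      task-clock (msec)         #    0.995 CPUs utilized
        76925956326      cycles                    #    0.998 GHz
    <not supported>      stalled-cycles-frontend
        16116749458      instructions              #    0.21  insns per cycle
"""

print(f"{effective_ghz(parse_perf_stat(sample)):.3f} GHz")
```

Lines with `<not supported>` in the value column are skipped because they do not start with a number.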

@bartoldeman
Contributor Author

OK, it seems that the cdr1001 node is behaving a bit oddly. I just tried on a different Cedar node:

[oldeman@cdr1132 ~]$ perf stat ./simple_example_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx512':

      33778.465236      task-clock (msec)         #    0.985 CPUs utilized          
               193      context-switches          #    0.006 K/sec                  
                 6      cpu-migrations            #    0.000 K/sec                  
            14,712      page-faults               #    0.436 K/sec                  
   120,165,603,855      cycles                    #    3.557 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    15,863,091,988      instructions              #    0.13  insns per cycle        
       800,490,575      branches                  #   23.698 M/sec                  
         1,155,973      branch-misses             #    0.14% of all branches        

      34.294724263 seconds time elapsed

[oldeman@cdr1132 ~]$ perf stat ./simple_example_avx2
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_avx2':

      21705.792261      task-clock (msec)         #    0.984 CPUs utilized          
               156      context-switches          #    0.007 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
            13,265      page-faults               #    0.611 K/sec                  
    79,172,281,876      cycles                    #    3.648 GHz                    
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    37,093,565,121      instructions              #    0.47  insns per cycle        
     1,428,803,107      branches                  #   65.826 M/sec                  
         3,999,329      branch-misses             #    0.28% of all branches        

      22.053044930 seconds time elapsed

this is much more consistent

@bartoldeman
Contributor Author

I'll open the FFTW issue. Figure it's dinner time in Belgium.

@damianam
Member

damianam commented May 9, 2018

Guys, did you try that on KNL? I am very surprised by those results. Maybe for some reason there are lots of stalls in the pipeline and the reorder buffers can't compensate for them? I don't know, it looks odd.

@boegel
Member

boegel commented May 9, 2018

@damianam Based on the feedback we got from the FFTW developers, I'm not too surprised... See FFTW/fftw3#143

@damianam
Member

damianam commented May 9, 2018

I am not sure what to take from that. MIC, a.k.a. first-generation Xeon Phi, a.k.a. KNC, doesn't have AVX512. Larrabee didn't have AVX512 either. They had a different instruction set with registers that were 512 bits wide, but it wasn't AVX512. The first processor to have AVX512 was KNL, followed by the Skylake Xeons (desktop Skylakes don't have AVX512, I think).

@boegel
Member

boegel commented May 9, 2018

My main takeaway is this statement: We have never tested FFTW on Skylake.

So I'm not too surprised by performance issues they were not aware of yet...

@damianam
Member

damianam commented May 9, 2018

My takeaway is a bit different: We have never tested our AVX512 implementation

@boegel changed the title from "fix detection of AVX512 support in FFTW easyblock" to "fix detection of AVX512 support in FFTW easyblock (DON'T MERGE!)" May 9, 2018
@boegel modified the milestones: 3.6.1, next release May 22, 2018
@boegel modified the milestones: 3.6.2, next release Jul 5, 2018
@boegel
Member

boegel commented Sep 15, 2018

@bartoldeman Is there any point in keeping this open?

@migueldiascosta
Member

migueldiascosta commented Sep 18, 2018

tested on KNL, avx512 is more than 2x slower than avx2:

$ perf stat ./simple_example_gompi_avx2
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_gompi_avx2':

      71518.233388      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3237      page-faults:u             #    0.045 K/sec                  
      102588721887      cycles:u                  #    1.434 GHz                      (50.00%)
       36636251856      instructions:u            #    0.36  insn per cycle           (75.00%)
        1327766226      branches:u                #   18.565 M/sec                    (75.00%)
          39760490      branch-misses:u           #    2.99% of all branches          (75.00%)

      71.520449978 seconds time elapsed

$ perf stat ./simple_example_gompi_avx512
power of original data is 24016998996695588.000000
power of transform is 24016998996695588.000000

 Performance counter stats for './simple_example_gompi_avx512':

     166277.523211      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3476      page-faults:u             #    0.021 K/sec                  
      217271335050      cycles:u                  #    1.307 GHz                      (50.00%)
       15635159680      instructions:u            #    0.07  insn per cycle           (75.00%)
         757935276      branches:u                #    4.558 M/sec                    (75.00%)
          12691849      branch-misses:u           #    1.67% of all branches          (75.00%)

     166.312421066 seconds time elapsed

$ perf stat ./simple_example_intel_avx2
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_intel_avx2':

      72619.106133      task-clock:u (msec)       #    0.999 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3278      page-faults:u             #    0.045 K/sec                  
      104035356929      cycles:u                  #    1.433 GHz                      (50.00%)
       38496169095      instructions:u            #    0.37  insn per cycle           (75.00%)
         886449175      branches:u                #   12.207 M/sec                    (75.00%)
          39897386      branch-misses:u           #    4.50% of all branches          (75.00%)

      72.657863387 seconds time elapsed

$ perf stat ./simple_example_intel_avx512
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_intel_avx512':

     163560.255948      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3527      page-faults:u             #    0.022 K/sec                  
      214017920447      cycles:u                  #    1.308 GHz                      (50.00%)
       18342793887      instructions:u            #    0.09  insn per cycle           (75.00%)
         321445116      branches:u                #    1.965 M/sec                    (75.00%)
          12544678      branch-misses:u           #    3.90% of all branches          (75.00%)

     163.597145987 seconds time elapsed

@bartoldeman
Contributor Author

We could keep it open until FFTW has better avx512 support, or close it for tidiness and reopen it later?
Might be worth testing with other block sizes too (figuring out what e.g. Gromacs uses, as for Gromacs FFTW traditionally outperforms Intel FFT), since in the FFTW report Matteo Frigo said a 16kx16k size is probably suboptimal to begin with?

BTW, I did a quick test with the same benchmark (simple_example) using Intel FFT a while ago and it is fast (much faster than FFTW for avx512 for sure)

@boegel modified the milestones: 3.7.0, 3.x Sep 18, 2018
@boegel
Member

boegel commented Sep 18, 2018

@bartoldeman Let's keep it open under the 3.x milestone then.

@migueldiascosta
Member

@bartoldeman indeed - on the same KNL, using MKL's FFTW wrappers,

$ MKL_ENABLE_INSTRUCTIONS=AVX2 perf stat ./simple_example_mkl 
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_mkl':

      27540.512108      task-clock:u (msec)       #    1.000 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              3647      page-faults:u             #    0.132 K/sec                  
       35772578670      cycles:u                  #    1.299 GHz                      (41.49%)
       20480451190      instructions:u            #    0.57  insn per cycle           (61.96%)
         474980056      branches:u                #   17.247 M/sec                    (61.98%)
           6685421      branch-misses:u           #    1.41% of all branches          (61.71%)

      27.544029721 seconds time elapsed

$ MKL_ENABLE_INSTRUCTIONS=AVX512 perf stat ./simple_example_mkl 
power of original data is 24016999034126336.000000
power of transform is 24016999034126336.000000

 Performance counter stats for './simple_example_mkl':

      16865.217988      task-clock:u (msec)       #    0.998 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
              2996      page-faults:u             #    0.178 K/sec                  
       22382377544      cycles:u                  #    1.327 GHz                      (50.01%)
       13173523762      instructions:u            #    0.59  insn per cycle           (75.02%)
         382859560      branches:u                #   22.701 M/sec                    (74.99%)
           4089768      branch-misses:u           #    1.07% of all branches          (75.00%)

      16.894555721 seconds time elapsed
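
For a side-by-side view of the KNL numbers, the following sketch ranks the elapsed times reported in this thread relative to the fastest run. The times are copied from the `perf stat` outputs (gompi builds for FFTW); the labels are only descriptive.

```python
# Elapsed wall-clock times in seconds on the same KNL, copied from the outputs in this thread
elapsed_s = {
    'FFTW (gompi), AVX2': 71.520449978,
    'FFTW (gompi), AVX-512': 166.312421066,
    'MKL wrappers, AVX2': 27.544029721,
    'MKL wrappers, AVX-512': 16.894555721,
}

fastest = min(elapsed_s.values())

# Print the runs from fastest to slowest with their slowdown relative to the fastest one
for name, seconds in sorted(elapsed_s.items(), key=lambda item: item[1]):
    print(f"{name:24s} {seconds:7.1f} s  {seconds / fastest:5.2f}x")
```

With MKL the AVX-512 run is the faster of the two, whereas with FFTW it is the slower one.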
