-
Notifications
You must be signed in to change notification settings - Fork 308
fix detection of AVX512 support in FFTW easyblock (DON'T MERGE!) #1416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
bartoldeman
commented
May 4, 2018
avx512 support means having at least avx512f and avx512cd in cpuinfo
plus a bunch of others depending on whether it's Knights Landing or
Skylake, but not "avx512" by itself.
| if 'avx1.0' in avail_cpu_features: | ||
| avail_cpu_features.append('avx') | ||
| # avx512 availability is indicated via avx512f and other features starting with avx512 | ||
| if 'avx512f' in avail_cpu_features: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartoldeman Maybe this is a better idea?
if any(feat.startswith('avx512') for feat in avail_cpu_features):
avail_cpu_features.append('avx512')There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, avx512f is correct. I will clarify with a comment tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the reason is here:
https://github.com/FFTW/fftw3/blob/master/simd-support/simd-avx512.h#L47
The avx512 support in FFTW uses the common subset of Skylake X avx512 and Knights Landing avx512, which is avx512f (foundation, lots of instructions) and avx512cd (conflict detection, just three instructions), but of those two only avx512f is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bartoldeman OK, thanks for clarifying. Maybe we should include a comment in the code itself to clarify this?
|
@bartoldeman The proposed fix works as expected, i.e. FFTW installation on Intel Xeon Gold 6140 without using vs an FFTW installation with The perf results seem to confirm that AVX-512 is well used (only 15B instructions vs 40B for the same workload), but 47s vs 26.5s, that's a lot slower... What gives, any idea? Benchmark used is this: |
|
So, I spent some time figuring out the instruction mix with the AVX2 build of FFTW compared to the AVX512 build, see attached figure (font is small, but there are a lot of different instructions :)). In total, there were 40B instructions with AVX2, 15B with AVX-512, but the AVX-512 run is a lot slower... Only looking at instructions that are occur frequently enough:
(accounted for: ~94.5% of AVX2 run; ~94% of AVX-512 run) It seems like data movement is done very different (see |
|
Interesting... I am running the same test now to compare. Then we should open a ticket here right? |
|
@bartoldeman If you can confirm the same pattern, yes... |
|
Same pattern, but will need to look at the clocks: |
|
Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz |
|
I'm seeing the same problem with both Since this is a single-core benchmark, it can't be explained by (significantly) lower (turbo) clock speed when running AVX-512 workloads. |
|
On Niagara with Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz I get |
|
going by perf numbers the CPU freq was even higher for avx512 than for avx2 for you! |
|
ok, it seems that that cdr1001 node is a behaving a bit oddly. I just tried on a different Cedar node: this is much more consistent |
|
I'll open the FFTW issue. Figure it's dinner time in Belgium. |
|
Guys, did you try that on KNL? I am very surprised that those results. Maybe for some reason there are lots of stalls in the pipeline and the reorder buffers can't compensate for them? I don't know, it looks odd. |
|
@damianam Based on the feedback we got from the FFTW developers, I'm not too surprised... See FFTW/fftw3#143 |
|
I am not sure what to get from there. I mean, MIC AKA Xeon Phi 1st Gen AKA KNC, doesn't have AVX512....... Larrabee didn't have AVX512 either....... They had a different instruction set with register that were 512bits wide, but it wasn't AVX512. The first processor to have AVX512 was KNL and then Xeon Skylakes (desktop skylakes don't have AVX512 I think). |
|
My main takeback is this statement: So I'm not too surprised by performance issues they were not aware of yet... |
|
My take away is a bit different: |
|
@bartoldeman Is there any point in keeping this open? |
|
tested on KNL, avx512 is more than 2x slower than avx2: |
|
we could keep it open until FFTW has better avx512 support, or close it and reopen for tidiness? BTW, I did a quick test with the same benchmark (simple_example) using Intel FFT a while ago and it is fast (much faster than FFTW for avx512 for sure) |
|
@bartoldeman Let's keep it open under the |
|
@bartoldeman indeed - on the same KNL, using MKL's FFTW wrappers, |
