explicitly SIMD muladd with duals #560


Conversation

KristofferC (Collaborator)

Not sure how often this is encountered in the wild.

This PR (2 vectorized + 1 scalar fmadd):

julia> code_native(muladd, Tuple{ntuple(_->ForwardDiff.Dual{Nothing, Float64, 4},3)...}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmovupd 8(%rsi), %ymm0
        vbroadcastsd    (%rdx), %ymm1
        vfmadd213pd     8(%rcx), %ymm1, %ymm0   ## ymm0 = (ymm1 * ymm0) + mem
        vbroadcastsd    (%rsi), %ymm2
        vfmadd231pd     8(%rdx), %ymm2, %ymm0   ## ymm0 = (ymm2 * mem) + ymm0
        vfmadd213sd     (%rcx), %xmm1, %xmm2    ## xmm2 = (xmm1 * xmm2) + mem
        vmovsd  %xmm2, (%rdi)
        vmovupd %ymm0, 8(%rdi)
        vzeroupper
        retq

Master (6 scalar fmadd):

julia> code_native(muladd, Tuple{ntuple(_->ForwardDiff.Dual{Nothing, Float64, 4},3)...}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmovsd  (%rdx), %xmm0                   ## xmm0 = mem[0],zero
        vmovsd  8(%rsi), %xmm1                  ## xmm1 = mem[0],zero
        vmovsd  16(%rsi), %xmm2                 ## xmm2 = mem[0],zero
        vfmadd213sd     8(%rcx), %xmm0, %xmm1   ## xmm1 = (xmm0 * xmm1) + mem
        vfmadd213sd     16(%rcx), %xmm0, %xmm2  ## xmm2 = (xmm0 * xmm2) + mem
        vmovsd  24(%rsi), %xmm3                 ## xmm3 = mem[0],zero
        vfmadd213sd     24(%rcx), %xmm0, %xmm3  ## xmm3 = (xmm0 * xmm3) + mem
        vunpcklpd       %xmm3, %xmm2, %xmm2     ## xmm2 = xmm2[0],xmm3[0]
        vbroadcastsd    (%rsi), %ymm3
        vmovsd  (%rcx), %xmm4                   ## xmm4 = mem[0],zero
        vunpcklpd       %xmm1, %xmm4, %xmm1     ## xmm1 = xmm4[0],xmm1[0]
        vinsertf128     $1, %xmm2, %ymm1, %ymm1
        vfmadd231pd     (%rdx), %ymm3, %ymm1    ## ymm1 = (ymm3 * mem) + ymm1
        vmovsd  32(%rsi), %xmm2                 ## xmm2 = mem[0],zero
        vfmadd213sd     32(%rcx), %xmm0, %xmm2  ## xmm2 = (xmm0 * xmm2) + mem
        vfmadd231sd     32(%rdx), %xmm3, %xmm2  ## xmm2 = (xmm3 * mem) + xmm2
        vmovupd %ymm1, (%rdi)
        vmovsd  %xmm2, 32(%rdi)
        vzeroupper
        retq
        nopw    %cs:(%rax,%rax)

I didn't do the case where one argument is a scalar because it seems the SLP vectorizer already handles that.
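
For illustration only, here is a minimal sketch (not this PR's code; Vec4, vfmuladd, tovec, and fromvec are made-up names) of how a packed fma over a 4-wide partials tuple can be forced through Core.VecElement and the llvm.fmuladd intrinsic:

# Minimal sketch, not the PR's implementation: route a 4-wide Float64 tuple
# through Core.VecElement and llvm.fmuladd to get a packed fma.
const Vec4 = NTuple{4,Core.VecElement{Float64}}

@inline vfmuladd(a::Vec4, b::Vec4, c::Vec4) = Base.llvmcall(
    ("declare <4 x double> @llvm.fmuladd.v4f64(<4 x double>, <4 x double>, <4 x double>)",
     """
     %r = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %0, <4 x double> %1, <4 x double> %2)
     ret <4 x double> %r
     """),
    Vec4, Tuple{Vec4,Vec4,Vec4}, a, b, c)

tovec(t::NTuple{4,Float64}) = map(Core.VecElement, t)
fromvec(v::Vec4) = map(x -> x.value, v)

fromvec(vfmuladd(tovec((1.0, 2.0, 3.0, 4.0)),
                 tovec((5.0, 6.0, 7.0, 8.0)),
                 tovec((9.0, 10.0, 11.0, 12.0))))  # (14.0, 22.0, 32.0, 44.0)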

@chriselrod (Contributor) left a comment

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

This would change the underlying memory layout:

julia> for i ∈ 1:12; @show i, sizeof(NTuple{i,Core.VecElement{Float64}}); end
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (1, 8)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (2, 16)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (3, 32)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (4, 32)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (5, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (6, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (7, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (8, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (9, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (10, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (11, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (12, 128)

vs the array case, but this may also lead to more efficient codegen.
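
A rough sketch of what such a wrapper could look like (SIMDPartials is a hypothetical name, not a proposed ForwardDiff type); getindex would still hand back plain numbers:

# Hypothetical layout sketch, not a ForwardDiff definition: store the partials
# as a tuple of Core.VecElement so they are already in vector form, while
# keeping scalar semantics at the getindex level.
struct SIMDPartials{N,V<:Union{Float32,Float64}}
    values::NTuple{N,Core.VecElement{V}}
end

SIMDPartials(t::NTuple{N,V}) where {N,V} = SIMDPartials{N,V}(map(Core.VecElement, t))

Base.length(::SIMDPartials{N}) where {N} = N
Base.@propagate_inbounds Base.getindex(p::SIMDPartials, i::Integer) = p.values[i].value

sizeof(SIMDPartials((1.0, 2.0, 3.0)))  # expected 32 rather than 24, matching the padding shown above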

I think it's also worth rethinking/experimenting with different default chunk sizes at this point.

@chriselrod (Contributor) commented Nov 20, 2021

Additionally, I'd consider some @fastmath flags to allow the elimination of 0.0 * ... terms; the compiler may know that xp or yp are all 0.0 (e.g., if they were created by promoting a Float64 into a Dual).
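
A sketch of the kind of thing this could enable (combine_partials is an illustrative helper, not ForwardDiff's actual code): marking the combine step with @fastmath gives LLVM license to fold away terms it can prove are multiplications by a constant 0.0.

# Illustrative helper, not ForwardDiff's code: combine partials as a*xp + b*yp,
# with @fastmath so that multiplications the compiler can prove are by 0.0
# (e.g. the zero partials of a Float64 promoted to a Dual) may be eliminated.
@inline function combine_partials(a::T, xp::NTuple{N,T}, b::T, yp::NTuple{N,T}) where {N,T<:Real}
    @fastmath ntuple(i -> a * xp[i] + b * yp[i], Val(N))
end

combine_partials(2.0, (1.0, 2.0), 0.0, (3.0, 4.0))  # (2.0, 4.0)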

Overall, I'm a big fan of these changes (thanks @YingboMa).
Next week I'll see what impact this has on Pumas' compile and run times. Depending on the problem, compile times are dominated by inference or by ForwardDiff, and heavy use of (nested) ForwardDiff is probably a major contributor to runtime as well.

@KristofferC (Collaborator, Author)

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

It's worth trying. My guess is that there is a lot of code that has dug into the internals of ForwardDiff and assumes the partials values are plain numbers, though.

@chriselrod (Contributor)

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

It's worth trying. My guess is that there is a lot of code that has dug into the internals of ForwardDiff and assumes the partials values are plain numbers, though.

Hopefully most of that code is getindexing the partials rather than extracting the raw tuple from the partials first.
Admittedly, though, I think I've written some code like that myself in the past...
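
To make the distinction concrete (this assumes the current internal field name values, which is exactly the part a layout change would break):

using ForwardDiff: Dual, partials

d = Dual{Nothing}(1.0, 2.0, 3.0)

partials(d, 2)      # 3.0; indexing keeps working if the storage changes
partials(d).values  # (2.0, 3.0) today; code grabbing this raw tuple would
                    # instead see NTuple{N,Core.VecElement{Float64}}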

@YingboMa (Member) left a comment

You might need an N=0 specialization.
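
For instance (a hedged sketch; dual_muladd stands in for whatever method the PR actually overloads), the empty-partials case has nothing to vectorize and can simply defer to the scalar muladd:

using ForwardDiff: Dual, value

# Hedged sketch: with zero partials there is nothing to SIMD, so fall back to
# the scalar muladd on the values. dual_muladd is an illustrative name.
@inline dual_muladd(x::Dual{T,V,0}, y::Dual{T,V,0}, z::Dual{T,V,0}) where {T,V} =
    Dual{T}(muladd(value(x), value(y), value(z)))

dual_muladd(Dual{Nothing}(2.0), Dual{Nothing}(3.0), Dual{Nothing}(4.0))  # value 10.0, no partials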
