explicitly SIMD muladd with duals #560


Conversation

KristofferC (Collaborator)

Not sure how often this is encountered in the wild.

This PR (2 vectorized + 1 scalar fmadd):

julia> code_native(muladd, Tuple{ntuple(_->ForwardDiff.Dual{Nothing, Float64, 4},3)...}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmovupd 8(%rsi), %ymm0
        vbroadcastsd    (%rdx), %ymm1
        vfmadd213pd     8(%rcx), %ymm1, %ymm0   ## ymm0 = (ymm1 * ymm0) + mem
        vbroadcastsd    (%rsi), %ymm2
        vfmadd231pd     8(%rdx), %ymm2, %ymm0   ## ymm0 = (ymm2 * mem) + ymm0
        vfmadd213sd     (%rcx), %xmm1, %xmm2    ## xmm2 = (xmm1 * xmm2) + mem
        vmovsd  %xmm2, (%rdi)
        vmovupd %ymm0, 8(%rdi)
        vzeroupper
        retq

Master (6 scalar fmadd):

julia> code_native(muladd, Tuple{ntuple(_->ForwardDiff.Dual{Nothing, Float64, 4},3)...}, debuginfo=:none)
        .section        __TEXT,__text,regular,pure_instructions
        movq    %rdi, %rax
        vmovsd  (%rdx), %xmm0                   ## xmm0 = mem[0],zero
        vmovsd  8(%rsi), %xmm1                  ## xmm1 = mem[0],zero
        vmovsd  16(%rsi), %xmm2                 ## xmm2 = mem[0],zero
        vfmadd213sd     8(%rcx), %xmm0, %xmm1   ## xmm1 = (xmm0 * xmm1) + mem
        vfmadd213sd     16(%rcx), %xmm0, %xmm2  ## xmm2 = (xmm0 * xmm2) + mem
        vmovsd  24(%rsi), %xmm3                 ## xmm3 = mem[0],zero
        vfmadd213sd     24(%rcx), %xmm0, %xmm3  ## xmm3 = (xmm0 * xmm3) + mem
        vunpcklpd       %xmm3, %xmm2, %xmm2     ## xmm2 = xmm2[0],xmm3[0]
        vbroadcastsd    (%rsi), %ymm3
        vmovsd  (%rcx), %xmm4                   ## xmm4 = mem[0],zero
        vunpcklpd       %xmm1, %xmm4, %xmm1     ## xmm1 = xmm4[0],xmm1[0]
        vinsertf128     $1, %xmm2, %ymm1, %ymm1
        vfmadd231pd     (%rdx), %ymm3, %ymm1    ## ymm1 = (ymm3 * mem) + ymm1
        vmovsd  32(%rsi), %xmm2                 ## xmm2 = mem[0],zero
        vfmadd213sd     32(%rcx), %xmm0, %xmm2  ## xmm2 = (xmm0 * xmm2) + mem
        vfmadd231sd     32(%rdx), %xmm3, %xmm2  ## xmm2 = (xmm3 * mem) + xmm2
        vmovupd %ymm1, (%rdi)
        vmovsd  %xmm2, 32(%rdi)
        vzeroupper
        retq
        nopw    %cs:(%rax,%rax)

I didn't do the case where one argument is a scalar because it seems the SLP vectorizer already handles that.
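
For illustration only, here is a minimal sketch (not this PR's code; Vec4, vfmuladd, tovec, and fromvec are made-up names) of how a packed fma over a 4-wide partials tuple can be forced through Core.VecElement and the llvm.fmuladd intrinsic:

# Minimal sketch, not the PR's implementation: route a 4-wide Float64 tuple
# through Core.VecElement and llvm.fmuladd to get a packed fma.
const Vec4 = NTuple{4,Core.VecElement{Float64}}

@inline vfmuladd(a::Vec4, b::Vec4, c::Vec4) = Base.llvmcall(
    ("declare <4 x double> @llvm.fmuladd.v4f64(<4 x double>, <4 x double>, <4 x double>)",
     """
     %r = call <4 x double> @llvm.fmuladd.v4f64(<4 x double> %0, <4 x double> %1, <4 x double> %2)
     ret <4 x double> %r
     """),
    Vec4, Tuple{Vec4,Vec4,Vec4}, a, b, c)

tovec(t::NTuple{4,Float64}) = map(Core.VecElement, t)
fromvec(v::Vec4) = map(x -> x.value, v)

fromvec(vfmuladd(tovec((1.0, 2.0, 3.0, 4.0)),
                 tovec((5.0, 6.0, 7.0, 8.0)),
                 tovec((9.0, 10.0, 11.0, 12.0))))  # (14.0, 22.0, 32.0, 44.0)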

@chriselrod (Contributor) left a comment

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

This would change the underlying memory layout:

julia> for i ∈ 1:12; @show i, sizeof(NTuple{i,Core.VecElement{Float64}}); end
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (1, 8)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (2, 16)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (3, 32)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (4, 32)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (5, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (6, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (7, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (8, 64)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (9, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (10, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (11, 128)
(i, sizeof(NTuple{i, Core.VecElement{Float64}})) = (12, 128)

vs the array case, but this may also lead to more efficient codegen.
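
A rough sketch of what such a wrapper could look like (SIMDPartials is a hypothetical name, not a proposed ForwardDiff type); getindex would still hand back plain numbers:

# Hypothetical layout sketch, not a ForwardDiff definition: store the partials
# as a tuple of Core.VecElement so they are already in vector form, while
# keeping scalar semantics at the getindex level.
struct SIMDPartials{N,V<:Union{Float32,Float64}}
    values::NTuple{N,Core.VecElement{V}}
end

SIMDPartials(t::NTuple{N,V}) where {N,V} = SIMDPartials{N,V}(map(Core.VecElement, t))

Base.length(::SIMDPartials{N}) where {N} = N
Base.@propagate_inbounds Base.getindex(p::SIMDPartials, i::Integer) = p.values[i].value

sizeof(SIMDPartials((1.0, 2.0, 3.0)))  # expected 32 rather than 24, matching the padding shown above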

I think it's also worth rethinking/experimenting with different default chunk sizes at this point.

@chriselrod (Contributor) commented Nov 20, 2021

Additionally, I'd consider some @fastmath flags to allow the elimination of 0.0 * ... terms; the compiler may know that xp or yp are all 0.0 (e.g., if they were created by promoting a Float64 into a Dual).
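
A sketch of the kind of thing this could enable (combine_partials is an illustrative helper, not ForwardDiff's actual code): marking the combine step with @fastmath gives LLVM license to fold away terms it can prove are multiplications by a constant 0.0.

# Illustrative helper, not ForwardDiff's code: combine partials as a*xp + b*yp,
# with @fastmath so that multiplications the compiler can prove are by 0.0
# (e.g. the zero partials of a Float64 promoted to a Dual) may be eliminated.
@inline function combine_partials(a::T, xp::NTuple{N,T}, b::T, yp::NTuple{N,T}) where {N,T<:Real}
    @fastmath ntuple(i -> a * xp[i] + b * yp[i], Val(N))
end

combine_partials(2.0, (1.0, 2.0), 0.0, (3.0, 4.0))  # (2.0, 4.0)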

Overall, I'm a big fan of these changes (thanks @YingboMa).
Next week I'll see what impact this has on Pumas' compile and run times. Depending on the problem, compile times are dominated by inference or by ForwardDiff, and heavy use of (nested) ForwardDiff is probably a major contributor to runtime as well.

@KristofferC (Collaborator, Author)

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

It's worth trying. My guess is that there is a lot of code that has dug into the internals of ForwardDiff and assumes the partials values are plain numbers, though.

@chriselrod (Contributor)

Might be worth having Partials of a SIMDType wrap an NTuple{N,Core.VecElement} to begin with, to avoid all the array <-> vector conversions?

It's worth trying. My guess is that there is a lot of code that has dug into the internals of ForwardDiff and assumes the partials values are plain numbers, though.

Hopefully most of that code is getindexing the partials rather than extracting the raw tuple from the partials first.
Admittedly, though, I think I've written some code like that myself in the past...
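
To make the distinction concrete (this assumes the current internal field name values, which is exactly the part a layout change would break):

using ForwardDiff: Dual, partials

d = Dual{Nothing}(1.0, 2.0, 3.0)

partials(d, 2)      # 3.0; indexing keeps working if the storage changes
partials(d).values  # (2.0, 3.0) today; code grabbing this raw tuple would
                    # instead see NTuple{N,Core.VecElement{Float64}}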

@YingboMa (Member) left a comment

You might need an N=0 specialization.
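
For instance (a hedged sketch; dual_muladd stands in for whatever method the PR actually overloads), the empty-partials case has nothing to vectorize and can simply defer to the scalar muladd:

using ForwardDiff: Dual, value

# Hedged sketch: with zero partials there is nothing to SIMD, so fall back to
# the scalar muladd on the values. dual_muladd is an illustrative name.
@inline dual_muladd(x::Dual{T,V,0}, y::Dual{T,V,0}, z::Dual{T,V,0}) where {T,V} =
    Dual{T}(muladd(value(x), value(y), value(z)))

dual_muladd(Dual{Nothing}(2.0), Dual{Nothing}(3.0), Dual{Nothing}(4.0))  # value 10.0, no partials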
