
Feature/CPU Detection for Apple M1 #40876

Closed
@chriselrod

Originally posted here.

The Apple M1 supports ARMv8.4-A, but Julia/LLVM treats it like an A7/Cyclone CPU:

julia> versioninfo()
Julia Version 1.7.0-DEV.1107
Commit 5aca7a37be* (2021-05-15 16:39 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin20.3.0)
  CPU: Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
  JULIA_NUM_THREADS = 4

Cyclone is ARMv8-A. The page on the A14, however, claims ARMv8.5-A for the Firestorm/Icestorm cores.
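
The two values that disagree above can also be read directly, which is convenient for checking the mismatch from a script. This is a minimal sketch, assuming that `Sys.cpu_info()` and `Sys.CPU_NAME` are the sources of the `CPU:` and `LLVM:` lines that `versioninfo()` prints; the helper name `report_cpu_target` is made up for illustration.

```julia
# Hypothetical helper (not part of Base): collect the hardware model string
# and the CPU name that LLVM is targeting.
function report_cpu_target()
    info = Sys.cpu_info()
    model = isempty(info) ? "unknown" : info[1].model
    return (model = model, llvm_cpu = Sys.CPU_NAME, arch = Sys.ARCH)
end

r = report_cpu_target()
println("CPU model:    ", r.model)
println("LLVM target:  ", r.llvm_cpu)
println("Architecture: ", r.arch)
```

On the machine from this issue, the first two lines would presumably read `Apple M1` and `cyclone`, matching the `versioninfo()` output above.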

As such, atomics are implemented using a load-linked/store-conditional (LL/SC) retry loop:

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        add     x9, x0, x1
        stlxr   w10, x9, [x8]
        cbnz    w10, L4
        ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        mov     x8, x0
L4:
        ldaxr   x0, [x8]
        cmp     x0, x1
        b.ne    L28
        stlxr   w9, x2, [x8]
        cbnz    w9, L4
        ret
L28:
        clrex
        ret
; └
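
The retry structure of these loops (load, compute, attempt the store, branch back on failure) can be mirrored at the Julia level with a compare-and-swap loop. This is only an illustrative sketch of the pattern, not how `Base.Threads` implements `atomic_add!`; `cas_loop_add!` is a hypothetical name.

```julia
using Base.Threads: Atomic, atomic_cas!

# Hypothetical helper: add `v` to `a` by retrying a compare-and-swap until
# no other thread has modified the value in between. Returns the old value,
# like `atomic_add!` does.
function cas_loop_add!(a::Atomic{T}, v::T) where {T}
    old = a[]                                # atomic load
    while true
        seen = atomic_cas!(a, old, old + v)  # returns the value found in `a`
        seen === old && return old           # swap succeeded
        old = seen                           # lost the race; retry
    end
end

a = Atomic{Int}(1)
prev = cas_loop_add!(a, 2)
println((prev, a[]))  # (1, 3)
```

Under contention, every failed attempt costs another trip around the loop, which is the overhead a single-instruction atomic avoids.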

However, if I start Julia with -C'armv8.4-a', each operation compiles to a single atomic instruction:

julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)

julia> @code_native Threads.atomic_add!(a, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
        ldaddal x1, x0, [x0]
        ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
        .section        __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
        casal   x1, x2, [x0]
        mov     x0, x1
        ret
; └
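
To check which form a given session produces without reading the assembly by eye, the output of `code_native` can be captured and searched. A minimal sketch: `native_asm` is a hypothetical helper, and the mnemonics it looks for (`ldadd`, `ldaxr`, `stlxr`) are taken from the listings above, so both flags are simply `false` on other architectures.

```julia
using InteractiveUtils: code_native
using Base.Threads: Atomic, atomic_add!

# Hypothetical helper: return the native code for `f(argtypes...)` as a String.
function native_asm(f, argtypes)
    io = IOBuffer()
    code_native(io, f, argtypes; debuginfo = :none)
    return String(take!(io))
end

asm = native_asm(atomic_add!, Tuple{Atomic{Int}, Int})

# Single-instruction atomic add (e.g. `ldaddal`) versus the LL/SC loop.
has_lse  = occursin("ldadd", asm)
has_llsc = occursin("ldaxr", asm) && occursin("stlxr", asm)

println("single-instruction atomics: ", has_lse)
println("LL/SC loop:                 ", has_llsc)
```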

Starting Julia without -C flags:

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000087 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.425 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.530 μs (0.00% GC)
  maximum time:     14.592 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

With -C'armv8.4-a':

julia> using Octavian

julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
  0.000100 seconds (2 allocations: 40.578 KiB)

julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.258 μs (0.00% GC)
  median time:      6.525 μs (0.00% GC)
  mean time:        6.532 μs (0.00% GC)
  maximum time:     13.475 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     5

I made non-x86 architectures (including the M1) ramp up thread use more slowly, because earlier performance tests suggested the M1 had higher threading overhead. That may have been partly due to atomics and partly due to the lack of a shared L3 cache, and there may of course be other causes I don't know about.
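
The matmul timings above have identical medians (6.525 μs) with and without the flag, so they say little about the atomics themselves. One way to isolate them would be a micro-benchmark in which every thread hammers a single shared counter. This is a rough sketch under that assumption; `contended_adds` is a hypothetical name, and `@elapsed` stands in for a proper `@benchmark` run.

```julia
using Base.Threads: Atomic, atomic_add!, nthreads, @threads

# Hypothetical micro-benchmark: each thread performs `n` atomic adds on the
# same counter, so the time is dominated by contended atomics.
function contended_adds(n)
    counter = Atomic{Int}(0)
    @threads for t in 1:nthreads()
        for _ in 1:n
            atomic_add!(counter, 1)
        end
    end
    return counter[]
end

contended_adds(10)  # warm up (compile) before timing
elapsed = @elapsed total = contended_adds(1_000_000)
println("threads = ", nthreads(), ", total = ", total, ", seconds = ", elapsed)
```

Running this once with default flags and once with -C'armv8.4-a' would compare the LL/SC loop against the single-instruction form directly.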

There is, of course, more than just atomics separating ARMv8.4-A/ARMv8.5-A from ARMv8-A.


Labels: system:apple silicon. Affects Apple Silicon only (Darwin/ARM64) - e.g. M1 and other M-series chips
