Description
Originally posted here.
The Apple M1 supports ARMv8.4-A, but Julia/LLVM treats it like an A7/Cyclone CPU:
julia> versioninfo()
Julia Version 1.7.0-DEV.1107
Commit 5aca7a37be* (2021-05-15 16:39 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.3.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
Cyclone is only ARMv8-a, although the page on the A14 claims the Firestorm/Icestorm cores implement ARMv8.5-a.
As such, atomics are implemented with a load-linked/store-conditional loop rather than the single-instruction LSE atomics introduced in ARMv8.1-a:
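For reference, the CPU name Julia/LLVM selected for the host can be checked directly (a quick sketch; per the `versioninfo()` output above, an M1 currently reports `cyclone`):

```julia
# Print the target CPU name Julia/LLVM detected for this host.
# On an Apple M1 with current Julia this is "cyclone", matching the
# "(ORCJIT, cyclone)" line in versioninfo().
println(Sys.CPU_NAME)
```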
julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)
julia> @code_native Threads.atomic_add!(a, 2)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
mov x8, x0
L4:
ldaxr x0, [x8]
add x9, x0, x1
stlxr w10, x9, [x8]
cbnz w10, L4
ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
mov x8, x0
L4:
ldaxr x0, [x8]
cmp x0, x1
b.ne L28
stlxr w9, x2, [x8]
cbnz w9, L4
ret
L28:
clrex
ret
; └
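The operations being lowered here are Julia's usual fetch-and-add and compare-and-swap; a minimal sketch of their semantics, which runs the same on any target regardless of whether LL/SC loops or LSE instructions are emitted:

```julia
using Base.Threads

a = Atomic{Int}(1)

# atomic_add! returns the *previous* value and stores old + 2.
old = atomic_add!(a, 2)
@assert old == 1 && a[] == 3

# atomic_cas! compares against an expected value, swaps on a match,
# and returns whatever was stored before the operation.
prev = atomic_cas!(a, 3, 10)   # expected matches: a[] becomes 10
@assert prev == 3 && a[] == 10

prev = atomic_cas!(a, 5, 0)    # expected does not match: a[] unchanged
@assert prev == 10 && a[] == 10
```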
However, if I start Julia with `-C'armv8.4-a'`:
julia> a = Threads.Atomic{Int}(1)
Base.Threads.Atomic{Int64}(1)
julia> @code_native Threads.atomic_add!(a, 2)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:405 within `atomic_add!'
ldaddal x1, x0, [x0]
ret
; └
julia> @code_native Threads.atomic_cas!(a, 5, 2)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ atomics.jl:373 within `atomic_cas!'
casal x1, x2, [x0]
mov x0, x1
ret
; └
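To confirm which `-C`/`--cpu-target` string a running session was started with, the parsed options can be inspected via `Base.JLOptions()`; this is a sketch relying on an internal, unexported API that may change between Julia versions:

```julia
# Base.JLOptions() exposes Julia's parsed command-line options.
# cpu_target is a C string pointer and may be NULL depending on how
# the session was launched, so guard before converting it.
opts = Base.JLOptions()
cpu_target = opts.cpu_target == C_NULL ?
    "(unset)" : unsafe_string(opts.cpu_target)
println(cpu_target)
```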
Starting Julia without `-C` flags:
julia> using Octavian
julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000087 seconds (2 allocations: 40.578 KiB)
julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.425 μs (0.00% GC)
median time: 6.525 μs (0.00% GC)
mean time: 6.530 μs (0.00% GC)
maximum time: 14.592 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
With `-C'armv8.4-a'`:
julia> using Octavian
julia> M = K = N = 72; A = rand(M,K); B = rand(K,N); C1 = @time(A * B); C0 = similar(C1);
0.000100 seconds (2 allocations: 40.578 KiB)
julia> @benchmark matmul!($C0,$A,$B) # threaded matmul uses atomics
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.258 μs (0.00% GC)
median time: 6.525 μs (0.00% GC)
mean time: 6.532 μs (0.00% GC)
maximum time: 13.475 μs (0.00% GC)
--------------
samples: 10000
evals/sample: 5
I made non-x86 architectures (including the M1) ramp up thread use more slowly, because earlier performance tests suggested the M1 had higher threading overhead. Maybe that was partly because of atomics, and partly because of the lack of a shared L3 cache, and of course maybe for other reasons I don't know.
There's of course more than just atomics separating armv8.4-a/armv8.5-a from armv8-a.