Using the new debuginfo=true at JuliaLang/julia@d19eac0#commitcomment-183756517
A Claude analysis:
Benchmark Run Characterization — 2026-04-28 (Julia 1.14.0-DEV.2086)
What the debuginfo tells us
The _primary.debuginfo.jsonl logs RSS, GC time, and total allocations at the
start/finish of every benchmark. All 4818 benchmarks completed with status ok.
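As a rough illustration of how such a log could be turned into per-benchmark numbers, here is a minimal Python sketch. The record layout is an assumption: the field names `event`, `name`, `time_s`, `gc_time_s`, `alloc_bytes`, and `rss_bytes` are hypothetical and would need to be mapped to whatever the real `_primary.debuginfo.jsonl` emits.

```python
import json
from dataclasses import dataclass


@dataclass
class BenchmarkDelta:
    name: str
    wall_s: float
    gc_s: float
    alloc_bytes: int
    rss_delta_bytes: int

    @property
    def gc_fraction(self) -> float:
        # Share of the start->finish window spent in GC.
        return self.gc_s / self.wall_s if self.wall_s > 0 else 0.0


def parse_debuginfo(lines):
    """Pair start/finish records by benchmark name and return their deltas.

    All field names below are hypothetical placeholders, not the real schema.
    """
    starts = {}
    deltas = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["event"] == "benchmark_start":
            starts[rec["name"]] = rec
        elif rec["event"] == "benchmark_finish":
            start = starts.pop(rec["name"], None)
            if start is None:
                continue  # finish without a matching start; skip it
            deltas.append(BenchmarkDelta(
                name=rec["name"],
                wall_s=rec["time_s"] - start["time_s"],
                gc_s=rec["gc_time_s"] - start["gc_time_s"],
                alloc_bytes=rec["alloc_bytes"] - start["alloc_bytes"],
                rss_delta_bytes=rec["rss_bytes"] - start["rss_bytes"],
            ))
    return deltas


# Made-up records for illustration only (not taken from the real log).
sample = [
    '{"event": "benchmark_start", "name": "demo/a", "time_s": 0.0, '
    '"gc_time_s": 0.0, "alloc_bytes": 0, "rss_bytes": 1000}',
    '{"event": "benchmark_finish", "name": "demo/a", "time_s": 4.0, '
    '"gc_time_s": 3.0, "alloc_bytes": 500, "rss_bytes": 1200}',
]
deltas = parse_debuginfo(sample)
print(deltas[0].name, deltas[0].wall_s, deltas[0].gc_fraction)
```

Pairing on the benchmark name assumes names are unique within a run; if they are not, a sequence number or the record order would be a safer key.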
Why runs take so long
Total measured wall time: ~13,965s (~3.9 hours)
| Group | N | Wall time | GC% | Notes |
|---|---|---|---|---|
| array | 756 | 4,566s | 43% | Largest group; 11 TB allocated |
| scalar | 1363 | 1,698s | 64% | 1363 benchmarks × ~1.2s each |
| union | 434 | 1,549s | 70% | |
| sparse | 326 | 1,404s | 55% | |
| inference | 36 | 1,370s | 89% | 36 benchmarks, avg 38s each |
| linalg | 187 | 870s | 50% | |
| collection | 345 | 831s | 71% | |
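The per-group figures can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers from the table above. The "GC seconds" column is an estimate (wall time × GC%), so it inherits the rounding of the reported percentages.

```python
# (group, benchmark count, wall seconds, GC share) copied from the table above.
GROUPS = [
    ("array", 756, 4566, 0.43),
    ("scalar", 1363, 1698, 0.64),
    ("union", 434, 1549, 0.70),
    ("sparse", 326, 1404, 0.55),
    ("inference", 36, 1370, 0.89),
    ("linalg", 187, 870, 0.50),
    ("collection", 345, 831, 0.71),
]
TOTAL_WALL_S = 13965  # total measured wall time reported above


def summarize(groups):
    """Return (group, average seconds per benchmark, estimated GC seconds) rows."""
    return [(name, wall / n, wall * gc) for name, n, wall, gc in groups]


rows = summarize(GROUPS)
for name, avg_s, gc_s in rows:
    print(f"{name:<11}{avg_s:6.1f} s/benchmark {gc_s:8.0f} s in GC")

listed_wall = sum(wall for _, _, wall, _ in GROUPS)
listed_gc = sum(gc_s for _, _, gc_s in rows)
print(f"listed groups: {listed_wall} s "
      f"({listed_wall / TOTAL_WALL_S:.0%} of total), "
      f"about {listed_gc:.0f} s of it in GC")
```

The seven listed groups add up to 12,288s, about 88% of the measured total, and by this estimate a little over half of that time is spent in GC.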
Two distinct pathologies:
- Inference benchmarks have extreme inter-sample GC overhead. BenchmarkTools
  calls GC.gc() between samples (outside the timed region), but those calls fall
  inside the benchmark_start→benchmark_finish window the debuginfo measures.
  Each of the 36 inference benchmarks takes 38–48s of wall time, spending 89–99%
  of that in inter-sample GC.gc() calls cleaning up 40–60 GB of allocations from
  the previous sample. The inference measurements themselves are valid, but the
  collection cost dominates the wall clock. Reducing samples/evals for these
  benchmarks, or not retaining large intermediate compiler data structures
  between samples, would help significantly.
- array, scalar, and union dominate by sheer count: thousands of parameterized
  micro-benchmarks, each taking 0.5–5s. The array group alone: 756 benchmarks ×
  avg 6s = 4,566s, allocating ~11 TB total across the run. The same inter-sample
  GC dynamic applies: 43% of array wall time is GC between samples.
924 benchmarks (19%) spend >80% of their wall time in inter-sample GC. This is
the dominant overhead across the suite.
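A count like this can be computed from per-benchmark wall and GC times. The following is a minimal Python sketch; the input pairs are made up for illustration, and the helper name `high_gc_share` is hypothetical.

```python
def high_gc_share(benchmarks, threshold=0.8):
    """Return (count, fraction) of benchmarks whose GC share exceeds threshold.

    `benchmarks` is an iterable of (wall_seconds, gc_seconds) pairs.
    Entries with zero wall time are ignored.
    """
    pairs = [(wall, gc) for wall, gc in benchmarks if wall > 0]
    high = sum(1 for wall, gc in pairs if gc / wall > threshold)
    return high, (high / len(pairs) if pairs else 0.0)


# Made-up measurements, not taken from the real run.
demo = [(38.0, 35.0), (1.2, 0.5), (6.0, 2.6), (10.0, 9.5), (2.0, 1.0)]
count, frac = high_gc_share(demo)
print(count, f"{frac:.0%}")

# Sanity check of the reported figure: 924 of 4818 benchmarks.
reported = 924 / 4818
print(f"{reported:.1%}")
```

The last two lines only confirm that 924 out of 4818 is consistent with the quoted 19%.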
RSS growth anomalies — several benchmarks permanently grow process RSS by >1 GB:
| Benchmark | ΔRSS | Wall |
|---|---|---|
| shootout/pidigits | +4.4 GB | 9.3s |
| misc/23042 ComplexF32 | +1.8 GB | 1.5s |
| random/ranges RangeGenerator BigInt | +1.6 GB | 5.3s |
| union/array broadcast abs Bool | +1.2 GB | 11.7s |
| array/index sumcolon Matrix{Float32} | +1.1 GB | 7.9s |
These are likely GMP/MPFR arena growth or persistent caches. Since RSS never shrinks
between benchmarks, this inflates GC pressure cumulatively across the full run.
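Flagging such benchmarks amounts to comparing RSS before and after each one. Here is a minimal Python sketch, assuming RSS is available in bytes at benchmark start and finish and that "GB" means 10^9 bytes. Both are assumptions, and the helper name and sample values are hypothetical.

```python
GB = 10 ** 9  # assumption: decimal gigabytes; the log may use GiB instead


def rss_growers(records, min_growth_bytes=GB):
    """Return (name, growth_bytes) for benchmarks whose RSS grew by more than
    min_growth_bytes, largest growth first.

    `records` is an iterable of (name, rss_before_bytes, rss_after_bytes).
    """
    growth = [(name, after - before) for name, before, after in records]
    flagged = [(name, delta) for name, delta in growth if delta > min_growth_bytes]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up values for illustration only.
demo_records = [
    ("demo/small", 2 * GB, 2 * GB + 50_000_000),
    ("demo/big", 2 * GB, 6 * GB),
    ("demo/medium", 6 * GB, 7 * GB + 500_000_000),
    ("demo/shrinks", 8 * GB, 7 * GB),
]
for name, delta in rss_growers(demo_records):
    print(f"{name}: +{delta / GB:.1f} GB")
```

A per-benchmark delta alone cannot separate a one-time cache fill from a real leak; rerunning a flagged benchmark and checking whether RSS grows again would help tell the two apart.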
Potential speed improvements
- Reduce inference benchmark sample counts. The inference benchmarks are slow
  not because of the measured inference itself but because cleaning up 40–60 GB
  of allocations between samples accounts for most of the ~38s each benchmark
  takes. Fewer samples (they are already high-allocation) would dramatically cut
  their wall time without affecting result quality.
- Reduce the array group parameter matrix. At 756 benchmarks, it is the largest
  absolute time sink (4,566s). Pruning redundant size/type combinations would
  have the biggest wall-clock impact on total CI time.
- Investigate RSS leaks. shootout/pidigits (+4.4 GB), random/ranges BigInt,
  and misc/23042 permanently grow RSS, likely via GMP/MPFR arenas. This inflates
  heap size, and therefore GC pause duration, for all benchmarks that run after
  them.
- find group: 92% GC. 63 benchmarks spending 92% of wall time in inter-sample
  GC is anomalously high for bit/array search operations and warrants a dedicated
  allocation profile to understand why samples produce so much garbage.
- Precompile BaseBenchmarks. The 2–3s per-group JIT load time visible in
  primary.out (for linalg, union, array, etc.) would be eliminated by
  precompiling BaseBenchmarks with -g2 -O3 matching the Julia build flags.