
Make benchmark run faster #232

@IanButterworth

Using the new debuginfo=true at JuliaLang/julia@d19eac0#commitcomment-183756517

A Claude analysis:


Benchmark Run Characterization — 2026-04-28 (Julia 1.14.0-DEV.2086)

What the debuginfo tells us

The _primary.debuginfo.jsonl logs RSS, GC time, and total allocations at the
start/finish of every benchmark. All 4818 benchmarks completed with status ok.

Why runs take so long

Total measured wall time: ~13,965s (~3.9 hours)

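| Group      | N    | Wall time | GC% | Notes                           |
|------------|------|-----------|-----|---------------------------------|
| array      | 756  | 4,566s    | 43% | Largest group; 11 TB allocated  |
| scalar     | 1363 | 1,698s    | 64% | 1363 benchmarks × ~1.2s each    |
| union      | 434  | 1,549s    | 70% |                                 |
| sparse     | 326  | 1,404s    | 55% |                                 |
| inference  | 36   | 1,370s    | 89% | 36 benchmarks, avg 38s each     |
| linalg     | 187  | 870s      | 50% |                                 |
| collection | 345  | 831s      | 71% |                                 |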
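The per-benchmark averages quoted in the Notes column can be rechecked from the table itself. The following is a minimal sketch that hardcodes the counts and wall times listed above; the names `GROUPS`, `TOTAL_WALL_S`, and `summarize` are illustrative and not part of any benchmark tooling.

```python
# Figures copied from the table above: group -> (benchmark count, wall seconds).
GROUPS = {
    "array": (756, 4566),
    "scalar": (1363, 1698),
    "union": (434, 1549),
    "sparse": (326, 1404),
    "inference": (36, 1370),
    "linalg": (187, 870),
    "collection": (345, 831),
}
TOTAL_WALL_S = 13965  # total measured wall time reported above


def summarize(groups, total_wall_s):
    """Return {group: (avg seconds per benchmark, share of total wall time)}."""
    return {
        name: (wall / n, wall / total_wall_s)
        for name, (n, wall) in groups.items()
    }


if __name__ == "__main__":
    for name, (avg_s, share) in summarize(GROUPS, TOTAL_WALL_S).items():
        print(f"{name:<11} avg {avg_s:5.1f}s/benchmark  {share:5.1%} of total")
    listed = sum(wall for _, wall in GROUPS.values())
    print(f"listed groups: {listed}s of {TOTAL_WALL_S}s")
```

The seven listed groups sum to 12,288s, about 88% of the 13,965s total; the remainder comes from groups not shown in the table.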

Two distinct pathologies:

  1. Inference benchmarks have extreme inter-sample GC overhead. BenchmarkTools
    calls GC.gc() between samples (outside the timed region), but those collections
    fall inside the benchmark_start to benchmark_finish window the debuginfo measures.
    Each of the 36 inference benchmarks takes 38–48s of wall time, spending 89–99%
    of that in inter-sample GC.gc() calls cleaning up 40–60 GB of allocations from
    the previous sample. The inference measurements themselves are valid, but the
    collection cost dominates the wall clock. Reducing samples/evals for these
    benchmarks, or avoiding retaining large intermediate compiler data structures
    between samples, would help significantly.

  2. array, scalar, and union dominate by sheer count. These groups contain thousands
    of parameterized micro-benchmarks, each taking 0.5–5s. The array group alone:
    756 benchmarks × avg ~6s = 4,566s, allocating ~11 TB total across the run. The
    same inter-sample GC dynamic applies: 43% of array wall time is GC between samples.

924 benchmarks (19%) spend >80% of their wall time in inter-sample GC. This is
the dominant overhead across the suite.
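A count like the one above could be derived from the debuginfo log roughly as follows. This is only a sketch: the report says the log records GC time at the start and finish of every benchmark, but the JSON key names used here (`event`, `name`, `time_ns`, `gc_time_ns`) are assumptions, and the sample records are made up for illustration.

```python
import json

# Made-up records for illustration only; they are not taken from the real log.
# The key names ("event", "name", "time_ns", "gc_time_ns") are assumptions.
SAMPLE_LINES = [
    '{"event": "benchmark_start", "name": "example/a", "time_ns": 0, "gc_time_ns": 0}',
    '{"event": "benchmark_finish", "name": "example/a", "time_ns": 10000000000, "gc_time_ns": 9000000000}',
    '{"event": "benchmark_start", "name": "example/b", "time_ns": 10000000000, "gc_time_ns": 9000000000}',
    '{"event": "benchmark_finish", "name": "example/b", "time_ns": 12000000000, "gc_time_ns": 9500000000}',
]


def gc_fractions(lines):
    """Map benchmark name -> fraction of its wall time spent in GC.

    Assumes gc_time_ns is a cumulative counter, so the GC time of one
    benchmark is the difference between its finish and start records.
    """
    starts, fractions = {}, {}
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["event"] == "benchmark_start":
            starts[rec["name"]] = rec
        elif rec["event"] == "benchmark_finish":
            start = starts.pop(rec["name"], None)
            if start is None:
                continue  # finish without a matching start; skip it
            wall = rec["time_ns"] - start["time_ns"]
            gc = rec["gc_time_ns"] - start["gc_time_ns"]
            fractions[rec["name"]] = gc / wall if wall > 0 else 0.0
    return fractions


def gc_heavy(fractions, threshold=0.80):
    """Names of benchmarks whose GC share of wall time exceeds threshold."""
    return sorted(name for name, frac in fractions.items() if frac > threshold)


if __name__ == "__main__":
    print(gc_heavy(gc_fractions(SAMPLE_LINES)))
```

Against the real file, the same functions would be fed the lines of _primary.debuginfo.jsonl after adjusting the key names to whatever the log actually uses.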

RSS growth anomalies — several benchmarks permanently grow process RSS by >1 GB:

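| Benchmark                           | ΔRSS    | Wall time |
|-------------------------------------|---------|-----------|
| shootout/pidigits                   | +4.4 GB | 9.3s      |
| misc/23042 ComplexF32               | +1.8 GB | 1.5s      |
| random/ranges RangeGenerator BigInt | +1.6 GB | 5.3s      |
| union/array broadcast abs Bool      | +1.2 GB | 11.7s     |
| array/index sumcolon Matrix{Float32} | +1.1 GB | 7.9s     |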

These are likely GMP/MPFR arena growth or persistent caches. Since RSS never shrinks
between benchmarks, this inflates GC pressure cumulatively across the full run.
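Flagging such benchmarks amounts to comparing RSS at start and finish. A minimal sketch, again assuming hypothetical key names (`event`, `name`, `rss_bytes`) and using made-up sample records rather than real log data:

```python
import json

GIB = 1024 ** 3

# Made-up records for illustration only; the key names are assumptions.
SAMPLE_LINES = [
    json.dumps({"event": "benchmark_start", "name": "example/a", "rss_bytes": 1 * GIB}),
    json.dumps({"event": "benchmark_finish", "name": "example/a", "rss_bytes": 3 * GIB}),
    json.dumps({"event": "benchmark_start", "name": "example/b", "rss_bytes": 3 * GIB}),
    json.dumps({"event": "benchmark_finish", "name": "example/b", "rss_bytes": 3 * GIB + GIB // 2}),
]


def rss_growth(lines, min_delta_bytes=GIB):
    """Return [(name, delta_bytes)] for benchmarks whose RSS grew by more
    than min_delta_bytes between start and finish, largest growth first."""
    start_rss, grown = {}, []
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec["event"] == "benchmark_start":
            start_rss[rec["name"]] = rec["rss_bytes"]
        elif rec["event"] == "benchmark_finish" and rec["name"] in start_rss:
            delta = rec["rss_bytes"] - start_rss.pop(rec["name"])
            if delta > min_delta_bytes:
                grown.append((rec["name"], delta))
    return sorted(grown, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, delta in rss_growth(SAMPLE_LINES):
        print(f"{name}: +{delta / GIB:.1f} GiB")
```

A single start/finish delta does not by itself show that the growth is permanent; confirming that would also require checking that the start RSS of later benchmarks stays at the higher level.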

Potential speed improvements

  1. Reduce inference benchmark sample counts. The inference benchmarks are slow
    not because of the measured inference itself but because cleaning up 40–60 GB of
    allocations between samples dominates their wall time (~38s per benchmark on
    average). Since these benchmarks are already high-allocation, fewer samples
    would dramatically cut their wall time, likely without materially affecting
    result quality.

  2. Reduce array group parameter matrix. 756 benchmarks is the largest absolute
    time sink (4,566s). Pruning redundant size/type combinations would have the biggest
    wall-clock impact on total CI time.

  3. Investigate RSS leaks. shootout/pidigits (+4.4 GB), random/ranges BigInt,
    and misc/23042 permanently grow RSS, likely via GMP/MPFR arenas. This inflates
    heap size — and therefore GC pause duration — for all benchmarks that run after
    them.

  4. find group: 92% GC. 63 benchmarks spending 92% of wall time in inter-sample
    GC is anomalously high for bit/array search operations and warrants a dedicated
    allocation profile to understand why samples produce so much garbage.

  5. Precompile BaseBenchmarks. The 2–3s per-group JIT load time visible in
    primary.out (for linalg, union, array, etc.) would be eliminated by
    precompiling BaseBenchmarks with -g2 -O3 matching the Julia build flags.
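For a rough sense of how much wall time is at stake in the GC-related items above, the group table can be turned into an estimate of GC seconds per group. This sketch assumes the GC% column is the fraction of each group's wall time spent in GC, so the result is an upper bound on what eliminating that GC could save, not a prediction; `GROUPS` and `gc_seconds` are illustrative names.

```python
# Figures copied from the group table: group -> (wall seconds, GC fraction).
# Assumption: GC% is the share of the group's wall time spent in GC.
GROUPS = {
    "array": (4566, 0.43),
    "scalar": (1698, 0.64),
    "union": (1549, 0.70),
    "sparse": (1404, 0.55),
    "inference": (1370, 0.89),
    "linalg": (870, 0.50),
    "collection": (831, 0.71),
}
TOTAL_WALL_S = 13965  # total measured wall time reported in the analysis


def gc_seconds(groups):
    """Return {group: estimated seconds of wall time spent in GC}."""
    return {name: wall * frac for name, (wall, frac) in groups.items()}


if __name__ == "__main__":
    est = gc_seconds(GROUPS)
    for name, secs in sorted(est.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<11} ~{secs:7.0f}s in GC")
    total_gc = sum(est.values())
    print(f"listed groups: ~{total_gc:.0f}s in GC "
          f"({total_gc / TOTAL_WALL_S:.0%} of total wall time)")
```

Under that reading, the seven listed groups spend about 7,150s in GC, roughly half of the total run. The array group contributes the most in absolute terms (~1,960s) despite having the lowest GC percentage, and inference comes second (~1,220s) with only 36 benchmarks.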
