Make benchmark run faster

Based on a run using the new `debuginfo=true`: https://github.com/JuliaLang/julia/commit/d19eac062f4083e8cb4777323904cb87fab35200#commitcomment-183756517

A Claude analysis:

---

## Benchmark Run Characterization — 2026-04-28 (Julia 1.14.0-DEV.2086)

### What the debuginfo tells us

The `_primary.debuginfo.jsonl` file logs RSS, GC time, and total allocations at the
start and finish of every benchmark. All 4818 benchmarks completed with status `ok`.
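
A minimal sketch of how such a log could be reduced to per-benchmark wall time and GC share. The record layout is an assumption, not the actual schema of `_primary.debuginfo.jsonl`: each line is taken to be a JSON object with the hypothetical fields `event` (`benchmark_start` or `benchmark_finish`), `name`, `time_s`, and `gc_time_s` (cumulative GC seconds). The field names would need to be adjusted to match the real file.

```python
import json


def summarize(lines):
    """Pair start/finish records and return {name: (wall_s, gc_fraction)}.

    Field names (event, name, time_s, gc_time_s) are hypothetical.
    """
    starts = {}
    summary = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["event"] == "benchmark_start":
            starts[rec["name"]] = rec
        elif rec["event"] == "benchmark_finish":
            start = starts.pop(rec["name"], None)
            if start is None:
                continue  # finish without a matching start; skip it
            wall = rec["time_s"] - start["time_s"]
            gc = rec["gc_time_s"] - start["gc_time_s"]
            summary[rec["name"]] = (wall, gc / wall if wall > 0 else 0.0)
    return summary


# Synthetic records for illustration only; not taken from the real log.
sample = [
    '{"event": "benchmark_start", "name": "demo/a", "time_s": 0.0, "gc_time_s": 0.0}',
    '{"event": "benchmark_finish", "name": "demo/a", "time_s": 4.0, "gc_time_s": 3.0}',
]
print(summarize(sample))
```

Filtering the result for a GC fraction above 0.8 is one way a count like the ">80% in GC" figure further down could be reproduced.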

### Why runs take so long

**Total measured wall time: ~13,965s (~3.9 hours).** The seven groups below account for 12,288s (88%) of it:

| Group        |    N | Wall time  | GC%     | Notes                                 |
|--------------|------|------------|---------|---------------------------------------|
| `array`      |  756 | **4,566s** | 43%     | Largest by wall time; 11 TB allocated |
| `scalar`     | 1363 | 1,698s     | 64%     | 1363 benchmarks × ~1.2s each          |
| `union`      |  434 | 1,549s     | 70%     |                                       |
| `sparse`     |  326 | 1,404s     | 55%     |                                       |
| `inference`  |   36 | **1,370s** | **89%** | 36 benchmarks, avg **38s each**       |
| `linalg`     |  187 | 870s       | 50%     |                                       |
| `collection` |  345 | 831s       | 71%     |                                       |

**Two distinct pathologies:**

1. **Inference benchmarks have extreme inter-sample GC overhead.** BenchmarkTools
   calls `GC.gc()` between samples (outside the timed region), but those calls fall
   inside the `benchmark_start`→`benchmark_finish` window that the debuginfo measures.
   The 36 inference benchmarks average 38s of wall time each (up to 48s), and 89% of
   the group's wall time (up to 99% for individual benchmarks) goes to inter-sample
   `GC.gc()` calls that clean up 40–60 GB of allocations from the previous sample.
   The inference *measurements themselves are valid*, but the collection cost
   dominates the wall clock. Reducing samples/evals for these benchmarks, or avoiding
   retaining large intermediate compiler data structures between samples, would help
   significantly.

2. **`array`, `scalar`, and `union` dominate by sheer count.** Together they contain
   2,553 parameterized micro-benchmarks, typically taking 0.5–5s each. The `array`
   group alone: 756 benchmarks × avg 6s = 4,566s, allocating ~11 TB in total across
   the run. The same inter-sample GC dynamic applies: 43% of `array` wall time is GC
   between samples.

**924 benchmarks (19%) spend >80% of their wall time in inter-sample GC.** This is
the dominant overhead across the suite.
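
To compare groups by absolute GC cost rather than by percentage, the wall times and GC shares from the table above can be multiplied out. The sketch below uses only the rounded table values, so the results are approximate. By this measure `array` still loses the most time to GC (about 1,963s), followed by `inference` (about 1,219s).

```python
# (group, wall seconds, GC share of wall time), copied from the table above.
groups = [
    ("array", 4566, 0.43),
    ("scalar", 1698, 0.64),
    ("union", 1549, 0.70),
    ("sparse", 1404, 0.55),
    ("inference", 1370, 0.89),
    ("linalg", 870, 0.50),
    ("collection", 831, 0.71),
]


def gc_seconds(rows):
    """Estimate seconds spent in GC per group from wall time and GC share."""
    return {name: wall * share for name, wall, share in rows}


gc = gc_seconds(groups)
total_wall = sum(wall for _, wall, _ in groups)
total_gc = sum(gc.values())

# Print groups ordered by estimated GC seconds, largest first.
for name, secs in sorted(gc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<11}{secs:>8.0f}s")
print(f"GC share of listed groups: {total_gc / total_wall:.0%}")
```

Across the seven listed groups this comes to roughly 7,151s of GC out of 12,288s of wall time, or about 58%.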

**RSS growth anomalies** — several benchmarks permanently grow process RSS by >1 GB:

| Benchmark                              | ΔRSS    | Wall  |
|----------------------------------------|---------|-------|
| `shootout/pidigits`                    | +4.4 GB | 9.3s  |
| `misc/23042 ComplexF32`                | +1.8 GB | 1.5s  |
| `random/ranges RangeGenerator BigInt`  | +1.6 GB | 5.3s  |
| `union/array broadcast abs Bool`       | +1.2 GB | 11.7s |
| `array/index sumcolon Matrix{Float32}` | +1.1 GB | 7.9s  |

These are likely GMP/MPFR arena growth (for the BigInt benchmarks) or persistent
caches. Because RSS never shrinks between benchmarks, the growth accumulates and adds
GC pressure for the rest of the run.
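
A sketch of how benchmarks with large permanent RSS growth could be flagged automatically. It assumes the RSS at start and finish has already been extracted per benchmark as `(name, rss_before_bytes, rss_after_bytes)` tuples. That tuple layout and the 1 GiB threshold are illustrative choices, not something the debuginfo format defines.

```python
GIB = 1024 ** 3


def rss_growers(records, threshold_bytes=GIB):
    """Return (name, delta_bytes) for benchmarks whose RSS grew by more
    than threshold_bytes, sorted by growth, largest first."""
    grown = [
        (name, after - before)
        for name, before, after in records
        if after - before > threshold_bytes
    ]
    return sorted(grown, key=lambda item: item[1], reverse=True)


# Synthetic values for illustration only; not taken from the real log.
records = [
    ("demo/small", 2 * GIB, 2 * GIB + 100 * 1024 ** 2),
    ("demo/big", 2 * GIB, 5 * GIB),
    ("demo/medium", 5 * GIB, 7 * GIB),
]
for name, delta in rss_growers(records):
    print(f"{name}: +{delta / GIB:.1f} GiB")
```

Running a check like this on every benchmark run would show whether the same benchmarks grow RSS each time or whether the growth depends on execution order.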

### Potential speed improvements

1. **Reduce inference benchmark sample counts.** The inference benchmarks are slow
   not because of the measured inference itself but because cleaning up 40–60 GB of
   allocations between samples accounts for most of their ~38s average wall time.
   Taking fewer samples of these high-allocation benchmarks would cut their wall
   time substantially, likely with little effect on result quality.

2. **Reduce the `array` group parameter matrix.** At 4,566s across 756 benchmarks,
   `array` is the largest absolute time sink. Pruning redundant size/type
   combinations would have the biggest wall-clock impact on total CI time.

3. **Investigate RSS leaks.** `shootout/pidigits` (+4.4 GB),
   `random/ranges RangeGenerator BigInt` (+1.6 GB), and `misc/23042 ComplexF32`
   (+1.8 GB) permanently grow RSS, likely via GMP/MPFR arenas or persistent caches.
   This inflates heap size, and therefore GC pause duration, for all benchmarks that
   run after them.

4. **`find` group: 92% GC.** 63 benchmarks spending 92% of wall time in inter-sample
   GC is anomalously high for bit/array search operations and warrants a dedicated
   allocation profile to understand why samples produce so much garbage.

5. **Precompile BaseBenchmarks.** The 2–3s per-group JIT load time visible in
   `primary.out` (for `linalg`, `union`, `array`, etc.) would be eliminated by
   precompiling BaseBenchmarks with `-g2 -O3` matching the Julia build flags.
