Summary
After landing wish pg-test-perf, PG Tests (serial, 208 files, 3626 tests) consistently shows 1-3 flaky failures per run, with the failing test different every time. Across 7 consecutive CI runs on PR #1317:
| Run | Wall-clock | Flakes | Which tests |
|---|---|---|---|
| 1 | 145s (pre-timeout-bump) | 3 | events-stream drain, qa-runner × 2 |
| 2 | 64.9s | 2 | Team Manager createTeam, turn-close happy path |
| 3 | 67.5s | 1 | (unnamed) beforeEach timeout |
| 4 | 87.9s | 2 | serve lifecycle, events-stream cursor |
| 5 | 64.9s | 1 | Group 1 observability migrations |
| 6 | 76.1s | 1 | pg > register and get |
| 7 | 68.7s | 1 + 2 errors | pg > syncWishes upserts |
Never the same test twice. That's the classic signature of environmental noise — Blacksmith runner jitter, pgserve cold-start variance, NOTIFY/LISTEN timing, fixture-order dependence. Not a deterministic code bug.
Impact
- Merge velocity: every PR needs a `gh run rerun --failed` lottery to hit green.
- Signal erosion: engineers learn to ignore "1 fail" as noise, so real regressions hide in it.
Proposed approaches (ranked by ROI)
- Retry-failed-tests-once inside `bun test` (custom reporter or wrapper) — masks intermittents, surfaces consistent fails. The fastest ROI; ugly but pragmatic.
- Audit the 7 known flaky tests and fix them one-by-one. Each is likely a fixture-order or connection-reuse issue. Largest cleanup, but durable.
- Bump `bun test --timeout` to 30s (up from our 15s). Absorbs more Blacksmith jitter. Lowest effort.
- Quarantine the known flaky test files behind `process.env.GENIE_TEST_STRICT=1`. Default CI: skip them. Nightly / on-demand: run them. Gives fast merges without losing long-term signal.
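The retry-once option can be sketched as a small wrapper around a test body. This is a minimal sketch assuming a hypothetical `retryOnce` helper (not an existing bun test API): a first failure is retried exactly once, so a one-off intermittent is absorbed while a second failure still propagates.

```typescript
// Hypothetical retryOnce helper: not an existing bun test API.
// A first failure is retried once; a second failure propagates,
// so deterministic bugs still fail the run.
function retryOnce<T>(fn: () => T): T {
  try {
    return fn();
  } catch {
    // Exactly one retry: absorbs a single intermittent, nothing more.
    return fn();
  }
}

// Simulated flaky test body: fails on the first attempt, passes on the second.
let attempts = 0;
const result = retryOnce(() => {
  attempts++;
  if (attempts === 1) throw new Error("simulated runner jitter");
  return "pass";
});
console.log(result, attempts); // "pass" 2
```

A real wrapper would hook the reporter output and rerun only the failed files; this only shows the retry semantics.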
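The quarantine option can be sketched as a pure gate function. The file names and the `shouldRun` helper below are illustrative assumptions; only the `GENIE_TEST_STRICT=1` convention comes from the proposal itself.

```typescript
// Illustrative quarantine list: these file names are assumptions drawn from
// the flake table, not a confirmed inventory.
const QUARANTINED = new Set([
  "events-stream.test.ts",
  "qa-runner.test.ts",
]);

// Default CI skips quarantined files; GENIE_TEST_STRICT=1 (nightly /
// on-demand) runs everything, so long-term signal is preserved.
function shouldRun(file: string, env: Record<string, string | undefined>): boolean {
  if (env.GENIE_TEST_STRICT === "1") return true;
  return !QUARANTINED.has(file);
}

console.log(shouldRun("events-stream.test.ts", {}));                         // false (default CI)
console.log(shouldRun("events-stream.test.ts", { GENIE_TEST_STRICT: "1" })); // true  (strict)
console.log(shouldRun("pg-sync.test.ts", {}));                               // true  (not quarantined)
```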
Not this wish
pg-test-perf shipped the harness improvements that were the actual deliverable:
- Shared pgserve daemon (ram/template cache/lockfile reuse)
- Admin-connection reuse
- Lazy pgserve boot for non-PG tests
- macOS RAM-disk opt-in
- CI job split (unit-tests + pg-tests + umbrella)
The flake inventory is a repo-wide baseline that predates the wish — confirmed by observing dev's own CI runs (e.g., run 24794680717) failing on the same kind of intermittents before this branch existed.
Assign to whoever owns the flaky test files. Tag: flaky-test, ci-health.