PG Tests baseline flake rate ≈1/3626 per run — surfaced by wish pg-test-perf #1335

@namastex888

Summary

After landing wish pg-test-perf, PG Tests (serial, 208 files, 3626 tests) consistently shows 1-3 flaky failures per run, with the failing test different every time. Across 7 consecutive CI runs on PR #1317:

| Run | Wall-clock | Flakes | Which tests |
|-----|------------|--------|-------------|
| 1 | 145s (pre-timeout-bump) | 3 | events-stream drain, qa-runner × 2 |
| 2 | 64.9s | 2 | Team Manager createTeam, turn-close happy path |
| 3 | 67.5s | 1 | (unnamed) beforeEach timeout |
| 4 | 87.9s | 2 | serve lifecycle, events-stream cursor |
| 5 | 64.9s | 1 | Group 1 observability migrations |
| 6 | 76.1s | 1 | pg > register and get |
| 7 | 68.7s | 1 + 2 errors | pg > syncWishes upserts |

Never the same test twice. That's the classic signature of environmental noise (Blacksmith runner jitter, pgserve cold-start variance, NOTIFY/LISTEN timing, fixture-order dependence), not of a deterministic code bug.

Impact

  • Merge velocity: every PR needs a gh run rerun --failed lottery to hit green.
  • Signal erosion: engineers learn to ignore "1 fail" as noise. Real regressions hide in the noise.
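
To put a rough number on the rerun lottery, here is a back-of-the-envelope sketch using only the flake counts from the table above. It assumes every test fails independently with the same probability, which is a simplification (real flakes tend to cluster), and it counts run 7 as a single failure, ignoring its 2 errors. The variable names are illustrative, not part of the repo.

```typescript
// Flake counts per run, copied from the table in the Summary.
// Assumption: run 7 is counted as 1 failure; its 2 errors are ignored.
const flakesPerRun: number[] = [3, 2, 1, 2, 1, 1, 1];
const testsPerRun = 3626;

const totalFlakes = flakesPerRun.reduce((sum, n) => sum + n, 0);
const meanFlakesPerRun = totalFlakes / flakesPerRun.length;

// Estimated chance that any single test execution fails spuriously.
const perTestFailureRate = totalFlakes / (flakesPerRun.length * testsPerRun);

// Under the independence assumption, a run is green only if all tests pass.
const greenRunProbability = Math.pow(1 - perTestFailureRate, testsPerRun);

// Expected number of full runs until the first green one (geometric distribution).
const expectedRunsUntilGreen = 1 / greenRunProbability;

console.log({
  totalFlakes,
  meanFlakesPerRun,
  perTestFailureRate,
  greenRunProbability,
  expectedRunsUntilGreen,
});
```

Under these assumptions, only about one run in five would come back green on its own, which is consistent with none of the 7 observed runs being clean.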

Proposed approaches (ranked by ROI)

  1. Retry-failed-tests-once inside bun test (custom reporter or wrapper) — masks intermittents, surfaces consistent fails. The fastest ROI, ugly but pragmatic.
  2. Audit the tests that flaked across the 7 runs above and fix them one-by-one. Each is likely a fixture-order or connection-reuse issue. Largest cleanup, but durable.
  3. Bump bun test --timeout to 30s (up from our 15s; the flag takes milliseconds, so --timeout 30000). Absorbs more Blacksmith jitter. Lowest effort.
  4. Quarantine the known flaky test files behind process.env.GENIE_TEST_STRICT=1. Default CI: skip them. Nightly / on-demand: run them. Gives fast merges without losing long-term signal.
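
Approaches 1 and 4 could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `withRetry` and `shouldRunQuarantined` are hypothetical helper names, and only the `GENIE_TEST_STRICT=1` convention comes from the proposal above. The retry helper reports every masked failure through a callback so that intermittents stay visible instead of disappearing silently.

```typescript
// Hypothetical helper for approach 1: run an async test body and retry it
// up to `retries` times. A test that fails on every attempt still fails.
async function withRetry<T>(
  body: () => Promise<T>,
  retries: number = 1,
  onRetry?: (error: unknown, nextAttempt: number) => void,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await body();
    } catch (error) {
      lastError = error;
      // Only report when another attempt will actually follow.
      if (attempt < retries && onRetry) onRetry(error, attempt + 1);
    }
  }
  throw lastError;
}

// Hypothetical helper for approach 4: quarantined tests run only when
// GENIE_TEST_STRICT is set to "1" (nightly / on-demand), and are skipped otherwise.
function shouldRunQuarantined(env: Record<string, string | undefined>): boolean {
  return env.GENIE_TEST_STRICT === "1";
}
```

In a test file, the body of a flaky case could then be wrapped as `test("name", () => withRetry(async () => { /* ... */ }))`, and a quarantined case gated with bun:test's `test.skipIf(!shouldRunQuarantined(process.env))("name", ...)`. Which tests get either treatment is a judgment call for whoever owns the files.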

Not this wish

pg-test-perf shipped the harness improvements that were the actual deliverable:

  • Shared pgserve daemon (ram/template cache/lockfile reuse)
  • Admin-connection reuse
  • Lazy pgserve boot for non-PG tests
  • macOS RAM-disk opt-in
  • CI job split (unit-tests + pg-tests + umbrella)

The flake inventory is repo-wide baseline that predates the wish — confirmed by observing dev's own CI runs (e.g., run 24794680717) failing on the same kind of intermittents before this branch existed.

Assign to whoever owns the flaky test files. Tag: flaky-test, ci-health.
