fix: scope test temp dirs to RUNNER_TEMP instead of globbing shared /tmp #5239

Merged
paolino merged 1 commit into master from fix/cleanup-temp-directories on Apr 20, 2026

@paolino paolino commented Apr 10, 2026

Problem

The CI build host (zur1-s-d-029) runs ~28 GHA self-hosted runners
as a single user against one shared tmpfs /tmp. Any job running
rm -rf /tmp/e2e-* /tmp/test-cluster* on if: always() will wipe
sibling runners' live VolatileDB files.

Under UTxO-HD (cardano-node >= 10.7.0) the consensus layer re-opens
VolatileDB files by path through fs-api, so an unlinked
blocks-*.dat now crashes the node with
ApiMisuse (ClosedDBError (UnexpectedFailure (FileSystemError FsResourceDoesNotExist …)))
instead of tolerating the unlink through its already-open file descriptors.
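
The difference between the two behaviours can be reproduced outside the node. The following is a minimal Python sketch of the POSIX semantics involved, not the node's or fs-api's actual code; the function name and the temporary directory prefix are hypothetical, and it assumes a POSIX filesystem such as the tmpfs described above.

```python
import os
import tempfile


def read_after_unlink():
    """Return (bytes read via the open descriptor, whether re-opening by path failed)."""
    workdir = tempfile.mkdtemp(prefix="unlink-demo-")  # hypothetical scratch dir
    path = os.path.join(workdir, "blocks-0.dat")
    with open(path, "wb") as f:
        f.write(b"block data")

    handle = open(path, "rb")  # descriptor opened before the unlink
    os.unlink(path)            # stands in for a sibling job's `rm -rf`

    try:
        # Still readable: the inode stays alive while a descriptor is open.
        via_fd = handle.read()
    finally:
        handle.close()

    try:
        # Re-opening by path is what fails once the name is gone.
        open(path, "rb").close()
        reopen_failed = False
    except FileNotFoundError:
        reopen_failed = True

    os.rmdir(workdir)
    return via_fd, reopen_failed


if __name__ == "__main__":
    print(read_after_unlink())
```

A process that only reads through descriptors it already holds survives the unlink; one that looks the file up by name again does not, which matches the failure mode described above.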

This reproduced on master (253d290bfd): the Conway Integration Tests
crashed at 15:46:43 UTC on 2026-04-20 with two pool nodes failing
simultaneously on /tmp/test-cluster436150/pool-*/db/volatile/blocks-0.dat.
See also upstream ouroboros-consensus#1991.

Fix

  • Set TMPDIR: ${{ runner.temp }} on every job that launches a local
    test cluster or E2E run. $RUNNER_TEMP is per-job, per-runner, and
    auto-cleaned by the runner service between jobs, so clusters live
    in a private directory that sibling runners cannot touch.
  • Drop the pre-existing dangerous cleanup steps in
    .github/workflows/linux-e2e.yml and .github/workflows/release.yml;
    $RUNNER_TEMP makes them redundant.
  • Replace the shared fixed path TMPDIR: /tmp/gha-bench in
    linux-benchmarks.yml and restoration-benchmarks.yml with
    $RUNNER_TEMP for the same reason.

No new cleanup steps are introduced.
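
To illustrate why scoping TMPDIR isolates runners, here is a small Python sketch under stated assumptions: the two directories stand in for two runners' $RUNNER_TEMP values, and `make_cluster_dir` and `cleanup_clusters` are hypothetical helpers, not code from this repository. The glob cleanup appears only to show that even a cleanup like the one removed here could not reach a sibling's directory; this PR itself adds no cleanup.

```python
import glob
import os
import shutil
import tempfile


def make_cluster_dir(env):
    """Create a test-cluster directory under the job's TMPDIR (default /tmp)."""
    base = env.get("TMPDIR", "/tmp")
    return tempfile.mkdtemp(prefix="test-cluster", dir=base)


def cleanup_clusters(env):
    """Glob-delete cluster directories, but only inside this job's TMPDIR."""
    base = env.get("TMPDIR", "/tmp")
    for cluster in glob.glob(os.path.join(base, "test-cluster*")):
        shutil.rmtree(cluster)


def demo():
    # Two private directories standing in for two runners' RUNNER_TEMP.
    runner_a = tempfile.mkdtemp(prefix="runner-a-")
    runner_b = tempfile.mkdtemp(prefix="runner-b-")
    env_a = {"TMPDIR": runner_a}
    env_b = {"TMPDIR": runner_b}

    cluster_a = make_cluster_dir(env_a)
    cluster_b = make_cluster_dir(env_b)

    cleanup_clusters(env_a)  # job A cleans up after itself

    # A's cluster is gone; B's cluster is untouched.
    result = (os.path.exists(cluster_a), os.path.exists(cluster_b))

    shutil.rmtree(runner_a)
    shutil.rmtree(runner_b)
    return result


if __name__ == "__main__":
    print(demo())
```

With both environments pointing at the same shared directory instead, the same cleanup would delete both clusters, which is the situation described under Problem.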

@paolino paolino force-pushed the fix/cleanup-temp-directories branch 4 times, most recently from 8db20df to 4b1e2bc Compare April 20, 2026 17:34
@paolino paolino changed the title from "fix: clean up temp directories in all workflows" to "fix: scope test temp dirs to RUNNER_TEMP instead of globbing shared /tmp" Apr 20, 2026
The build host runs ~28 GHA runners as a single user against one tmpfs
/tmp, so any job running `rm -rf /tmp/e2e-* /tmp/test-cluster*` on
`if: always()` wipes sibling runners' live VolatileDB files. Under
UTxO-HD (node >= 10.7.0) the consensus layer re-opens these files by
path and crashes with `FsResourceDoesNotExist` instead of tolerating
the unlink via open fd's, which manifested as the Conway Integration
Tests failure on master at 253d290.

Point TMPDIR at $RUNNER_TEMP (per-job, per-runner, auto-cleaned by
the runner service) so test clusters live in a private directory that
sibling runners cannot touch. Drop the now-redundant (and dangerous)
`rm -rf /tmp/...` cleanup steps from linux-e2e and release. Replace
the shared `/tmp/gha-bench` fixed path in benchmarks with RUNNER_TEMP
for the same reason.
@paolino paolino force-pushed the fix/cleanup-temp-directories branch from 4b1e2bc to 1b4ffbd Compare April 20, 2026 17:51
@paolino paolino merged commit b681921 into master Apr 20, 2026
57 checks passed
@paolino paolino deleted the fix/cleanup-temp-directories branch April 20, 2026 18:15