hostagent: gate Ready on raw-TCP SSH banner probe #4967

Draft
WeishiZ wants to merge 1 commit into lima-vm:master from WeishiZ:hostagent-ssh-banner-readiness


@WeishiZ WeishiZ commented May 12, 2026

Summary

The host agent fires Ready after essentialRequirements succeeds. The first SSH check there runs a no-op script via Lima's own SSH client, which has internal retry-on-failure behavior — so the check passes when Lima's client can eventually authenticate, not necessarily when a fresh external connection on 127.0.0.1:<sshLocalPort> can read the SSH banner.

That coincidental correctness breaks for external tools (ssh-keyscan, sftp, rsync, …) that connect immediately after limactl start returns: the host agent has accepted on the forwarded port, the proxy into the guest has no peer yet (guest sshd hasn't bound :22), and the first write() after connect() returns EPIPE / Broken pipe.

This PR adds a new essential requirement, sshLocalPort serves an SSH banner, at the head of the list. It opens a raw TCP connection from the host agent, reads the SSH identification string per RFC 4253 §4.2, and validates the prefix (SSH-2.0- / SSH-1.99-). No SSH client is in the loop, so there is no internal retry that can mask the race; waitForRequirements wraps the probe in its existing 3 s-backoff retry loop, so a transient EPIPE during bring-up just causes another attempt.

Why a new requirement instead of strengthening the existing ssh one

The existing ssh requirement runs a script via ssh.ExecuteScript, which uses Lima's own SSH client. The client retries internally before surfacing failure to waitForRequirement, so a transient EPIPE on the host-side TCP forwarder is invisible to the requirement. Reworking the existing check to be native Go would change its semantics for everyone; adding a new check that asserts a different invariant ("the public-facing path is end-to-end usable for a cold external client") keeps the existing behavior and stacks a stricter gate on top.

Code shape

  • pkg/hostagent/requirements.go
    • requirement gains a check func(ctx context.Context) error field. Exactly one of script / check must be set per requirement.
    • waitForRequirement dispatches to r.check(ctx) when present, otherwise the existing script-via-SSH path. The two paths are mutually exclusive.
    • waitForRequirements now takes a context.Context. The 3 s backoff between attempts is select-ed against ctx.Done() so cancellation propagates.
    • New probeSSHBannerOnLocalPort(ctx, port): net.Dialer.DialContext (5 s timeout) → SetReadDeadline(now+5s) → bufio.NewReader.ReadString('\n') → prefix check. ~25 lines.
    • New requirement added at the head of essentialRequirements(). The remaining requirements are unchanged.
  • pkg/hostagent/hostagent.go: three call sites of waitForRequirements updated to pass ctx.
  • pkg/hostagent/requirements_test.go (new): six pure-Go tests for the probe — banner OK, SSH-1.99 legacy prefix, accept-then-close (the exact race shape), wrong banner, no listener, and a hung-write read-deadline test. No VM needed.

Empirical findings

Repro: probing 127.0.0.1:<sshLocalPort> with ssh-keyscan every 50 ms during limactl start, on macOS aarch64 + vz driver + Ubuntu 24.04 cloudimg + a pinned ssh.localPort. 10 restart cycles + 1 first-boot cycle.

| scenario | invoked → returned | EPIPE window | EPIPE after limactl start returned |
| --- | --- | --- | --- |
| restart × 10 | 10–11 s | 7–8 s during boot, closes ~3.4 s before return | 0 across ~1600 post-return samples |
| first boot × 1 | 38.9 s | ~11 s during boot, closes ~28 s before return | 0 |

The underlying race window is present in every cycle. On this hardware/template combination, the requirements that follow ssh (user session is ready for ssh, Explicitly start ssh ControlMaster, and boot scripts must have finished) happen to wait long enough that the race has closed before Ready fires. The race-after-Ready scenario was not reproduced on this rig; downstream reports of it (imbue-ai/mngr#1580, pyinfra/ansible retry loops) suggest it surfaces on slower hardware, with --plain mode (which skips the post-ssh checks), or with non-Linux guests (where essentialRequirements returns after ssh + ControlMaster only; see the early return req at lines 179–182).

This change makes the wait explicit and signal-driven instead of relying on later checks to coincidentally provide enough buffer.

Test plan

  • go test ./pkg/hostagent/... — all pass (6 new probe tests, deadline test confirms ~5 s timing)
  • golangci-lint run ./... — 0 issues (with the pinned v2.12.1 from hack/tools)
  • gofmt -l — clean
  • go vet ./... — clean
  • go build ./cmd/limactl — clean
  • End-to-end: 10 stop/start cycles + 1 first-boot with both the unpatched binary and a binary including this patch. With the patch the new requirement appears in the boot log as essential requirement 1 of 4: "sshLocalPort serves an SSH banner" and gates Ready. No boot-time regression on this hardware.

Out of scope

  • A vsock-based guest-side SSH readiness signal via the Lima guest agent. That would be the deeper fix but requires guest cooperation and changes more code.
  • Replacing the existing script-based ssh requirement with a native check. Keeping both preserves the existing behavior and stacks a stricter external-facing check on top.

Signed-off-by: Weishi Z <amwish.zeng@gmail.com>

The host agent fires its Ready event after essentialRequirements is
satisfied. Today the first SSH-related check runs a no-op script via
Lima's own SSH client, which has internal retry-on-failure behavior.
That makes the check coincidentally correct rather than a direct
statement about whether external clients can use the forwarded SSH port.

There is a race during VM bring-up where the host agent starts accepting
TCP connections on 127.0.0.1:<sshLocalPort> before guest sshd has bound
:22 inside the guest. A fresh external connection in that window gets
accepted, the proxy has no live peer, the host side closes, and the
client's first write fails with EPIPE ("Broken pipe").

The race is consistently reproducible: probing the forwarded port with
ssh-keyscan during `limactl start` shows 7-8s of EPIPE on restart
(vz, macOS aarch64, Ubuntu 24.04 cloudimg) and ~11s on first boot. On
well-provisioned hardware the subsequent requirements (user session,
ControlMaster, final boot scripts) coincidentally wait long enough that
the race has closed before Ready fires. With --plain mode, non-Linux
guests (which return immediately after the ssh requirement), or slower
hardware, the race can extend past `limactl start` returning, and
downstream tools that invoke ssh-keyscan / sftp / rsync immediately
after see EPIPE.

This commit adds a new essential requirement, "sshLocalPort serves an
SSH banner", at the head of the list. It opens a fresh raw TCP
connection to 127.0.0.1:<sshLocalPort>, reads the SSH identification
string per RFC4253 §4.2, and validates the prefix. No SSH client is
involved, so there is no internal retry that could mask the race.
waitForRequirements wraps the probe in its existing 3-second backoff
retry loop, so a transient EPIPE during bring-up just causes another
attempt rather than failing the start.

To support host-side native checks alongside the existing script-based
ones, the requirement struct gains a `check func(ctx) error` field;
waitForRequirement dispatches between the two (mutually exclusive per
requirement). context is now threaded through waitForRequirements so
the retry loop honors cancellation.

Pure-Go unit tests cover banner success, the SSH-1.99 legacy prefix,
accept-then-close (the exact race shape), wrong banner, no listener,
and a hung-write read-deadline case.

Signed-off-by: Weishi Z <amwish.zeng@gmail.com>
@jandubois
Member

AI review has suggestions (nothing serious, but please take a look given that the PR is also being done by AI): https://jandubois.github.io/lima/20260512-213934-pr-4967.html

@WeishiZ WeishiZ marked this pull request as draft May 13, 2026 07:57
@WeishiZ
Author

WeishiZ commented May 13, 2026

> AI review has suggestions (nothing serious, but please take a look given that the PR is also being done by AI): https://jandubois.github.io/lima/20260512-213934-pr-4967.html

Hey @jandubois thanks for the review! (And sorry this was intended to be a draft, not ready)
I've been trying to find a way to reproduce the race consistently so I can capture it in a sort of unit test.
I plan to work on that a bit more. And I'll address the reviews before marking this ready.

if err != nil {
	return fmt.Errorf("read SSH banner from %s: %w", addr, err)
}
if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") {

Member

  • Who is using SSH-1.99?
  • What happens when SSH-2.1+ is released?

Member

Also, please consider submitting a GitHub issue prior to submitting a non-trivial PR
