hostagent: gate Ready on raw-TCP SSH banner probe#4967
Conversation
The host agent fires its Ready event after essentialRequirements is
satisfied. Today the first SSH-related check runs a no-op script via
Lima's own SSH client, which has internal retry-on-failure behavior.
That makes the check coincidentally correct rather than a direct
statement about whether external clients can use the forwarded SSH port.
There is a race during VM bring-up where the host agent starts accepting
TCP connections on 127.0.0.1:<sshLocalPort> before guest sshd has bound
:22 inside the guest. A fresh external connection in that window gets
accepted, the proxy has no live peer, the host side closes, and the
client's first write fails with EPIPE ("Broken pipe").
The race is consistently reproducible: probing the forwarded port with
ssh-keyscan during `limactl start` shows 7-8s of EPIPE on restart
(vz, macOS aarch64, Ubuntu 24.04 cloudimg) and ~11s on first boot. On
well-provisioned hardware the subsequent requirements (user session,
ControlMaster, final boot scripts) coincidentally wait long enough that
the race has closed before Ready fires. With --plain mode, non-Linux
guests (which return immediately after the ssh requirement), or slower
hardware, the race can extend past `limactl start` returning, and
downstream tools that invoke ssh-keyscan / sftp / rsync immediately
after see EPIPE.
This commit adds a new essential requirement, "sshLocalPort serves an
SSH banner", at the head of the list. It opens a fresh raw TCP
connection to 127.0.0.1:<sshLocalPort>, reads the SSH identification
string per RFC4253 §4.2, and validates the prefix. No SSH client is
involved, so there is no internal retry that could mask the race.
waitForRequirements wraps the probe in its existing 3-second backoff
retry loop, so a transient EPIPE during bring-up just causes another
attempt rather than failing the start.
To support host-side native checks alongside the existing script-based
ones, the requirement struct gains a `check func(ctx) error` field;
waitForRequirement dispatches between the two (mutually exclusive per
requirement). context is now threaded through waitForRequirements so
the retry loop honors cancellation.
Pure-Go unit tests cover banner success, the SSH-1.99 legacy prefix,
accept-then-close (the exact race shape), wrong banner, no listener,
and a hung-write read-deadline case.
Signed-off-by: Weishi Z <amwish.zeng@gmail.com>
|
AI review has suggestions (nothing serious, but please take a look given that the PR is also being done by AI): https://jandubois.github.io/lima/20260512-213934-pr-4967.html |
Hey @jandubois thanks for the review! (And sorry this was intended to be a draft, not ready) |
| if err != nil { | ||
| return fmt.Errorf("read SSH banner from %s: %w", addr, err) | ||
| } | ||
| if !strings.HasPrefix(banner, "SSH-2.0-") && !strings.HasPrefix(banner, "SSH-1.99-") { |
There was a problem hiding this comment.
- Who is using SSH-1.99?
- What happens when SSH-2.1+ is released?
There was a problem hiding this comment.
Also, please consider submitting a GitHub issue prior to submitting a non-trivial PR
Summary
The host agent fires
ReadyafteressentialRequirementssucceeds. The first SSH check there runs a no-op script via Lima's own SSH client, which has internal retry-on-failure behavior — so the check passes when Lima's client can eventually authenticate, not necessarily when a fresh external connection on127.0.0.1:<sshLocalPort>can read the SSH banner.That coincidental correctness breaks for external tools (
ssh-keyscan,sftp,rsync, …) that connect immediately afterlimactl startreturns: the host agent has accepted on the forwarded port, the proxy into the guest has no peer yet (guest sshd hasn't bound:22), and the firstwrite()afterconnect()returnsEPIPE/Broken pipe.This PR adds a new essential requirement,
sshLocalPort serves an SSH banner, at the head of the list. It opens a raw TCP connection from the host agent, reads the SSH identification string per RFC 4253 §4.2, and validates the prefix (SSH-2.0-/SSH-1.99-). No SSH client is in the loop, so there is no internal retry that can mask the race;waitForRequirementswraps the probe in its existing 3 s-backoff retry loop, so a transient EPIPE during bring-up just causes another attempt.Why a new requirement instead of strengthening the existing
sshoneThe existing
sshrequirement runs a script viassh.ExecuteScript, which uses Lima's own SSH client. The client retries internally before surfacing failure towaitForRequirement, so a transient EPIPE on the host-side TCP forwarder is invisible to the requirement. Reworking the existing check to be native Go would change its semantics for everyone; adding a new check that asserts a different invariant ("the public-facing path is end-to-end usable for a cold external client") keeps the existing behavior and stacks a stricter gate on top.Code shape
pkg/hostagent/requirements.gorequirementgains acheck func(ctx context.Context) errorfield. Exactly one ofscript/checkmust be set per requirement.waitForRequirementdispatches tor.check(ctx)when present, otherwise the existing script-via-SSH path. The two paths are mutually exclusive.waitForRequirementsnow takes acontext.Context. The 3 s backoff between attempts isselect-ed againstctx.Done()so cancellation propagates.probeSSHBannerOnLocalPort(ctx, port):net.Dialer.DialContext(5 s timeout) →SetReadDeadline(now+5s)→bufio.NewReader.ReadString('\n')→ prefix check. ~25 lines.essentialRequirements(). The remaining requirements are unchanged.pkg/hostagent/hostagent.go: three call sites ofwaitForRequirementsupdated to passctx.pkg/hostagent/requirements_test.go(new): six pure-Go tests for the probe — banner OK,SSH-1.99legacy prefix, accept-then-close (the exact race shape), wrong banner, no listener, and a hung-write read-deadline test. No VM needed.Empirical findings
Repro: probing
127.0.0.1:<sshLocalPort>withssh-keyscanevery 50 ms duringlimactl start, on macOS aarch64 +vzdriver + Ubuntu 24.04 cloudimg + a pinnedssh.localPort. 10 restart cycles + 1 first-boot cycle.limactl startreturnedThe underlying race window is present every single cycle. On this hardware/template combination, the requirements that follow
ssh—user session is ready for ssh,Explicitly start ssh ControlMaster, andboot scripts must have finished— happen to wait long enough that the race has closed beforeReadyfires. The race-after-Ready scenario was not reproduced on this rig; downstream reports of it (imbue-ai/mngr#1580, pyinfra/ansible retry loops) suggest it surfaces on slower hardware, with--plainmode (which skips the post-sshchecks), or with non-Linux guests (whereessentialRequirementsreturns afterssh+ControlMasteronly — see the earlyreturn reqat lines 179–182).This change makes the wait explicit and signal-driven instead of relying on later checks to coincidentally provide enough buffer.
Test plan
go test ./pkg/hostagent/...— all pass (6 new probe tests, deadline test confirms ~5 s timing)golangci-lint run ./...— 0 issues (with the pinnedv2.12.1fromhack/tools)gofmt -l— cleango vet ./...— cleango build ./cmd/limactl— cleanessential requirement 1 of 4: "sshLocalPort serves an SSH banner"and gatesReady. No boot-time regression on this hardware.Out of scope
sshrequirement with a native check. Keeping both preserves the existing behavior and stacks a stricter external-facing check on top.Signed-off-by: Weishi Z amwish.zeng@gmail.com