wsl2: keepAlive / provisionVM goroutines deadlock on unbuffered errCh after instance stop #4957

@mn-ram

Description

The keepAlive goroutine in pkg/driver/wsl2/vm_windows.go, and its sibling writer in provisionVM in the same file, write to an unbuffered errCh without a context-aware select. Once the hostagent stops draining errCh during shutdown, the writer blocks indefinitely. This is the same shape of bug that was fixed for the VZ driver in #4922 and for the WSL2 hot-loop in #4892, but it sits on a different code path that those PRs did not touch.

// pkg/driver/wsl2/vm_windows.go
func keepAlive(ctx context.Context, distroName string, errCh chan<- error) {
    keepAliveCmd := exec.CommandContext(ctx, "wsl.exe", "-d", distroName, "bash", "-c",
        "nohup sleep 2147483647d >/dev/null 2>&1")
    go func() {
        if err := keepAliveCmd.Run(); err != nil {
            errCh <- fmt.Errorf("error running wsl keepAlive command: %w", err)
        }
    }()
}

errCh is allocated unbuffered in (*LimaWslDriver).Start (wsl_driver_windows.go:245):

errCh := make(chan error)

It has exactly one consumer, in (*HostAgent).startRoutinesAndWait:

select {
case driverErr := <-errCh:
    logrus.Infof("Driver stopped due to error: %q", driverErr)
case sig := <-a.signalCh:
    logrus.Infof("Received %s, shutting down the host agent", osutil.SignalName(sig))
}
// after this point, no more reads from errCh
if closeErr := a.close(); closeErr != nil { ... }
cancelHA()
return a.driver.Stop(ctx)

The race on every limactl stop of a WSL2 instance:

  1. SIGTERM lands; the outer select picks the signalCh arm. Nobody will read errCh again.
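  2. cancelHA() cancels the driver ctx, so exec.CommandContext kills the wsl.exe subprocess via os.Process.Kill (on Windows this terminates the process; there is no Unix-style SIGKILL).
  3. keepAliveCmd.Run() returns a non-nil error because the process was killed.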
  4. The goroutine reaches errCh <- fmt.Errorf(...).
  5. Send blocks forever — unbuffered channel, no consumer, no case <-ctx.Done() fallback.

The goroutine remains parked on chan send for the rest of the hostagent process's lifetime, retaining the captured *exec.Cmd plus its stdio pipes.

The same hazard exists on the errCh <- fmt.Errorf(...) in provisionVM's goroutine (also vm_windows.go).

Reproduction (deterministic)

limactl start --vm-type=wsl2 --name=demo template://default-windows
# wait for ready
limactl stop demo

The bug is structural — unbuffered channel + no consumer + no ctx-aware send — so it fires on every stop. In an in-process test that cancels the driver ctx without exiting the process, a pprof.Lookup("goroutine") snapshot shows one goroutine parked at chan send in vm_windows.go:keepAlive.func1.

Fix

Mirror the VZ fix in #4922 inside the WSL2 driver:

  1. Add a trySendErr(ctx, errCh, err) helper that selects on ctx.Done() so writers cannot block once the consumer is gone.
  2. Use it at both writers in keepAlive and provisionVM.
  3. Buffer errCh to size 2 in (*LimaWslDriver).Start, one slot for each of the two writers, so a shutdown-time error can be queued without blocking even when no reader is waiting.

PR follows.

Environment

Affects every WSL2 instance (vmType: wsl2) on Windows. Found by code inspection against master.
