Skip to content

os/exec: syscall.forkExec hang when spawning multiple processes concurrently on darwin #61080

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
JacobOaks opened this issue Jun 29, 2023 · 8 comments
Assignees
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Milestone

Comments

@JacobOaks
Copy link

JacobOaks commented Jun 29, 2023

What version of Go are you using (go version)?

$ go version
go version go1.21rc2 darwin/arm64

Does this issue reproduce with the latest release?

Yes, it reproduces since this change.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/joaks/Library/Caches/go-build'
GOENV='/Users/joaks/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/joaks/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/joaks/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/joaks/go/src/github.com/golang/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/joaks/go/src/github.com/golang/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21rc2'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/xj/2wbc4_xn293gkz7_6rzxsz5w0000gn/T/go-build3971151895=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

We have a code generator that we use at Uber that spawns up many concurrent child processes that communicate via stdin & stdout. While doing internal testing with Go1.21rc2, we noticed the code generator hanging. A very minimal runnable repro can be found in this repository: https://github.com/JacobOaks/Go1.21rc2-syscall.forkExec-hanging-repro.

Essentially, we are spinning up a bunch of external processes with stdin & stdout pipes concurrently. Something like (see link above for full repro):

func spawn(binaryPath string, n int) []*client {
	clients := make([]*client, n)
	for i := 0; i < n; i++ {
		clients[i] = newClient(binaryPath)
	}

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		client := clients[i]
		go func() {
			if err := client.start(); err != nil {
				panic("TODO")
			}
			wg.Done()
		}()
	}
	wg.Wait()
	return clients
}

type client struct {
	cmd *exec.Cmd

	stdout io.ReadCloser
	stdin  io.WriteCloser
}

func newClient(binary string) *client {
	return &client{
		cmd: exec.Command(binary),
	}
}

func (c *client) start() error {
	var err error
	c.stdout, err = c.cmd.StdoutPipe()
	if err != nil {
		return fmt.Errorf("create stdout pipe: %w", err)
	}
	c.stdin, err = c.cmd.StdinPipe()
	if err != nil {
		return fmt.Errorf("create stdin pipe: %w", err)
	}
	if err = c.cmd.Start(); err != nil {
		return fmt.Errorf("run cmd: %w", err)
	}
	return nil
}

Attaching delve to the hanging process, we notice the issue occurs in cmd.Start, where syscall.forkExec seems to hang:

(dlv) grs
  Goroutine 1 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/sema.go:62 sync.runtime_Semacquire (0x1026bf57c) [semacquire]
  Goroutine 2 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/proc.go:399 runtime.gopark (0x102695198) [force gc (idle)]
  Goroutine 3 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/proc.go:399 runtime.gopark (0x102695198) [GC sweep wait]
  Goroutine 4 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/proc.go:399 runtime.gopark (0x102695198) [GC scavenge wait]
  Goroutine 5 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/proc.go:399 runtime.gopark (0x102695198) [finalizer wait]
  Goroutine 12 - User: /Users/joaks/go/src/github.com/golang/go/src/runtime/sys_darwin.go:24 syscall.syscall (0x1026bfaf8) (thread 18311682) [timer goroutine (idle)]
[6 goroutines]
(dlv) gr 12
Switched from 0 to 12 (thread 18311682)
(dlv) stack
 0  0x000000018f884acc in ???
    at ?:-1
 1  0x00000001026c0b58 in runtime.systemstack_switch
    at /Users/joaks/go/src/github.com/golang/go/src/runtime/asm_arm64.s:200
 2  0x00000001026b19dc in runtime.libcCall
    at /Users/joaks/go/src/github.com/golang/go/src/runtime/sys_libc.go:49
 3  0x00000001026bfaf8 in syscall.syscall
    at /Users/joaks/go/src/github.com/golang/go/src/runtime/sys_darwin.go:24
 4  0x00000001026daa5c in syscall.readlen
    at /Users/joaks/go/src/github.com/golang/go/src/syscall/syscall_darwin.go:242
 5  0x00000001026d9c30 in syscall.forkExec
    at /Users/joaks/go/src/github.com/golang/go/src/syscall/exec_unix.go:217
 6  0x00000001026e9628 in syscall.StartProcess
    at /Users/joaks/go/src/github.com/golang/go/src/syscall/exec_unix.go:334
 7  0x00000001026e9628 in os.startProcess
    at /Users/joaks/go/src/github.com/golang/go/src/os/exec_posix.go:54
 8  0x00000001026e9340 in os.StartProcess
    at /Users/joaks/go/src/github.com/golang/go/src/os/exec.go:111
 9  0x00000001026fc534 in os/exec.(*Cmd).Start
    at /Users/joaks/go/src/github.com/golang/go/src/os/exec/exec.go:693
10  0x00000001026ff368 in main.(*client).start
    at ./server/main.go:105
11  0x00000001026fefa8 in main.spawn.func1
    at ./server/main.go:46
12  0x00000001026c3024 in runtime.goexit
    at /Users/joaks/go/src/github.com/golang/go/src/runtime/asm_arm64.s:1197

This behavior is flaky and in our investigation, only appears on Go1.21rc2 on darwin-arm64.

git bisect indicated this change to be the culprit.

What did you expect to see?

I would expect the program in the linked repro to not hang, as in Go 1.20.

What did you see instead?

It occasionally hangs, see above.

@ianlancetaylor
Copy link
Contributor

I don't understand the test case in the GitHub repository. It starts a bunch of processes with a call StdinPipe. Each process reads 10 bytes from standard input. But nothing ever writes to the pipe returned by StdinPipe. So each process hangs without exiting. So the main process waits forever for the subprocesses to exit.

The program also hangs when built with Go 1.20 or Go 1.19.

Are you sure that the GitHub repo is a correct example of the problem?

@JacobOaks
Copy link
Author

JacobOaks commented Jun 30, 2023

Are you sure that the GitHub repo is a correct example of the problem?

Apologies, you are correct. I had trimmed the repro to try to make it concise by removing the actual communication between the processes since the hanging I'm talking about happens before that anyways. But I ended up creating a bad example that always hangs. I have updated the code and the original issue description to reflect the correct repro.

@bcmills
Copy link
Contributor

bcmills commented Jun 30, 2023

Hrm. I wonder if this is related to the other darwin pipe hang we're seeing (the cluster listed in #54461 (comment)). 🤔

(Those failures are on darwin/amd64 rather than darwin/arm64 and had been occurring well before https://go.dev/cl/421441 was merged, but maybe that change merely increased the failure rate of an existing bug...)

@bcmills bcmills added OS-Darwin NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jun 30, 2023
@bcmills
Copy link
Contributor

bcmills commented Jun 30, 2023

Notably, that stack trace seems like a solid match for the one observed in #59892 (comment).

@bcmills
Copy link
Contributor

bcmills commented Jun 30, 2023

@ianlancetaylor, I think we need an exclusive lock on syscall.ForkLock on aix and darwin, where the pipes created by syscall.forkExecPipe do not already have the CLOEXEC bit set.
(https://cs.opensource.google/go/go/+/master:src/syscall/forkpipe.go)

On systems with Pipe2 it is safe to fork concurrently, but on systems without it we need to avoid the situation where one child process accidentally inherits one end or the other of a pipe intended for a different child.

@bcmills bcmills added this to the Go1.21 milestone Jun 30, 2023
@bcmills
Copy link
Contributor

bcmills commented Jun 30, 2023

Marking as release-blocker because this appears to be a regression in Go 1.21 on a first-class platform.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/507355 mentions this issue: syscall: serialize locks on ForkLock on platforms where forkExecPipe is not atomic

@bcmills bcmills added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 30, 2023
@gopherbot gopherbot removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 30, 2023
@bcmills bcmills self-assigned this Jun 30, 2023
@bcmills bcmills changed the title os/exec: syscall.forkExec hanging when spawning multiple processes concurrently w/ stdout/stdin pipes os/exec: syscall.forkExec hang when spawning multiple processes concurrently on darwin Jun 30, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/508755 mentions this issue: os/exec: ignore context.Canceled errors in TestConcurrentExec

gopherbot pushed a commit that referenced this issue Jul 11, 2023
We cancel the Context to unblock the test as soon as all of the "exit"
processes have completed. If that happens to occur before all of the
"hang" processes have started, the Start calls may fail with
context.Canceled.

Since those errors are possible in normal operation of the test,
ignore them.

Fixes #61277.
Updates #61080.

Change-Id: I20db083ec89ca88eb085ceb2892b9f87f83705ac
Reviewed-on: https://go-review.googlesource.com/c/go/+/508755
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
bradfitz pushed a commit to tailscale/go that referenced this issue Jul 15, 2023
…is not atomic

In CL 421441, we changed syscall to allow concurrent calls to
forkExec.

On platforms that support the pipe2 syscall that is the right
behavior, because pipe2 atomically opens the pipe with CLOEXEC already
set.

However, on platforms that do not support pipe2 (currently aix and
darwin), syscall.forkExecPipe is not atomic, and the pipes do not
initially have CLOEXEC set. If two calls to forkExec proceed
concurrently, a pipe intended for one child process can be
accidentally inherited by the other. If the process is long-lived, the
pipe can be held open unexpectedly and prevent the parent process from
reaching EOF reading the child's status from the pipe.

Fixes golang#61080.
Updates golang#23558.
Updates golang#54162.

Change-Id: I83edcc80674ff267a39d06260c5697c654ff5a4b
Reviewed-on: https://go-review.googlesource.com/c/go/+/507355
TryBot-Result: Gopher Robot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
bradfitz pushed a commit to tailscale/go that referenced this issue Jul 15, 2023
We cancel the Context to unblock the test as soon as all of the "exit"
processes have completed. If that happens to occur before all of the
"hang" processes have started, the Start calls may fail with
context.Canceled.

Since those errors are possible in normal operation of the test,
ignore them.

Fixes golang#61277.
Updates golang#61080.

Change-Id: I20db083ec89ca88eb085ceb2892b9f87f83705ac
Reviewed-on: https://go-review.googlesource.com/c/go/+/508755
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Bryan Mills <[email protected]>
Auto-Submit: Bryan Mills <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
@golang golang deleted a comment from gopherbot Apr 10, 2024
@golang golang locked and limited conversation to collaborators Apr 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done. OS-Darwin release-blocker
Projects
None yet
Development

No branches or pull requests

4 participants