Skip to content

io: Copy to a pipe prevents process exit (Go 1.24rc2 on Linux regression) #71375

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kolyshkin opened this issue Jan 22, 2025 · 19 comments
Closed
Assignees
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@kolyshkin
Copy link
Contributor

Go version

go version go1.24rc2 linux/amd64

Output of go env in your module/workspace:

AR='ar'
CC='gcc'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_ENABLED='1'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
CXX='g++'
GCCGO='gccgo'
GO111MODULE=''
GOAMD64='v1'
GOARCH='amd64'
GOAUTH='netrc'
GOBIN=''
GOCACHE='/home/kir/.cache/go-build'
GOCACHEPROG=''
GODEBUG=''
GOENV='/home/kir/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFIPS140='off'
GOFLAGS=''
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build957638587=/tmp/go-build -gno-record-gcc-switches'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMOD='/dev/null'
GOMODCACHE='/home/kir/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/kir/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/kir/sdk/go1.24rc2'
GOSUMDB='sum.golang.org'
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/kir/.config/go/telemetry'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/kir/sdk/go1.24rc2/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.24rc2'
GOWORK=''
PKG_CONFIG='pkg-config'

What did you do?

There is a regression in Go 1.24rc2 (git-bisect points to 1) caused by a (not yet confirmed) Linux kernel bug with sendmail(2)/splice(2), which I just reported 2.

In short, when sendfile(2) or splice(2) is used to copy data to a pipe, and another process is having the other end of this file, this prevents that other process from exit. You can get a short C repro from 2, and here's a Go repro:

package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	// Create a pipe for stdin
	r, w, err := os.Pipe()
	if err != nil {
		log.Fatal(err)
	}

	// Create a short-lived process (like ps) with stdin connected to pipe
	cmd := exec.Command("ps")
	cmd.Stdin = r
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	log.Print("Run child pid=", cmd.Process.Pid)

	// Close read end in parent.
	r.Close()

	// Copy from stdin to pipe - this should trigger sendfile in Go 1.24.
	go func() {
		_, err = io.Copy(w, os.Stdin)
		if err != nil {
			log.Fatal("io.Copy: ", err)
		}
	}()

	if err := cmd.Wait(); err != nil {
		log.Fatal("wait: ", err)
	}
}

What did you see happen?

When the above repro is run with go1.24rc2, the process is not exiting:

[kir@kir-tp1 sendfile-vs-pipe]$ go1.24rc2 run repro.go 
2025/01/21 18:32:54 Run child pid=2180908
    PID TTY          TIME CMD
  63304 pts/1    00:00:03 bash
2180738 pts/1    00:00:00 go1.24rc2
2180747 pts/1    00:00:00 go
2180902 pts/1    00:00:00 repro
2180908 pts/1    00:00:00 ps

(and the process hangs here).

NOTE if you can't repro this, you probably pressed Enter in the terminal. Do not do anything with this terminal!.

In a different terminal, you can check the status of the child process (using the pid from the above output):

[kir@kir-tp1 linux]$ ps -f -p 2180908
UID          PID    PPID  C STIME TTY          TIME CMD
kir      2180908 2180902  0 18:32 pts/1    00:00:00 [ps]

As you can see, ps thinks that the process is a kernel thread. This happens because the process is half-exited and some /proc/$PID/ entries (root, cwd, `exe) are no longer valid, like it is with kernel threads.

Now, to see where the process is stuck:

[kir@kir-tp1 linux]$ sudo cat /proc/2180908/stack
[<0>] pipe_release+0x1f/0x100
[<0>] __fput+0xde/0x2a0
[<0>] task_work_run+0x59/0x90
[<0>] do_exit+0x309/0xab0
[<0>] do_group_exit+0x30/0x80
[<0>] __x64_sys_exit_group+0x18/0x20
[<0>] x64_sys_call+0x14b4/0x14c0
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

What did you expect to see?

No stuck process.

With an older Go version, everything works as it should:

[kir@kir-tp1 sendfile-vs-pipe]$ go1.23.4 run repro.go 
2025/01/21 18:38:16 Run child pid=2181265
    PID TTY          TIME CMD
  63304 pts/1    00:00:03 bash
2181065 pts/1    00:00:00 go1.23.4
2181071 pts/1    00:00:00 go
2181259 pts/1    00:00:00 repro
2181265 pts/1    00:00:00 ps
[kir@kir-tp1 sendfile-vs-pipe]$ 
@kolyshkin
Copy link
Contributor Author

First, let's wait for kernel folks to confirm this.

Thinking about a workaround, my best solution for now is to not use sendfile(2) on Linux when writing to a pipe.

@kolyshkin kolyshkin changed the title io: Copy to a pipe prevents process exit (Go 1.24rc2 regression) io: Copy to a pipe prevents process exit (Go 1.24rc2 on Linux regression) Jan 22, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Jan 22, 2025
@prattmic
Copy link
Member

cc @golang/runtime

@prattmic prattmic added this to the Go1.24 milestone Jan 22, 2025
@prattmic
Copy link
Member

Thanks for testing the RC and the great bug report!

I am also curious to see what the kernel folks have to say. My immediate thought was that exit_group should never block indefinitely and anything causing it to do so is clearly a kernel bug. That said, man 2 _exit says "_exit() does close open file descriptors, and this may cause an unknown delay, waiting for pending output to finish." So that does not agree with my initial interpretation.

Regardless, even if it is considered a kernel bug, we'll have to work around it for older kernels. It seems that the obvious options are:

  1. Revert https://go.dev/cl/603295.
  2. Add a special exception to avoid using sendfile just with pipe FDs.

Given how late we are in the release freeze, I'd lean towards (1) as the less risky option, and then we could consider applying a new version of the CL for 1.25.

cc @panjf2000

@mknyszek mknyszek added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 22, 2025
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 22, 2025
@panjf2000
Copy link
Member

I'll work on this.

@JoesSon72

This comment has been minimized.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/644015 mentions this issue: os: don't use sendfile to a pipe on Linux

@prattmic
Copy link
Member

I spent more time thinking about this today.

Thanks @kolyshkin for sending the fix CL https://go.dev/cl/644015. The fix seems fairly straightforward and not too concerning on its own, though I am concerned that we might find more (non-pipe) edge cases that break sendfile. That makes me continue to lean towards a revert, though I acknowledge that this bug took nearly 6mo to surface, so waiting another 6mo isn't guaranteed to uncover more bugs.

Looking at the rationale of the original CL, if I understand correctly, most calls will use copy_file_range and never reach sendfile. We will use the sendfile fallback for:

  • Linux <5.3, when copy_file_range was more restricted.
  • Linux >=5.3, on remaining failure modes of copy_file_range, which include:
    • Either file is not a regular file (so pipe, socket, etc).
    • Synthetic file systems missing file sizes (procfs, sysfs, etc). Edit: :O this bug is fixed in 5.19!
    • Various filesystems that explicitly don't support copy_file_range.

It seems that pipe and socket copies would be the most valuable case here. We have to drop pipes, so that leaves sockets. Note that this change is only for os.File. net already explicitly uses sendfile. Putting a socket into os.File is probably not terribly common (UDS on the filesystem I'd guess is the most common case).

All that said, I think I am in agreement that this feature is nice to have, but not critical enough to warrant a last minute fix, so revert still feels best to me.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/644895 mentions this issue: Revert "os: employ sendfile(2) for file-to-file copying on Linux when needed"

@seankhliao
Copy link
Member

#71459 was a different report of blocking, i think on a pty?

@kolyshkin
Copy link
Contributor Author

#71459 was a different report of blocking, i think on a pty?

It's the same under the hood as the reproducer in this issue description, with io.Copy from a terminal to a pipe.

Either https://go.dev/cl/644015 or https://go.dev/cl/644895 fixes it.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/644935 mentions this issue: os: employ sendfile(2) only when the target is regular file on Linux

@panjf2000
Copy link
Member

I spent more time thinking about this today.

Thanks @kolyshkin for sending the fix CL go.dev/cl/644015. The fix seems fairly straightforward and not too concerning on its own, though I am concerned that we might find more (non-pipe) edge cases that break sendfile. That makes me continue to lean towards a revert, though I acknowledge that this bug took nearly 6mo to surface, so waiting another 6mo isn't guaranteed to uncover more bugs.

Looking at the rationale of the original CL, if I understand correctly, most calls will use copy_file_range and never reach sendfile. We will use the sendfile fallback for:

  • Linux <5.3, when copy_file_range was more restricted.

  • Linux >=5.3, on remaining failure modes of copy_file_range, which include:

    • Either file is not a regular file (so pipe, socket, etc).
    • Synthetic file systems missing file sizes (procfs, sysfs, etc). Edit: :O this bug is fixed in 5.19!
    • Various filesystems that explicitly don't support copy_file_range.

It seems that pipe and socket copies would be the most valuable case here. We have to drop pipes, so that leaves sockets. Note that this change is only for os.File. net already explicitly uses sendfile. Putting a socket into os.File is probably not terribly common (UDS on the filesystem I'd guess is the most common case).

All that said, I think I am in agreement that this feature is nice to have, but not critical enough to warrant a last minute fix, so revert still feels best to me.

Actually, CL 603295 was intended to handle the case of data transmission between regular files originally, the real "nice to have" case is pipe, I'd suggest we just drop anything other than regular files, like we did for Solaris: CL 605355, CL 606135. @prattmic

cc @ianlancetaylor

@prattmic
Copy link
Member

Huh, interesting. What is the common case where we expect copy_file_range to fail for regular files and thus fall back to sendfile?

Is it cross-filesystem copies? Looking at the implementation again, it sounds like that despite cross-filesystem copy is theoretically allowed, very few filesystems opt-in: https://elixir.bootlin.com/linux/v6.12.11/source/fs/read_write.c#L1520

@panjf2000
Copy link
Member

Huh, interesting. What is the common case where we expect copy_file_range to fail for regular files and thus fall back to sendfile?

  1. Kernel < 5.3
  2. Some special cases even when the kernel is 5.3+

As a matter of fact, copy_file_range(2) has a messy history: https://lwn.net/Articles/846403/. Even today, this system call is hardly perfect, thus it's pragmatic to have sendfile(2) as its fallback.

@prattmic
Copy link
Member

Given how close to the release we are, let's go with a revert to be safe: https://go.dev/cl/644895.

I'd appreciate a look. The revert was messy in the test files due to the refactoring done for the Solaris and FreeBSD equivalent CLs, and I didn't want to revert those as well since they seem to be fine.

The good news is that we plan to reopen the tree for 1.25 development this week or next week, so we can try again shortly.

@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Jan 29, 2025
@prattmic
Copy link
Member

@kolyshkin The reproducer you provided is good with the revert. I'd appreciate if you could test runc itself at Go tip and make sure all seems well.

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Jan 29, 2025
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Jan 31, 2025

@kolyshkin The reproducer you provided is good with the revert. I'd appreciate if you could test runc itself at Go tip and make sure all seems well.

@prattmic I've tested a recent git master (at commit 37f27fb) locally and it works fine, thanks!

EDIT: fix commit id

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/646415 mentions this issue: os: employ sendfile(2) only when the target is regular file on Linux

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
Development

No branches or pull requests

9 participants