Foundation.Process on Linux doesn't correctly detect when child process dies (creating zombie processes) #4795

Open
weissi opened this issue Jul 12, 2023 · 2 comments

weissi commented Jul 12, 2023

Description

Foundation.Process on Linux uses a trick (that doesn't actually work...) to detect whether the child process has exited: it lets the child inherit one end of a socketpair and expects that socket to be closed when the child exits. In simple scenarios that holds, but UNIX by default passes all open file descriptors on to child processes. That means that if the subprocess itself spawns another process, the special socket is inherited by that process (the child's child) as well.

That's a huge issue, because the parent process can then no longer detect that the child has died: the child's child still holds that file descriptor open...
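The descriptor semantics behind this can be sketched in a single process. This is only an illustration, not Foundation's actual implementation: dup() stands in for the copy that would be inherited by the child's child, and the names peerHasClosed, parentEnd, childEnd and inheritedCopy are made up for the example.

```swift
#if canImport(Glibc)
import Glibc
let streamType = Int32(SOCK_STREAM.rawValue)  // Glibc imports SOCK_STREAM as an enum
#else
import Darwin
let streamType = SOCK_STREAM
#endif

/// Returns true once every copy of the peer end has been closed,
/// i.e. once read() on `fd` reports EOF. No data is ever written here,
/// so "readable" can only mean EOF.
func peerHasClosed(_ fd: Int32) -> Bool {
    var pfd = pollfd(fd: fd, events: Int16(POLLIN), revents: 0)
    guard poll(&pfd, 1, 0) > 0 else { return false }  // nothing pending: peer still open
    var byte: UInt8 = 0
    return read(fd, &byte, 1) == 0                    // 0 bytes read means EOF
}

var fds: [Int32] = [-1, -1]
precondition(socketpair(AF_UNIX, streamType, 0, &fds) == 0)
let parentEnd = fds[0]
let childEnd = fds[1]

// Assumption for illustration: this duplicate plays the role of the
// descriptor that the child's child inherits.
let inheritedCopy = dup(childEnd)
precondition(inheritedCopy >= 0)

close(childEnd)                   // the "child" exits, its copy goes away
print(peerHasClosed(parentEnd))   // false: another copy is still open

close(inheritedCopy)              // the "child's child" finally exits
print(peerHasClosed(parentEnd))   // true: only now does the parent see EOF
```

The parent's end only reports EOF after the last copy is closed, which is why a watcher built on that socket keeps waiting for as long as any descendant holds the descriptor.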

Attached, please find a reproduction which does the following:

The parent process spawns /bin/sh as its child process. That child spawns another process (the child's child), which runs sleep 12345678, a very, very long sleep. After one second, the parent kills the child with SIGKILL, so the child exits immediately. Then the parent calls process.waitUntilExit(), which should return immediately (because the child is dead). Alas, Foundation.Process does not realise that the child is dead, because that special socketpair has also been inherited by the child's child (and any further subprocesses)...

Expected behaviour (observed on Darwin)

$ swift /tmp/process_bug_repro.swift
[in       parent: 11427] start subprocess 'child'
[in       parent: 11427] waiting 1 second (for child with pid 11428)
[in        child: 11428] start subprocess 'childs child'
[in        child: 11428] waiting for childs child (with pid 11429)
[in childs child: 11429] start
[in       parent: 11427] kill SIGKILL child with pid 11428)
[in       parent: 11427] kill successful
[in       parent: 11427] waiting for child with pid 11428 to exit
[in       parent: 11427] done

Actual behaviour (observed on Linux, Swift 5.8)

[in       parent: 13] start subprocess 'child'
[in       parent: 13] waiting 1 second (for child with pid 35)
[in        child: 35] start subprocess 'childs child'
[in        child: 35] waiting for childs child (with pid 36)
[in childs child: 36] start
[in       parent: 13] kill SIGKILL child with pid 35)
[in       parent: 13] kill successful
[in       parent: 13] waiting for child with pid 35 to exit
[in       parent: 13] WEIRD (THIS IS THE BUG), still waiting at 2023-07-12 14:23:08 +0000. Running ps uw -p 13 -p 35 -p 36
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        13 11.3  4.3 591104 175580 pts/0   Sl+  14:23   0:00 /usr/bin/swift-frontend -frontend -interpret process_bug_repro.swif
root        35  0.0  0.0      0     0 pts/0    Z    14:23   0:00 [sh] <defunct>      <<--- JW: THIS IS THE CHILD THAT's a zombie now
root        36  0.0  0.0   2308   832 pts/0    S    14:23   0:00 /bin/sh -c echo "[in childs child: $$] start"; sleep 12345678; echo
[in       parent: 13] WEIRD (THIS IS THE BUG), still waiting at 2023-07-12 14:23:13 +0000. Running ps uw -p 13 -p 35 -p 36
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root        13  8.3  4.3 664932 175584 pts/0   Sl+  14:23   0:00 /usr/bin/swift-frontend -frontend -interpret process_bug_repro.swift -Xllvm -aarch64-use-tbi -disable-objc-interop
root        35  0.0  0.0      0     0 pts/0    Z    14:23   0:00 [sh] <defunct>
root        36  0.0  0.0   2308   832 pts/0    S    14:23   0:00 /bin/sh -c echo "[in childs child: $$] start"; sleep 12345678; echo "[in childs child: $$] done"
[...] output continues "forever"

Fix

The special socketpair has two issues:

  1. As demonstrated above, it can lead to false negatives (because the fd gets inherited further)
  2. It can also lead to false positives (because the child process could close all its file descriptors, making Foundation.Process think that the child has exited when it hasn't)

To fix both of these, Foundation.Process should use either pidfd_open or a signalfd on SIGCHLD to get an epollable notification when the child process dies.

weissi commented Jul 12, 2023

  • single-file repro: process_bug_repro.swift (in a zip for GitHub: process_bug_repro.zip)
  • for Apple folk: rdar://112137804

import Foundation
import Dispatch

func makePSLoop(interestingPids: [CInt]) -> DispatchSourceTimer {
    let q = DispatchQueue(label: "offload")
    let timer = DispatchSource.makeTimerSource(queue: q)
    timer.setEventHandler {
        let p = Process()
        p.executableURL = URL(fileURLWithPath: "/bin/ps")
        let args = ["uw"] + interestingPids.flatMap { ["-p", "\($0)" ] }
        print("[in       parent: \(getpid())] WEIRD (THIS IS THE BUG), still waiting at \(Date()). Running ps \(args.joined(separator: " "))")
        p.arguments = args
        try? p.run()
        p.waitUntilExit()
    }
    timer.schedule(deadline: .now() + 5, repeating: 5)
    return timer
}

let p = Process()
p.executableURL = URL(fileURLWithPath: "/bin/sh")
p.arguments = [
    "-c",
    """
    echo "[in        child: $$] start subprocess 'childs child'"
    /bin/sh -c 'echo "[in childs child: $$] start"; sleep 12345678; echo "[in childs child: $$] done"' &
    child_child_pid=$!
    echo "[in        child: $$] waiting for childs child (with pid $child_child_pid)"
    wait
    echo "[in        child: $$] done"
    """
]
print("[in       parent: \(getpid())] start subprocess 'child'")
fflush(stdout)
try p.run()
print("[in       parent: \(getpid())] waiting 1 second (for child with pid \(p.processIdentifier))")
fflush(stdout)
sleep(1)
print("[in       parent: \(getpid())] kill SIGKILL child with pid \(p.processIdentifier))")
let err = kill(p.processIdentifier, SIGKILL)
print("[in       parent: \(getpid())] kill \(err == 0 ? "successful" : "failed (\(errno))")")
print("[in       parent: \(getpid())] waiting for child with pid \(p.processIdentifier) to exit")
fflush(stdout)

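// The parent does not know the PID of the child's child; pid + 1 is only a
// guess, which happens to match the runs shown above.
let printPSLoop = makePSLoop(interestingPids: [getpid(), p.processIdentifier, p.processIdentifier + 1])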
printPSLoop.resume()
p.waitUntilExit()
print("[in       parent: \(getpid())] done")
fflush(stdout)
printPSLoop.cancel()

weissi commented Sep 18, 2024

Still happens in 6.0 with swift-foundation.

iCharlesHu self-assigned this Sep 18, 2024
MaxDesiatov added a commit to swiftlang/swift-sdk-generator that referenced this issue Nov 27, 2024
Namely, this works around following issues in `Foundation.Process`:
 - "Foundation.Process on Linux throws error Error Domain=NSCocoaErrorDomain Code=256 "(null)" if executable not found"
   swiftlang/swift-corelibs-foundation#4810
 - "Foundation.Process on Linux doesn't correctly detect when child process dies (creating zombie processes)"
   swiftlang/swift-corelibs-foundation#4795
 - "Foundation.Process on Linux seems to inherit the Process.run()-calling thread's signal mask, even SIGTERM blocked"
   swiftlang/swift-corelibs-foundation#4772