reattach: don't kill process on failed reconnection #320

tgross · 2024-10-18T20:08:14Z

During reattachment, we look to see if the process corresponding to the stored PID is running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working, and return ErrProcessNotFound.

But during reattachment we don't know that the PID we have is still valid: the process we're trying to attach to may have exited, and a different process may have spawned with the same PID. The result is that some unrelated process gets silently killed.

This impacts Nomad when running the `rawexec` or `exec` task drivers, because the Nomad agent spawns an "executor" process via go-plugin to control the workloads, and these executors are left running when Nomad exits. If the executors die in the meantime (or the host is rebooted), then we can potentially kill a random process on the host.

Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe. Remove it.

Ref: hashicorp/nomad#23969
Ref: https://hashicorp.atlassian.net/browse/NET-11233

tgross · 2024-10-21T13:32:35Z

Rebased on main to pick up your CI changes.

peter-harmann-tfs · 2024-10-22T12:15:33Z

@tgross Is there ever a chance the process really is a Nomad executor that is not working? Could this cause these processes to leak and keep running forever?

> Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe.

Also storing the process creation time, and killing only if it matches, should solve this, as PID + process creation time together should be unique.
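
A rough sketch of that idea on Linux, where the start time is field 22 of `/proc/<pid>/stat`. The `ProcessIdentity` type and the function names are hypothetical and not part of go-plugin. One caveat: the kernel reports this value in clock ticks since boot, so a check that must survive a host reboot would probably need to store something like the boot ID as well.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// ProcessIdentity (hypothetical) is what would be persisted alongside the
// reattach information instead of the bare PID.
type ProcessIdentity struct {
	PID       int
	StartTime uint64 // field 22 of /proc/<pid>/stat, in clock ticks since boot
}

// parseStartTime extracts the starttime field from the contents of
// /proc/<pid>/stat. The command name (field 2) is wrapped in parentheses
// and may itself contain spaces or ')', so parsing starts after the last ')'.
func parseStartTime(stat string) (uint64, error) {
	idx := strings.LastIndexByte(stat, ')')
	if idx < 0 {
		return 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(stat[idx+1:])
	// fields[0] is field 3 (state), so field 22 is at index 19.
	if len(fields) < 20 {
		return 0, fmt.Errorf("stat line has too few fields")
	}
	return strconv.ParseUint(fields[19], 10, 64)
}

// currentIdentity reads the identity of whatever process holds pid now.
func currentIdentity(pid int) (ProcessIdentity, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return ProcessIdentity{}, err
	}
	start, err := parseStartTime(string(data))
	if err != nil {
		return ProcessIdentity{}, err
	}
	return ProcessIdentity{PID: pid, StartTime: start}, nil
}

// sameProcess reports whether the PID is still held by the process that
// was recorded. Only then would a kill be considered.
func sameProcess(stored ProcessIdentity) bool {
	cur, err := currentIdentity(stored.PID)
	return err == nil && cur.StartTime == stored.StartTime
}

func main() {
	stored, err := currentIdentity(os.Getpid())
	if err != nil {
		fmt.Println("could not read process identity:", err)
		return
	}
	fmt.Println("still the same process:", sameProcess(stored))
}
```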

tgross · 2024-10-22T12:41:24Z

> Is there ever a chance the process really is a nomad executor that is not working? Could this cause these processes to leak and keep running forever?

@peter-harmann-tfs we had an internal chat about that case, and the consensus was that if there were a bug such that the plugin process was live but unable to accept connections, you'd want to know about it so it can be fixed, rather than silently killing the process and having that bug blow up in your face later.

tgross mentioned this pull request Oct 18, 2024

process not managed by Nomad killed on restart hashicorp/nomad#23969

Closed

tgross added the bug label Oct 18, 2024

tgross requested review from gulducat, schmichael and shoenig October 18, 2024 20:12

schmichael approved these changes Oct 18, 2024

tgross force-pushed the no-kill-on-failed-reconnect branch from 8bcfca2 to c1eccbd Compare October 21, 2024 13:32

shoenig approved these changes Oct 21, 2024

tgross merged commit df94fce into main Oct 21, 2024
3 checks passed
