Skip to content

Conversation

@tgross
Copy link
Member

@tgross tgross commented Oct 18, 2024

During reattachment, we look to see if the process corresponding to the stored PID is running. If so, we try to connect to that process. If that fails, we kill the process under the presumption it's not working, and return ErrProcessNotFound.

But during reattachment we don't know that the PID we have is still valid. Which means that the process we're trying to attach to may have exited and a different process has spawned with the same PID. This results in some unrelated process getting silently killed.

This impacts Nomad when running the rawexec or exec task drivers, because the Nomad agent spawns an "executor" process via go-plugin to control the workloads, and these executors are left running when Nomad exits. If the executors die in the meantime (or the host is rebooted), then we can potentially kill a random process on the host.

Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe. Remove it.

Ref: hashicorp/nomad#23969
Ref: https://hashicorp.atlassian.net/browse/NET-11233

schmichael
schmichael approved these changes Oct 18, 2024
During reattachment, we look to see if the process corresponding to the stored
PID is running. If so, we try to connect to that process. If that fails, we kill
the process under the presumption it's not working, and return
ErrProcessNotFound.

But during reattachment we don't know that the PID we have is still valid. Which
means that the process we're trying to attach to may have exited and a different
process has spawned with the same PID. This results in some unrelated process
getting silently killed.

This impacts Nomad when running the `rawexec` or `exec` task drivers, because
the Nomad agent spawns an "executor" process via go-plugin to control the
workloads, and these executors are left running when Nomad exits. If the
executors die in the meantime (or the host is rebooted), then we can potentially
kill a random process on the host.

Because there's no way for go-plugin to know whether the process is a go-plugin
server without connecting, this kill is never really safe. Remove it.

Ref: hashicorp/nomad#23969
Ref: https://hashicorp.atlassian.net/browse/NET-11233
@tgross tgross force-pushed the no-kill-on-failed-reconnect branch from 8bcfca2 to c1eccbd Compare October 21, 2024 13:32
@tgross
Copy link
Member Author

tgross commented Oct 21, 2024

Rebased on main to pick up your CI changes.

@tgross tgross merged commit df94fce into main Oct 21, 2024
3 checks passed
@peter-harmann-tfs
Copy link

peter-harmann-tfs commented Oct 22, 2024

@tgross Is there ever a chance the process really is a nomad executor that is not working? Could this cause these processes to leak and keep running forever?

Because there's no way for go-plugin to know whether the process is a go-plugin server without connecting, this kill is never really safe.

Storing also the process creation time and killing only if it matches should solve this, as PID + Process creation time together should be unique.

@tgross
Copy link
Member Author

tgross commented Oct 22, 2024

Is there ever a chance the process really is a nomad executor that is not working? Could this cause these processes to leak and keep running forever?

@peter-harmann-tfs we had an internal chat about that case and the consensus was that if there was a bug such that the plugin process was live but not able to allow connections, you'd want to know that so it can be fixed rather than silently kill the process and have that bug blow up in your face later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants