Skip to content

Fix #4459 (cpp atexit callbacks) without segfault (#4500) #4505

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions include/pybind11/detail/internals.h
Original file line number Diff line number Diff line change
Expand Up @@ -535,7 +535,11 @@ struct local_internals {
loader_life_support_tls_key
= static_cast<shared_loader_life_support_data *>(ptr)->loader_life_support_tls_key;
}
#else
local_internals() = default;
#endif // defined(WITH_THREAD) && PYBIND11_INTERNALS_VERSION == 4
local_internals(const local_internals &) = delete;
local_internals &operator=(const local_internals &) = delete;
};

/// Works like `get_internals`, but for things which are locally registered.
Expand Down
20 changes: 16 additions & 4 deletions include/pybind11/embed.h
Original file line number Diff line number Diff line change
Expand Up @@ -254,13 +254,25 @@ inline void finalize_interpreter() {
if (builtins.contains(id) && isinstance<capsule>(builtins[id])) {
internals_ptr_ptr = capsule(builtins[id]);
}
// Local internals contains data managed by the current interpreter, so we must clear them to
// avoid undefined behaviors when initializing another interpreter
detail::get_local_internals().registered_types_cpp.clear();
detail::get_local_internals().registered_exception_translators.clear();

// We must clear this data after the interpreter is finalized but calling get_local_internals()
// is UB at that point, so he have to cache a references here.
auto &local_internals = detail::get_local_internals();

// Hack to materialize local internals from static method variable before Py_Finalize.
bool need_to_clear = !local_internals.registered_exception_translators.empty()
|| !local_internals.registered_exception_translators.empty();

Py_Finalize();

// This local data is needed during Py_Finalize() for callbacks or hooks such as atexit.
// Local internals contains data managed by the current interpreter, so we must clear them to
// avoid undefined behaviors when initializing another interpreter
if (need_to_clear) {
local_internals.registered_types_cpp.clear();
local_internals.registered_exception_translators.clear();
}

if (internals_ptr_ptr) {
delete *internals_ptr_ptr;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rwgk It just occurred to me, that the reason deleting the internals_ptr_ptr here works is that it effectively leaks the containers in internals since it just deallocates them without calling the dtors. This shouldn't be working here because if the dtors did run, then the capsule dtor for builtins object would be called which requires the CPython API which would segfault. What this is actually doing is just leaking the objects. If we also changed the behavior to improperly swap out pointers to the container, that would work, but it's not a longterm fix.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

effectively leaks the containers in internals since it just deallocates them without calling the dtors

Yes, I was aware of that. Just to confirm.

Some background that may or may not matter here, sharing JIC:

Until Python 3.9 or 3.10 (not sure anymore) Python itself was leaking (https://github.com/rwgk/stuff/blob/f6549ecadb46169feb5f3c2456dcd13852382f7a/noddy/embedded_noddy_main_run.c), but with one of those versions that was cleaned up.

Before the cleanup in Python I was thinking: well, there is no point in being clean in pybind11. Therefore I wasn't concerned about any internals leaks.

That thinking has changed now. But who has the free energy to cleanup pybind11? And is it even doable without an ABI break? I don't know.

To come back to the problem here: I wouldn't worry about leaks, still. We "just" need to satisfy the situations of #3744 and #4500 simultaneously, with or without leaks.

Then worry about the leaks later.

Concretely: could we simply replicate the ptr_ptr tricked used by (non-local) internals? I.e. leak the local internals, too?

Copy link
Collaborator Author

@Skylion007 Skylion007 Feb 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will need an ABI break at some point.

Concretely: could we simply replicate the ptr_ptr tricked used by (non-local) internals? I.e. leak the local internals, too?

Probably? wouldn't that be an ABI break in itself though?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably? wouldn't that be an ABI break in itself though?

I don't think so.

@lalaland for a 2nd opinion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can change local internals without an ABI break?

I think the bigger issue is that it might be an API break, since adding leaking behavior is explicitly changing the behavior of existing code. And in memory tight situations, that can be quite bad.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bigger issue is that it might be an API break, since adding leaking behavior is explicitly changing the behavior of existing code. And in memory tight situations, that can be quite bad.

Absolutely, it's a rock-and-a-hard place situation.

Ideally someone rolls up their sleeves and wrestles it down properly with whatever it takes.

I cannot do that: embedding has no critical use at Google. The only reason I'm looking is general pybind11 health.

So this here boils down to: what do we want to do given very limited resources?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would vote for leaking now to fix the segfaults as a quick patch, but actually try to think of a way to eliminate the leak as a more long term project.

There has got to be some way of fixing this in the long term.

*internals_ptr_ptr = nullptr;
Expand Down
3 changes: 2 additions & 1 deletion tests/test_embed/catch.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,8 @@ int main(int argc, char *argv[]) {
#else
setenv("PYTHONPATH", updated_pythonpath.c_str(), /*replace=*/1);
#endif

// Below is a test case for a potential segfault. See PR #4500.
{ py::scoped_interpreter guard; }
py::scoped_interpreter guard{};

auto result = Catch::Session().run(argc, argv);
Expand Down