Skip to content

get_type_overload may encounter false cache hits if derived instances are GC'd #1922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
EricCousineau-TRI opened this issue Sep 12, 2019 · 6 comments

Comments

@EricCousineau-TRI
Copy link
Collaborator

EricCousineau-TRI commented Sep 12, 2019

Issue description

This is the upstream form of RobotLocomotion/drake#11424 that @BetsyMcPhail root-caused in this discussion: RobotLocomotion/drake#11424 (comment)

High-level overview and proposed solution

Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups.

The non-overload cache is a set which uses (type.ptr, function name) as a key. Once an item is added to the cache, it is never removed.

The bug occurs when a type that has been added to the cache (e.g type = <class 'custom_test.TestCustom.test_deprecated_protected_aliases..OldSystem'>
type.ptr() = 0x2746c58) is destroyed and a new type is created with the exact same pointer (e.g. type = <class 'custom_test.TestCustom.test_leaf_system_overrides..TrivialSystem'>, type.ptr() = 0x2746c58) that should not be in the cache.

The first attempt at fixing the issue was to use self.ptr instead of type.ptr in the cache key. Testing demonstrated that this was not enough to solve the issue.

In the end, removing all associated cached items when an instance is deregistered seems to fix the issue.

She submitted a potential fix for it in our fork here:
RobotLocomotion#32

Reproducible example code

I've done a quick minimal repro based on master @ 12e8774 here:
EricCousineau-TRI@9726086

Extra keyword: garbage collect virtual inheritance polymorphism

@BetsyMcPhail
Copy link

cc @mathstuf

@wjakob
Copy link
Member

wjakob commented Sep 19, 2019

Hi @EricCousineau-TRI,

oh oh, that sounds like it was very painful to track down :(. That's a very subtle issue indeed and something that was not considered during development. Unfortunately haven't bee able to come up with nice way of resolving the problem.

One option of course would be to remove the caching mechanism. That's a no-go because it will route every virtual function call through Python (& the GIL). Some C++ project issue vast amounts of these, potentially in parallel from multiple threads. Such a change would essentially break these projects.

Ideally we'd be able to run a piece of code when a Python type is GCed (which extends a pybind11 type that involves trampolines). Then we could just clear the cache at that point, or a portion of the cache with entries related to that type. I was thinking that the right place to do do this is to override the __del__ functor of pybind11_object, which is the metaclass of pybind11 classes.

But I can't seem to catch any of these events (I tried overriding tp_del and tp_finalize):

diff --git a/include/pybind11/detail/class.h b/include/pybind11/detail/class.h
index ffdfefe..436135c 100644
--- a/include/pybind11/detail/class.h
+++ b/include/pybind11/detail/class.h
@@ -156,6 +156,10 @@ extern "C" inline PyObject *pybind11_meta_getattro(PyObject *obj, PyObject *name
 }
 #endif
 
+extern "C" inline void pybind11_finalize(PyObject* obj) {
+    printf("Finalizing %p\n", obj);
+}
+
 /** This metaclass is assigned by default to all pybind11 types and is required in order
     for static properties to function correctly. Users may override this using `py::metaclass`.
     Return value: New reference. */
@@ -186,6 +190,9 @@ inline PyTypeObject* make_default_metaclass() {
     type->tp_getattro = pybind11_meta_getattro;
 #endif
 
+    type->tp_finalize = pybind11_finalize;
+    type->tp_del = pybind11_finalize;
+
     if (PyType_Ready(type) < 0)
         pybind11_fail("make_default_metaclass(): failure in PyType_Ready()!");

The above patch should cause some text to be printed when your ExampleVirt2-derived instance is GCed, but it doesn't work at least on my machine.

Unfortunately I've already spent quite a bit of time debugging this and I probably won't be able to dedicate more cycles any time soon (upcoming grant deadline..)

Best,
Wenzel

@wjakob
Copy link
Member

wjakob commented Sep 19, 2019

It's not 100% clear to me what is going on in #1923, but the solution will probably require similar ingredients to the above.

@EricCousineau-TRI
Copy link
Collaborator Author

Thank you for checking on this!
Our attempted fix seemed to break things (comment), so we may end up visiting this solution if we can figure out how to correctly trigger the __del__ equivalent on the type like you started doing.

@BetsyMcPhail
Copy link

A different approach to fixing this has been submitted to the drake fork. See RobotLocomotion#37

@EricCousineau-TRI
Copy link
Collaborator Author

As @YannickJadoul suggested and I as I tested, this reproduction case (EricCousineau-TRI/pybind11@9726086) was fixed by #2564

For more details, see: #2772 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants