Iterators efficiency and instance tracking #376
Hi Ivan, I agree that there is a lot of fluff. I'm curious if you've faced performance issues due to this? The untracked policies you suggest sound potentially dangerous (though maybe I'm just missing something). Python is free to do whatever it wants with the resulting instances (e.g. stick them into some data structure that persists). How can you ensure correct lifetime management while avoiding leaks in this case?
@wjakob Yes, the only reason I raised this is that I've hit serious performance issues. I have a C++ codebase that parses network files and streams parsed data structures (on the order of tens of millions), with the average time spent per packet around 200ns. When wrapping it in a make_iterator (with the default reference_internal policy), the time per packet rises to around 550-600ns; setting the rvp to copy pushes it further to 750-800ns. All of this extra time is pybind11 overhead, which in this case seems absolutely redundant. I did a bit of profiling; here's an example of the top offenders (compiled with something like -O2 -pg):
(pprof) top40
Total: 48298 samples
2815 5.8% 5.8% 2820 5.8% _int_malloc malloc.c:0
2480 5.1% 11.0% 2480 5.1% std::_Hash_bytes ??:0
2301 4.8% 15.7% 6467 13.4% std::_Hashtable::find :0
1750 3.6% 19.4% 1753 3.6% _int_free malloc.c:0
1549 3.2% 22.6% 1549 3.2% std::_Hashtable::_M_find_before_node [clone .isra.1086] [clone .lto_priv.2808] :0
1499 3.1% 25.7% 2519 5.2% std::_Hashtable::equal_range :0
1408 2.9% 28.6% 2115 4.4% _mcount ??:0
1369 2.8% 31.4% 2131 4.4% std::vector::vector :0
1215 2.5% 33.9% 1806 3.7% std::_Hashtable::_M_insert_multi_node :0
1119 2.3% 36.2% 3939 8.2% __GI___libc_malloc :0
1054 2.2% 38.4% 1054 2.2% std::_Hashtable::_M_find_before_node [clone .isra.1321] [clone .lto_priv.1320] :0
1020 2.1% 40.5% 1020 2.1% __memcpy_sse2 :0
1002 2.1% 42.6% 10985 22.7% StreamAdaptor::adaptor::fetch :0 # <-- C++ iterator "next()"
899 1.9% 44.5% 1322 2.7% std::_Hashtable::erase :0
829 1.7% 46.2% 33459 69.3% pybind11::cpp_function::initialize::{lambda#3}::operator [clone .constprop.848] :0
798 1.7% 47.8% 3763 7.8% cache_next_packet [clone .constprop.1239] :0 # C++ iterator
792 1.6% 49.5% 2804 5.8% __GI__IO_fread :0

So, out of the top 50% of cumulative time, roughly 25% (half) is spent directly in hash maps, and it's actually more than that, since a lot of the mallocs, frees and memcpys are triggered by hash maps as well. Hash tables are fast, but not that fast... Multimaps are even worse.
Re: the untracked data structures, we still attach a deallocator to them though (with the difference being that it wouldn't try to erase from the hash map)? Could you provide a hypothetical example of the leak you were talking about, so we're on the same page?
Re: caching type_info, wouldn't it be natural for py::cpp_function to cache it?
Here's a branch with an (experimental) return type caching implementation: it passes all the tests, and is indeed being used as intended (it avoids the hash table type lookup for 62/140 return values in the test suite). See how it affects your profiling results; if it looks promising, I'll submit it as a PR. It caches the looked-up type per bound function.
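Not the branch's actual code, but the idea reduced to a self-contained toy, where registry and lookup_int are stand-ins for pybind11's internal type map and a single bound function:

#include <cassert>
#include <string>
#include <typeindex>
#include <unordered_map>

// Stand-in for pybind11's registered-types hash map.
std::unordered_map<std::type_index, std::string> registry = {
    {std::type_index(typeid(int)), "int binding"},
};

// Stand-in for one bound function: the hash-map lookup happens on the first
// call only; every later call reuses the cached pointer.
const std::string &lookup_int() {
    static const std::string *cached = nullptr;  // analogous to a per-cpp_function cache
    if (!cached)
        cached = &registry.at(std::type_index(typeid(int)));
    return *cached;
}

int main() {
    for (int i = 0; i < 3; i++)
        assert(lookup_int() == "int binding");  // only the first iteration hashes
}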
@jagerman Thanks. I did a bit of testing with a trivial struct with a single int field:
This is gcc5, -O3 -DNDEBUG, Python 3.5, Linux, E5 2687W, times per iteration:
=> pybind11 overhead goes from 225ns to 207ns, which is approximately 8%. So the proposed commit seems to work as intended. Should we also have untracked variants of the reference policies?
This speedup seems very small compared to the profiler output mentioned earlier. How come?
Because the return value type lookup is the relatively smaller overhead here. Much of the rest of the hashing cost is in the instance lookup, which still happens for every returned instance.
Also, just for the record, ...
Yea, there's the instance tracking, which takes time. Plus, this is a very simplified example (I'll have to try rebuilding the code I have with these changes to see how they affect a real-world large codebase, but I may not be able to do so until Monday). Here's a simple benchmark (the "untracked" code can be commented out; it's my local change: it adds a "tracked" flag to the instance wrapper and skips registering on cast and unregistering on dealloc):

// test_cast.cpp
#include <pybind11/pybind11.h>

struct Foo { int v; Foo(int i) : v(i) {} };

PYBIND11_PLUGIN(test_cast) {
    namespace py = pybind11;
    py::module m("test_cast");
    py::class_<Foo>(m, "Foo");
    m.def("mk_copy", [](int i) -> Foo { return {i}; },
          py::return_value_policy::copy);
    m.def("mk_untracked", [](int i) -> Foo { return {i}; },
          py::return_value_policy::copy_untracked);
    m.def("ref", [](int N) {
        for (int i = 0; i < N; i++) {
            Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
        }
    });
    return m.ptr();
}

# test_cast.py
from time import time
from test_cast import mk_copy, mk_untracked, ref
def bench(msg, iterable, norm=1):
    n = 0
    t0 = time()
    for _ in iterable:
        n += 1
    t1 = time()
    n *= norm
    print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
          .format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 10 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [N]), N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# return_value_policy::copy_untracked
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))

Using @jagerman's branch:

C++: calls: 10000000 elapsed: 0.279s ns/call: 27.9ns
Python: calls: 10000000 elapsed: 1.587s ns/call: 158.7ns
mk_untracked: calls: 10000000 elapsed: 2.573s ns/call: 257.3ns
mk_copy: calls: 10000000 elapsed: 3.298s ns/call: 329.8ns

On master:

mk_copy: calls: 10000000 elapsed: 3.496s ns/call: 349.6ns
@aldanor Re: ... okay, there's my answer :)
@jagerman Yes, that's exactly what I did :)
It would be interesting to see a profiler dump for the untracked version.
Something else to think about: would it make more sense to apply this do-not-track attribute to the type, rather than to the return value? Otherwise I could get weird behaviour with something like the sketch below:
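(A hypothetical illustration, reusing the Foo binding from the benchmark above; copy_untracked is the proposed policy, not an existing one.)

m.def("mk_tracked", [](int i) -> Foo { return {i}; },
      py::return_value_policy::copy);            // returned instances get registered
m.def("mk_untracked", [](int i) -> Foo { return {i}; },
      py::return_value_policy::copy_untracked);  // proposed: registration skipped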
where the behaviour of a returned Foo depends on which function produced it. If, on the other hand, the don't-track-me flag is a class property, I'll always get consistent behaviour: every function that returns such a type behaves the same way.
@jagerman I think this would throw a cast exception though (correct me if I'm wrong): if an untracked instance gets passed back into a C++ function?
I don't think so: loading an argument goes by the Python type of the object, not by the registered instances map.
Oh, right. Well, then we can make it fail? If we have a valid instance and it's not tracked, reject the cast.
Is there any reason you shouldn't be able to pass these instances to C++ functions (perhaps there's some other overhead I'm not thinking of)? I'd be a bit uncomfortable with the resulting situation if you couldn't: if the Python object was created this way, you can't pass it to a function, but if it was created that way, you can. On the other hand, if the non-copyability applies to all instances of a particular type, the behaviour of an instance is determined by what the instance is, rather than where it came from. Even if there is a compelling reason that these untracked instances shouldn't be passable to C++ functions, it still seems better that no Python instance of the type is passable than that only some are.
@jagerman No other reason, but to me personally it feels like this is actually a return value policy and not a property of the type (e.g. you could have some functions returning untracked instances of a type, but not others?). Need to ponder this some more :) Just out of curiosity (it should probably be done a bit more carefully), I've implemented a simple proof-of-concept which rejects reference casts for untracked instances, but accepts them by value: branch / test example.
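Roughly, the intended behaviour (a sketch, since copy_untracked is hypothetical; Foo and m as in the benchmark above):

m.def("mk", [](int i) -> Foo { return {i}; },
      py::return_value_policy::copy_untracked);
m.def("by_value", [](Foo f) { return f.v; });   // accepted: loads a copy of the value
m.def("by_ref", [](Foo &f) { return f.v; });    // PoC: rejects untracked instances,
                                                // raising a TypeError at call time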
This is a pretty harsh restriction: it means I can't pass such a Python instance to a function that takes a SomeType &. To summarize, just to be sure I understand the issues correctly (correct me if I have anything wrong here), the main issue is that the following will result in a double-free:

m.def("get", [&]() { return new SomeType(); }, py::return_value_policy::copy_untracked);
m.def("pass_through", [&](SomeType &s) { return &s; });

a = get()
b = pass_through(a)

because a's instance isn't in the registered instances map, so pass_through can't find the existing wrapper and instead returns a new Python instance that also takes ownership of the same C++ object; when both a and b are collected, the object is deleted twice.

Does that sound about right?
I don't quite understand why, but this breaks a lot of tests (tests are fine with the untracked commit, but not with the restricting one).
@jagerman Thanks for taking a look. Yea, as I said, the second commit was just me playing around with it; it could be done more carefully if need be (one of the problems being where in the casting machinery the check has to happen). Looking at your last example: yea, that's the kind of thing that should be prevented with untracked instances one way or another. And preventing passing untracked instances as reference arguments may indeed not be the best way to go (although I'm certain it's feasible). I'm wondering, would your previous suggestion of attaching the "untrackedness" property to the type itself solve this? What kind of limitations would be imposed on such types?
I think so. Basically, if the type is declared untrackable, everything in the many-argumented cast implementation that deals with instance lookup and registration can simply be skipped for that type.
Yea, I see. So for simple/small POD types (like the one in the example) this would make sense: you pay the price of copying (which could be negligible) but avoid the hashmap lookup, insert and erase (and, with the proposed change in your branch, any hashmap operations at all beyond the first call). So when it's returned from C++, it's force-copied. But if I understand correctly, this does not prohibit it from crossing the boundary from Python to C++ as e.g. a mutable lvalue reference? That's good then. In fact, for near-pointer-sized types, this would often be pretty much a free win.
I think that's the main use case here. If you're writing code that needs to iterate through many large objects, you'll be much better off with a continually updated reference, like in your original example.
@aldanor take a look at my only-copy-types branch: it's an implementation of the type-based approach. I adapted your simple benchmark code as follows:

test_cast.cpp:

#include <pybind11/pybind11.h>
struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };

PYBIND11_PLUGIN(test_cast) {
    namespace py = pybind11;
    py::module m("test_cast");
    py::class_<Foo>(m, "Foo", py::always_copy());
    py::class_<Bar>(m, "Bar");
    m.def("mk_copy", [](int i) -> Bar { return {i}; },
          py::return_value_policy::copy);
    m.def("mk_untracked", [](int i) -> Foo { return {i}; });
    m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; },
          py::return_value_policy::reference);
    m.def("ref", [](int N) {
        for (int i = 0; i < N; i++) {
            Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
        }
    });
    return m.ptr();
}

test_cast.py:

from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref
def bench(msg, iterable, norm=1):
    n = 0
    t0 = time()
    for _ in iterable:
        n += 1
    t1 = time()
    n *= norm
    print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
          .format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))

which gives the following timings:

or, applied to master (i.e. without the return value type caching):
I think it's an OK compromise if you do want/need to make a copy. For a user-facing API, I find that an iterator with internal-reference return type may be quite confusing, since you essentially get the same object in Python on every step, and if you start collecting the yielded items into a container the results may be unexpected. Also, if you want the absolutely fastest version with an internal-reference iterator, you can slightly modify py::iterator_state to cache the wrapping object, as described in the issue text. On my home OS X box, I get slower timings but the same ordering:
What I find a bit strange is that 208 + 50.8 < 294.6: copying an int-sized struct shouldn't take about twice as long as inserting into and erasing from a hashmap? Maybe the compiler optimizes something in an unobvious way, or maybe one version is less cache-friendly; not sure. // As Wenzel mentioned above, it would be nice to pprof this, also to see where the rest of the time goes inside pybind11 (unfortunately, pprof/gperftools doesn't resolve all symbols on OS X, so if someone on Linux could check it, that'd be great; otherwise I'll try to check next week myself).
It has to do more than just that: there's also the Python instance initialisation and deallocation, which isn't needed when the registered instance already exists.
With all the hashtable stuff gone with always_copy, there's really not much left in terms of obvious pybind11 overhead.
Thanks for the profiles, that's what I would expect. I was thinking, would this have to be restricted to copying? Does it make sense to support moving as well?
I think move support would have to come in via the return_value_policy: so if the object is returned with a move policy, it could be move-constructed into the new instance instead of copied. Something like the sketch below.
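(A minimal sketch: return_value_policy::move already exists in pybind11; always_copy is the experimental tag from the branch.)

py::class_<Foo>(m, "Foo", py::always_copy());
m.def("mk_move", [](int i) -> Foo { return {i}; },
      py::return_value_policy::move);  // move-construct the returned Foo into the
                                       // new (unregistered) instance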
Yes, good point.
Agreed, but ...
So, everything looks pretty good; are we going ahead with all this then?
I vote for no_reference. I'd sort of like to wait for #385 (which looks like it will depend on #372) for the

py::class_<Name, py::no_reference>(m, "Name")

form, which feels like a more natural declaration to me than:

py::class_<Name>(m, "Name", py::no_reference())

(mainly because it keeps the tag with the C++ class).
I'm a bit lost in the discussion above, which considered a variety of possible changes. What is the specific proposal that is under consideration now? Is there candidate code to look at?
There will be shortly; I have to get #385 updated first, as this now depends on that.
I've updated my only-copy-types branch: it now supports the py::class_<T, py::no_reference> tag. It doesn't currently have documentation or test code committed; these are the current versions of the test/benchmark code I've been using (which also show how it works):

test_cast.cpp:

#include <pybind11/pybind11.h>
#include <pybind11/functional.h>

struct Foo { int v; Foo(int i) : v(i) {} };
struct Bar { int v; Bar(int i) : v(i) {} };

PYBIND11_PLUGIN(test_cast) {
    namespace py = pybind11;
    py::module m("test_cast");
    py::class_<Foo, py::no_reference>(m, "Foo");
    py::class_<Bar>(m, "Bar");
    m.def("mk_copy", [](int i) -> Bar { return {i}; },
          py::return_value_policy::copy);
    m.def("mk_untracked", [](int i) -> Foo { return {i}; });
    m.def("py_ref", [](int i) { static Bar b(0); b.v = i; return &b; },
          py::return_value_policy::reference);
    m.def("ref", [](int N) {
        for (int i = 0; i < N; i++) {
            Foo f(i); auto f2 = new Foo(f); volatile int y = f2->v; delete f2;
        }
    });
    m.def("ref_lambda", [](int N, py::object o) {
        for (int i = 0; i < N; i++) {
            o(Foo(i));
        }
    });
    return m.ptr();
}

test_cast.py:

from time import time
from test_cast import mk_copy, mk_untracked, py_ref, ref, ref_lambda
def bench(msg, iterable, norm=1):
    n = 0
    t0 = time()
    for _ in iterable:
        n += 1
    t1 = time()
    n *= norm
    print('{:<15s} calls: {:<10d} elapsed: {:.3f}s ns/call: {:.1f}ns'
          .format(msg + ':', n, t1 - t0, 1e9 * (t1 - t0) / n))
N = 20 * 1000 * 1000
# object is created and copied in a C++ loop
bench('C++', (ref(x) for x in [10*N]), 10*N)
# empty Python loop that allocates an object and does nothing else
bench('Python', (object() for i in range(N)))
# returns an always_copy type
bench('mk_untracked', (mk_untracked(i) for i in range(N)))
# rvp::reference
bench('py_ref', (py_ref(i) for i in range(N)))
# Loop in C++ calling python lambda
bench('C++ w/ lambda', (ref_lambda(x, lambda z: z) for x in [N]), N)
# return_value_policy::copy
bench('mk_copy', (mk_copy(i) for i in range(N)))

There are really two separate issues here: "Iterators with internal reference" and "No internal reference or copy r/v policy"; #390 addresses the first one, while only-copy-types is a potential solution to the second.
@aldanor Just out of curiosity, I added a "C++ w/ lambda" case into the test above: it manages to perform slightly better than either of the reference/copy approaches.
Re: C++ w/ lambda, I guess it's partly because the loop is now in C++, so Python doesn't have to read/write the locals dict a gazillion times? That, plus no hashmap lookups. Quite surprising that it's so fast nonetheless; this is really good stuff :)
@jagerman Alright, no worries. I'll try to finish it up; I think it's worth doing, especially if the return type cache PR lands as well.
The return value caching in PR #390 was retracted in favour of including it as part of the eventual PR addressing both issues here at once.
The current iterator implementation turns out to be very inefficient at times (to the point where it adds overhead that is an order of magnitude bigger than the iterator logic itself).
Iterators with internal reference
Consider a hypothetical example (this would be quite typical for stream-like / input ranges), sketched below:
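A minimal sketch of what such a range could look like (names hypothetical): the dereference always returns a reference to the same internal buffer, which is overwritten on each increment.

#include <pybind11/pybind11.h>
namespace py = pybind11;

struct point { double x, y; };

struct point_stream {
    struct iterator {
        point buf;       // single internal buffer, reused for every element
        long remaining;
        const point &operator*() const { return buf; }
        iterator &operator++() { --remaining; buf.x += 1; return *this; }
        bool operator==(const iterator &o) const { return remaining == o.remaining; }
        bool operator!=(const iterator &o) const { return remaining != o.remaining; }
    };
    long n;
    iterator begin() const { return {{0, 0}, n}; }
    iterator end() const { return {{0, 0}, 0}; }
};

// Wrapped (inside the module initialization) with something like:
//     py::class_<point>(m, "point");
//     m.def("stream", [](long n) {
//         point_stream s{n};
//         return py::make_iterator(s.begin(), s.end());  // default rvp: reference_internal
//     });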
Assuming point has a corresponding py::class_ binding, wrapping this range in a py::make_iterator() with the default return value policy of reference_internal and iterating over it in Python will do roughly the following for each iteration step, for the dereferenced point instance: look up its type_info in the internal types hash map (why isn't this cached per each py::cpp_function?), then look the instance itself up in the registered instances multimap. Note that this will essentially yield the same Python object on every iteration, but it will still perform two map lookups every time.
In this example, this could be completely avoided if py::iterator_state cached the resulting py::object (whose ->value always points to the same C++ instance in this example, so it doesn't even need to be reallocated) and not just the current state of the begin/end iterators. Instead of doing the cast, it could just incref the object and return the handle.
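A rough sketch of such a caching state (hypothetical; not pybind11's actual iterator_state, and the member names are made up):

#include <pybind11/pybind11.h>
namespace py = pybind11;

template <typename Iterator, typename Sentinel>
struct cached_iterator_state {
    Iterator it;
    Sentinel end;
    py::object current;  // wraps *it once; its ->value keeps pointing at the
                         // same C++ instance across increments
    bool first = true;
};

// __next__ would then advance the C++ iterator (except on the first call),
// raise StopIteration at the end, and otherwise do roughly:
//     if (!s.current)
//         s.current = py::cast(*s.it, py::return_value_policy::reference_internal, parent);
//     return s.current;  // just an incref: no type or instance hash-map lookups

No internal reference or copy r/v policy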
If the C++ iterator returns an object with a different address on each dereference, or if we specify the copy return value policy, things get even worse.
The sequence of steps is now, for each dereferenced point instance: look up its type_info in the types hash map, allocate a new Python instance and copy the C++ object into it, then register the new instance in the registered instances multimap (only for it to be unregistered again in the deallocator).
If the downstream Python code doesn't care about yielded values outside of one iteration cycle and doesn't pass them as arguments to other functions, which is quite often the case, e.g. a plain for p in stream: use(p) loop or a generator expression that consumes each value immediately,
then registering/unregistering instances adds overhead that is completely unneeded. Plus, between garbage collection cycles, the registered instances multimap will grow quite fast here, which will slow things down even further. Here, the sequence of steps might as well just be: allocate a Python instance, copy the C++ value into it, and hand it over, with no map lookups or registration at all.
Would it make sense to have a return value policy copy_untracked (cast doesn't do the registered instance lookup; dealloc calls the dtor and doesn't try to unregister)? Or reference_untracked (cast doesn't do the registered instance lookup; dealloc doesn't call the dtor and doesn't try to unregister), or something like that? The only catch here is that some sort of flag must be stored in the instance itself, so that the deallocator knows not to try to erase it from the registered instances multimap when the time comes.
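A hypothetical sketch of that flag (heavily simplified; the real pybind11 instance layout differs, and the two helpers are made up):

#include <Python.h>

struct instance {
    PyObject_HEAD
    void *value;
    bool owned : 1;
    bool tracked : 1;  // false for the proposed *_untracked policies
};

void erase_from_registered_instances(void *);  // hypothetical helper
void destruct_held_value(instance *);          // hypothetical helper

void instance_dealloc(PyObject *self) {
    auto *inst = reinterpret_cast<instance *>(self);
    if (inst->tracked)
        erase_from_registered_instances(inst->value);  // skipped when untracked
    if (inst->owned)
        destruct_held_value(inst);  // copy_untracked would still destruct here;
                                    // reference_untracked would have owned == false
    Py_TYPE(self)->tp_free(self);
}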