Support for sharing dtypes across extensions + public shared data API #472
A few minor outstanding questions:
|
e2628b8
to
ee45775
Compare
Good point about The purpose of NOINLINE is to prohibit any kind of inlining for medium-sized functions that are called from templated code, or simply from many different places. (for trivial ones we don't care, and large-sized functions wouldn't actually get inlined). |
PS: It would be nice if you could add a couple of lines about |
(whoops) |
Re: noinline, wouldn't this be better (as in avoiding a non-inlined function call on all invocations except the very first one)?
PYBIND11_NOINLINE inline internals &load_internals() {
    // capsule loading etc
}
inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (internals_ptr)
        return *internals_ptr;
    internals_ptr = &load_internals();
    return *internals_ptr;
} |
(typo in the example above, fixed) |
Take a look at this in a disassembler, you'll be surprised :) -- static local variables are tricky, and they generate a fair bit of code (much more than a function call). |
Give it a try and see how much it enlarges the .so. It's probably enough of an increase that it isn't worthwhile, but would be nice to see anyway. |
Curiously, I see a slight decrease in total .so size when doing @aldanor's suggestion--though to make it work, I changed it like so:
PYBIND11_NOINLINE inline void load_internals(internals *&internals_ptr) {
    ...
}
__attribute__((always_inline)) inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (!internals_ptr) load_internals(internals_ptr);
    return *internals_ptr;
}
Without the inline-forcing attribute (or with inline forced off using PYBIND11_NOINLINE) I get an .so size of 985816; with inline forced, I get a slight .so drop to 985688. With clang++-3.8 on Debian, I see no difference in .so size at all. |
That's curious indeed :) Btw I think you could simplify it even a bit further without an if check, no?
PYBIND11_NOINLINE internals *load_internals() {
    ...
}
__attribute__((always_inline)) inline internals &get_internals() {
    static auto ptr = load_internals();
    return *ptr;
} |
That way slightly reduces the non-inlined case to 985624, but makes the inlined case considerably larger, at 989784. (And just for reference, the current master, single-function .so size is 985528). |
ee45775
to
0a53644
Compare
Added an example in the docs for the Re: |
Initializing a static variable to a compile-time constant eliminates most of the static machinery (guard variable + thread safe initialization), which explains the size differences between the two implementations above. See the generated assembly for As for splitting up |
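To make the difference between the two variants discussed above concrete, here is a minimal standalone sketch. The `internals` struct, `make_internals()`, and the `get_a()`/`get_b()` names are stand-ins invented for illustration and are not pybind11 code. Variant A initializes its static pointer with a constant, so the compiler can place it in static storage without a guard; variant B has a dynamic initializer, so on Itanium-ABI compilers (GCC, Clang) the function typically carries a guard variable plus calls such as `__cxa_guard_acquire`.

```cpp
#include <cassert>

// Hypothetical stand-in for the real internals struct.
struct internals { int value = 0; };

static int init_calls = 0;  // counts how often the expensive lookup runs

// Stand-in for the capsule lookup; always hands back the same object.
internals *make_internals() {
    static internals storage;
    ++init_calls;
    return &storage;
}

// Variant A: the initializer is a constant expression (nullptr), so no
// guard variable is needed; the null check is written by hand instead.
internals &get_a() {
    static internals *ptr = nullptr;
    if (!ptr)
        ptr = make_internals();
    return *ptr;
}

// Variant B: the initializer is a function call, so the compiler has to
// emit thread-safe one-time initialization around it.
internals &get_b() {
    static internals *ptr = make_internals();
    return *ptr;
}
```

Compiling this with `-O2 -S` and comparing the bodies of `get_a` and `get_b` should show the extra guard handling in the second one. Note that variant A gives up the thread-safety guarantee that variant B gets from the language.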
Out of sheer curiosity, I tried running something like this:
#include <chrono>
#include "include/pybind11/pybind11.h"
using namespace std::chrono;
namespace py = pybind11;
PYBIND11_PLUGIN(perf) {
    py::module m("perf");
    m.def("get_internals", []() -> double {
        size_t p = 0;
        const size_t n = 100 * 1000 * 1000;
        auto t0 = high_resolution_clock::now();
        for (size_t i = 0; i < n; i++)
            p |= (size_t) &py::detail::get_internals();
        auto t1 = high_resolution_clock::now();
        p ^= (size_t) &py::detail::get_internals();
        return p + duration_cast<nanoseconds>(t1 - t0).count() * 1. / n;
    });
    return m.ptr();
}
It looks like this version #472 (comment) is 5x faster than the current implementation. Given it's only 0.5ns vs 2.5ns you could say it's quite negligible, but still... :) // I really hope the compiler didn't optimise something stupid away, but it doesn't look like it did. This is on OS X with -O3. |
Also... by adding a branch prediction hint: if (__builtin_expect(!internals_ptr, 0)) load_internals(internals_ptr); it goes further down to ~0.25ns, free win. 🐼 |
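For readers who want to try the hinted variant outside of pybind11, the following is a rough self-contained sketch. `SKETCH_UNLIKELY`, `SKETCH_NOINLINE`, the `internals` stand-in, and `load_count` are all names made up for this example; only `__builtin_expect` (GCC/Clang) and the noinline attributes are real compiler features.

```cpp
#include <cassert>

// Portable wrappers (hypothetical names): the hint degrades to a plain
// condition on compilers without __builtin_expect.
#if defined(__GNUC__) || defined(__clang__)
#  define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#  define SKETCH_NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define SKETCH_UNLIKELY(x) (x)
#  define SKETCH_NOINLINE __declspec(noinline)
#else
#  define SKETCH_UNLIKELY(x) (x)
#  define SKETCH_NOINLINE
#endif

// Hypothetical stand-in for the real internals struct.
struct internals { int id = 7; };

static int load_count = 0;  // how many times the slow path was taken

// Slow path: emitted once, never inlined into callers.
SKETCH_NOINLINE void load_internals(internals *&ptr) {
    static internals storage;  // stands in for the capsule lookup
    ++load_count;
    ptr = &storage;
}

// Fast path: a load and a branch that is hinted as rarely taken.
inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (SKETCH_UNLIKELY(!internals_ptr))
        load_internals(internals_ptr);
    return *internals_ptr;
}
```

Whether the hint changes the generated code at all depends on the compiler and optimization level, so any timing difference is best confirmed by looking at the assembly.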
If you don't mind experimenting a bit more, try the version from your comment above. I suspect it will be just as fast even without the |
I think all of this is fast enough that we basically don't care -- fractions of nanoseconds don't matter much when the next Python C API call takes hundreds of them (plus, it's really tricky to measure stuff of that magnitude -- you'll generally want to view & understand the assembly to isolate the signal from noise, i.e. unrelated compiler passes). My main optimization goal for these things has always been to cut down on generated object code rather than shaving off a nanosecond somewhere. (After working with Boost.Python for many years, object code bloat was one of the things that really bothered me) This has involved aggressively un-templating certain pieces of code and playing with inline/noinline statements. I think it's clear that calling the following function
will generate more code (if referenced many times) than
In the first case, it's a load + conditional jump + function call. In the second case, it's a function call to a function that is instantiated just once. |
PS: There was a statement to the contrary above, which I admit I don't understand -- however, I think this general approach is sound |
Yep, that all makes sense, but doesn't seem to explain #472 (comment) which seems to be both faster (because function call is never reached after the very first call) and generate less (??) code, looks like a win/win unless we've missed something. I'd leave it for future consideration if we're not changing it now since it's not a direct part of this PR anyway (which is pretty much finished). |
@@ -655,99 +689,99 @@ struct field_descriptor {
     dtype descr;
 };

 template<typename F>
 static PYBIND11_NOINLINE void register_structured_dtype(
Probably doesn't need to be a template: F -> std::initializer_list<field_descriptor>.
Also static -> inline noinline
Agreed, fixed.
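A rough sketch of what that review suggestion amounts to, under assumptions: the `field_descriptor` below is a simplified stand-in (the real one in the diff holds other members such as a `dtype`), and `describe_fields`, `register_type`, `Point`, and `SKETCH_NOINLINE` are hypothetical names. The point is only the shape: the per-type template stays a thin wrapper, while the actual work lives in one non-template, non-inlined function taking a `std::initializer_list`.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <stdexcept>
#include <string>

#if defined(_MSC_VER)
#  define SKETCH_NOINLINE __declspec(noinline)
#else
#  define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Simplified, hypothetical stand-in for the field_descriptor in the diff.
struct field_descriptor {
    const char *name;
    std::size_t offset;
    std::size_t size;
};

// Non-template worker: one copy in the binary, however many struct types
// are registered. Returns "name@offset:size" entries joined by commas.
SKETCH_NOINLINE inline std::string
describe_fields(std::initializer_list<field_descriptor> fields, std::size_t itemsize) {
    std::string out;
    for (const field_descriptor &f : fields) {
        if (f.offset + f.size > itemsize)
            throw std::invalid_argument(std::string("field out of bounds: ") + f.name);
        if (!out.empty())
            out += ',';
        out += std::string(f.name) + '@' + std::to_string(f.offset)
             + ':' + std::to_string(f.size);
    }
    return out;
}

// Thin per-type wrapper: the only part instantiated once per T.
template <typename T>
std::string register_type(std::initializer_list<field_descriptor> fields) {
    return describe_fields(fields, sizeof(T));
}

// Example struct used in the usage note below.
struct Point { float x; float y; };
```

With this split, something like `register_type<Point>({{"x", offsetof(Point, x), sizeof(float)}, {"y", offsetof(Point, y), sizeof(float)}})` instantiates only the one-line wrapper per type.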
81e90ea
to
e44f35b
Compare
ah, I was going to merge this now, but it is conflicted after merging the other PR |
I could rebase if you want |
Yes, please do -- thanks! |
NumPy internals are stored under "_numpy_internals" key.
(avoid code bloat if possible)
e44f35b
to
cc8ff16
Compare
Rebased, should be good to go |
Great, thanks! |
This PR adds support for sharing registered dtypes across multiple extension modules, which previously may or may not have worked depending on the compiler, optimization settings, linker settings, etc.
The dtypes are shared via the same capsule where the internals are stored. As part of this PR, a few functions are added to the public API so that the "shared" part of the capsule can be accessed without breaking backwards compatibility (see get_shared_data(), set_shared_data()).
I've also moved register_dtype() out of the npy_format_descriptor<> template; this avoids considerable code bloat when registering many dtypes.
(Original issue: #468)
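As a rough illustration of the idea (not the PR's actual code), the sketch below models the shared part of the capsule as a process-wide map from string keys to opaque pointers. Everything lives in a made-up `sketch` namespace; the function names mirror `get_shared_data()` and `set_shared_data()` from the description, but their signatures here are assumptions, and `numpy_internals` is a simplified stand-in. The `"_numpy_internals"` key is the one named in the commit message above.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

namespace sketch {

// Stand-in for the "shared" part of the internals capsule (assumption:
// a simple name -> opaque pointer table).
inline std::unordered_map<std::string, void *> &shared_data() {
    static std::unordered_map<std::string, void *> data;
    return data;
}

// Returns the pointer stored under `name`, or nullptr if nothing is there.
inline void *get_shared_data(const std::string &name) {
    auto it = shared_data().find(name);
    return it != shared_data().end() ? it->second : nullptr;
}

// Stores `data` under `name` and returns it.
inline void *set_shared_data(const std::string &name, void *data) {
    shared_data()[name] = data;
    return data;
}

// Simplified stand-in for the per-process NumPy dtype registry.
struct numpy_internals { int registered = 0; };

// Every module asks the shared table first and only creates the registry
// if no other module has done so yet.
inline numpy_internals &get_numpy_internals() {
    static numpy_internals *ptr = nullptr;
    if (!ptr) {
        ptr = static_cast<numpy_internals *>(get_shared_data("_numpy_internals"));
        if (!ptr) {
            ptr = new numpy_internals();  // intentionally never freed
            set_shared_data("_numpy_internals", ptr);
        }
    }
    return *ptr;
}

}  // namespace sketch
```

In the real setting the table would live in the capsule shared by all extension modules, so two separately compiled modules looking up the same key end up with the same registry object.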