Support for sharing dtypes across extensions + public shared data API #472

Merged: 5 commits merged into pybind:master on Nov 3, 2016

Conversation

@aldanor (Member) commented Oct 31, 2016

This PR adds support for sharing registered dtypes across multiple extension modules, which previously may or may not have worked depending on the compiler, optimization settings, linker settings, etc.

The dtypes are shared via the same capsule where the internals are stored. As part of this PR, a few functions are added to the public API so the "shared" part of the capsule can be accessed without breaking backwards compatibility (see get_shared_data() and set_shared_data()).

I've also moved register_dtype() out of the npy_format_descriptor<> template, which avoids considerable code bloat when registering many dtypes.

(Original issue: #468)
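
A minimal usage sketch of the proposed API, assuming the signatures introduced in this PR (a std::string key and a void * value); my_shared_state, module_a, and the key string below are made up for illustration:

#include <pybind11/pybind11.h>

namespace py = pybind11;

// Some state we'd like to share across several extension modules.
struct my_shared_state { int counter = 0; };

PYBIND11_PLUGIN(module_a) {
    py::module m("module_a");

    // Look the entry up first so that only the first module to be imported
    // allocates it; any later module (built as a separate extension) will
    // find it in the internals capsule via the same string key.
    auto *state = static_cast<my_shared_state *>(
        py::get_shared_data("my_project::state"));
    if (!state) {
        state = new my_shared_state();
        py::set_shared_data("my_project::state", state);
    }

    m.def("bump", [state]() { return ++state->counter; });
    return m.ptr();
}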

@aldanor (Member, Author) commented Oct 31, 2016

A few minor outstanding questions:

  • Should the key type of shared_data be std::string or const char * (in which case it would also need a comparator and hasher -- see the sketch after this list)? I don't think it matters much, since this is meant to be used in initialization code that likely runs only once during the program's lifetime, so I just used std::string.
  • Should detail::get_internals() in cast.h also be made not noinline, the same way it's done for numpy.h in this PR, i.e. by splitting it into two parts?
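
(For reference, this is roughly the extra machinery a const char * key would require -- just a sketch of the alternative that wasn't taken, with made-up helper names:)

#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>

// Hash the pointed-to characters rather than the pointer value.
struct cstr_hash {
    std::size_t operator()(const char *s) const { return std::hash<std::string>()(s); }
};

// Compare string contents rather than pointer identity.
struct cstr_equal {
    bool operator()(const char *a, const char *b) const { return std::strcmp(a, b) == 0; }
};

using shared_data_map = std::unordered_map<const char *, void *, cstr_hash, cstr_equal>;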

@wjakob (Member) commented Oct 31, 2016

Good point about std::string vs char * -- let's use the former then.

The purpose of NOINLINE is to prohibit any kind of inlining for medium-sized functions that are called from templated code, or simply from many different places (for trivial ones we don't care, and large functions wouldn't actually get inlined anyway). get_internals() definitely satisfies that criterion. A function call is fairly compact and leads to considerable object code savings in this case -- I was able to shave off quite a bit throughout the development of this project.
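
For reference, a simplified sketch of what PYBIND11_NOINLINE boils down to (the authoritative definition lives in pybind11's common header; this is an approximation):

// Map PYBIND11_NOINLINE to the compiler-specific "never inline" attribute.
#if defined(_MSC_VER)
#  define PYBIND11_NOINLINE __declspec(noinline)
#else
#  define PYBIND11_NOINLINE __attribute__ ((noinline))
#endif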

@wjakob (Member) commented Oct 31, 2016

PS: It would be nice if you could add a couple of lines about get_shared / set_shared to the docs.

@wjakob closed this Oct 31, 2016
@wjakob reopened this Oct 31, 2016
@wjakob (Member) commented Oct 31, 2016

(whoops)

@aldanor (Member, Author) commented Oct 31, 2016

Re: noinline, wouldn't this be better (as in avoiding a non-inlined function call on all invocations except the very first one)?

PYBIND11_NOINLINE inline internals &load_internals() {
    // capsule loading etc
}

inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (internals_ptr)
        return *internals_ptr;
    internals_ptr = &load_internals();
    return *internals_ptr;
}

@aldanor (Member, Author) commented Oct 31, 2016

(typo in the example above, fixed)

@wjakob (Member) commented Oct 31, 2016

Take a look at this in a disassembler, you'll be surprised :) -- static local variables are tricky, and they generate a fair bit of code (much more than a function call).

@jagerman (Member):

Give it a try and see how much it enlarges the .so. It's probably enough of an increase that it isn't worthwhile, but it would be nice to see anyway.

@jagerman (Member) commented Oct 31, 2016

Curiously, I see a slight decrease in total .so size when applying @aldanor's suggestion -- though to make it work, I changed it like so:

PYBIND11_NOINLINE inline void load_internals(internals *&internals_ptr) {
    ...
}

__attribute__((always_inline)) inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (!internals_ptr) load_internals(internals_ptr);
    return *internals_ptr;
}

Without the inline-forcing attribute (or with inline forced off using PYBIND11_NOINLINE) I get an .so size of 985816; with inline forced, I get a slight .so drop to 985688.

With clang++-3.8 on Debian, I see no difference in .so size at all.

@aldanor (Member, Author) commented Oct 31, 2016

That's curious indeed :)

Btw, I think you could simplify it a bit further, without the if check, no?

PYBIND11_NOINLINE internals* load_internals() {
    ...
}

__attribute__((always_inline)) inline internals &get_internals() {
    static auto ptr = load_internals();
    return *ptr;
}

@jagerman (Member):

That version slightly reduces the non-inlined case to 985624, but makes the inlined case considerably larger, at 989784. (And just for reference, the current master's single-function .so size is 985528.)

@aldanor force-pushed the feature/shared-dtypes branch from ee45775 to 0a53644 on October 31, 2016 21:42
@aldanor (Member, Author) commented Oct 31, 2016

Added an example to the docs for the shared_data API and changed get_numpy_internals() as in @jagerman's first example.

Re: get_internals(), what's the consensus? :)

@dean0x7d (Member):

Initializing a static variable to a compile-time constant eliminates most of the static machinery (guard variable + thread-safe initialization), which explains the size differences between the two implementations above. See the generated assembly for get_internals1 vs get_internals2: https://godbolt.org/g/peBY1L

As for splitting up get_internals() into always_inline and noinline: Is the goal better performance or smaller binary size? If it's for performance, it would be good to measure the runtime before complicating the code.
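
A self-contained sketch of the comparison behind the godbolt link (the names below are stand-ins, not pybind11's actual definitions); compile with -O2 and diff the assembly of the two functions:

struct internals { int dummy; };

internals *load_internals();   // slow path, defined elsewhere, called at most once

// get_internals1: the static has a dynamic initializer, so the compiler emits
// a hidden guard variable plus thread-safe initialization code here.
internals &get_internals1() {
    static internals *ptr = load_internals();
    return *ptr;
}

// get_internals2: the static is initialized to a compile-time constant
// (nullptr), so it lives in zero-initialized storage and no guard is needed;
// the fast path is just a load, a null check, and a return.
internals &get_internals2() {
    static internals *ptr = nullptr;
    if (!ptr) ptr = load_internals();
    return *ptr;
}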

@aldanor (Member, Author) commented Nov 1, 2016

Out of sheer curiosity, I tried running something like this:

#include <chrono>
#include "include/pybind11/pybind11.h"

using namespace std::chrono;
namespace py = pybind11;

PYBIND11_PLUGIN(perf) {
    py::module m("perf");
    m.def("get_internals", []() -> double {
        size_t p = 0;  // accumulate returned addresses so the calls can't be optimized away
        const size_t n = 100 * 1000 * 1000;
        auto t0 = high_resolution_clock::now();
        for (size_t i = 0; i < n; i++)
            p |= (size_t) &py::detail::get_internals();
        auto t1 = high_resolution_clock::now();
        p ^= (size_t) &py::detail::get_internals();  // p currently equals that address, so this zeroes it
        return p + duration_cast<nanoseconds>(t1 - t0).count() * 1. / n;  // p == 0, i.e. just the average ns per call
    });
    return m.ptr();
}

It looks like this version (#472 (comment)) is 5x faster than the current implementation. Given that it's only 0.5ns vs 2.5ns, you could say it's quite negligible, but still... :)

// I really hope the compiler didn't optimise something stupid away, but it doesn't look like it did. This is on OS X with -O3.

@aldanor (Member, Author) commented Nov 1, 2016

Also... by adding a branch prediction hint:

if (__builtin_expect(!internals_ptr, 0)) load_internals(internals_ptr);

it goes further down to ~0.25ns -- a free win.

🐼
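
For context, this is where the hint sits in @jagerman's variant above (__builtin_expect is a GCC/Clang builtin, so a portable build would have to guard or omit it; the declarations are stand-ins for the real ones):

struct internals;                                 // defined in pybind11's internals header
void load_internals(internals *&internals_ptr);   // the PYBIND11_NOINLINE slow path

inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (__builtin_expect(!internals_ptr, 0))      // hint: the branch is almost never taken
        load_internals(internals_ptr);
    return *internals_ptr;
}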

@dean0x7d (Member) commented Nov 1, 2016

If you don't mind experimenting a bit more, try the version from your comment above. I suspect it will be just as fast even without the expect hint (value semantics vs. mutable reference parameters).

@aldanor (Member, Author) commented Nov 1, 2016

@dean0x7d Just ran that version; it seems to yield the exact same timings as @jagerman's example: 0.5ns without the branch hint, 0.25ns with it.

@wjakob (Member) commented Nov 1, 2016

I think all of this is fast enough that we basically don't care -- fractions of a nanosecond don't matter much when the next Python C API call takes hundreds of them (plus, it's really tricky to measure things at that magnitude -- you'll generally want to view and understand the assembly to isolate the signal from noise, i.e. unrelated compiler passes). My main optimization goal for these things has always been to cut down on generated object code rather than shaving off a nanosecond somewhere. (After working with Boost.Python for many years, object code bloat was one of the things that really bothered me.) This has involved aggressively un-templating certain pieces of code and playing with inline/noinline annotations.

I think it's clear that calling the following function

__attribute__((always_inline)) inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (!internals_ptr) load_internals(internals_ptr);
    return *internals_ptr;
}

will generate more code (if referenced many times) than

PYBIND11_NOINLINE inline internals &get_internals() {
 /* anything */
}

In the first case, it's a load + conditional jump + function call. In the second case, it's a function call to a function that is instantiated just once.

@wjakob (Member) commented Nov 1, 2016

PS: There was a statement to the contrary above, which I admit I don't understand -- however, I think this general approach is sound.

@aldanor (Member, Author) commented Nov 1, 2016

Yep, that all makes sense, but it doesn't seem to explain #472 (comment), which appears to be both faster (because the non-inlined function call is never reached after the very first invocation) and to generate less (??) code -- looks like a win/win unless we've missed something.

I'd leave it for future consideration if we're not changing it now, since it's not directly part of this PR anyway (which is pretty much finished).

@@ -655,99 +689,99 @@ struct field_descriptor {
dtype descr;
};

template<typename F>
static PYBIND11_NOINLINE void register_structured_dtype(
Review comment (Member):

Probably doesn't need to be a template: F -> std::initializer_list<field_descriptor>.
Also static -> inline noinline

@aldanor (Member, Author) replied:

Agreed, fixed.

@aldanor force-pushed the feature/shared-dtypes branch 3 times, most recently from 81e90ea to e44f35b on November 2, 2016 08:55
@wjakob (Member) commented Nov 3, 2016

Ah, I was going to merge this now, but it has conflicts after merging the other PR.

@aldanor (Member, Author) commented Nov 3, 2016

I could rebase if you want

@wjakob (Member) commented Nov 3, 2016

Yes, please do -- thanks!

@aldanor force-pushed the feature/shared-dtypes branch from e44f35b to cc8ff16 on November 3, 2016 09:36
@aldanor (Member, Author) commented Nov 3, 2016

Rebased, should be good to go

@wjakob (Member) commented Nov 3, 2016

Great, thanks!

@wjakob merged commit 0a9ef9c into pybind:master on Nov 3, 2016