Support for sharing dtypes across extensions + public shared data API #472

Merged: 5 commits merged into pybind:master on Nov 3, 2016

Conversation

@aldanor (Member) commented Oct 31, 2016

This PR adds support for sharing registered dtypes across multiple extension modules, which previously may or may not have worked depending on the compiler, optimization settings, linker settings, etc.

The dtypes are shared via the same capsule where the internals are stored. As part of this PR, a few functions are added to the public API so the "shared" part of the capsule can be accessed without breaking backwards compatibility (see get_shared_data() and set_shared_data()).

I've also moved register_dtype() out of the npy_format_descriptor<> template, which avoids considerable code bloat when registering many dtypes.

(Original issue: #468)
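
A minimal usage sketch of the proposed API, assuming the signatures introduced in this PR (a std::string key and a void * value); my_shared_state, module_a, and the key string below are made up for illustration:

#include <pybind11/pybind11.h>

namespace py = pybind11;

// Some state we'd like to share across several extension modules.
struct my_shared_state { int counter = 0; };

PYBIND11_PLUGIN(module_a) {
    py::module m("module_a");

    // Look the entry up first so that only the first module to be imported
    // allocates it; any later module (built as a separate extension) will
    // find it in the internals capsule via the same string key.
    auto *state = static_cast<my_shared_state *>(
        py::get_shared_data("my_project::state"));
    if (!state) {
        state = new my_shared_state();
        py::set_shared_data("my_project::state", state);
    }

    m.def("bump", [state]() { return ++state->counter; });
    return m.ptr();
}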

@aldanor (Member, Author) commented Oct 31, 2016

A few minor outstanding questions:

  • Should the key type of shared_data be std::string or const char * (in which case it would also need a comparator and hasher -- see the sketch after this list)? I don't think it matters much, since this is meant to be used in initialization code that likely runs only once during the program's lifetime, so I just used std::string.
  • Should detail::get_internals() in cast.h also be made not noinline, the same way it's done for numpy.h in this PR, i.e. by splitting it into two parts?
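
(For reference, this is roughly the extra machinery a const char * key would require -- just a sketch of the alternative that wasn't taken, with made-up helper names:)

#include <cstddef>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>

// Hash the pointed-to characters rather than the pointer value.
struct cstr_hash {
    std::size_t operator()(const char *s) const { return std::hash<std::string>()(s); }
};

// Compare string contents rather than pointer identity.
struct cstr_equal {
    bool operator()(const char *a, const char *b) const { return std::strcmp(a, b) == 0; }
};

using shared_data_map = std::unordered_map<const char *, void *, cstr_hash, cstr_equal>;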

@wjakob (Member) commented Oct 31, 2016

Good point about std::string vs char * -- let's use the former then.

The purpose of NOINLINE is to prohibit any kind of inlining for medium-sized functions that are called from templated code, or simply from many different places (for trivial ones we don't care, and large functions wouldn't actually get inlined anyway). get_internals() definitely satisfies that criterion. A function call is fairly compact and leads to considerable object code savings in this case -- I was able to shave off quite a bit throughout the development of this project.
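
For reference, a simplified sketch of what PYBIND11_NOINLINE boils down to (the authoritative definition lives in pybind11's common header; this is an approximation):

// Map PYBIND11_NOINLINE to the compiler-specific "never inline" attribute.
#if defined(_MSC_VER)
#  define PYBIND11_NOINLINE __declspec(noinline)
#else
#  define PYBIND11_NOINLINE __attribute__ ((noinline))
#endif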

@wjakob (Member) commented Oct 31, 2016

PS: It would be nice if you could add a couple of lines about get_shared / set_shared to the docs.

@wjakob closed this Oct 31, 2016
@wjakob reopened this Oct 31, 2016
@wjakob (Member) commented Oct 31, 2016

(whoops)

@aldanor (Member, Author) commented Oct 31, 2016

Re: noinline, wouldn't this be better (as in avoiding a non-inlined function call on all invocations except the very first one)?

PYBIND11_NOINLINE inline internals &load_internals() {
    // capsule loading etc
}

inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (internals_ptr)
        return *internals_ptr;
    internals_ptr = &load_internals();
    return *internals_ptr;
}

@aldanor (Member, Author) commented Oct 31, 2016

(typo in the example above, fixed)

@wjakob (Member) commented Oct 31, 2016

Take a look at this in a disassembler, you'll be surprised :) -- static local variables are tricky, and they generate a fair bit of code (much more than a function call).

@jagerman (Member):

Give it a try and see how much it enlarges the .so. It's probably enough of an increase that it isn't worthwhile, but it would be nice to see anyway.

@jagerman (Member) commented Oct 31, 2016

Curiously, I see a slight decrease in total .so size when applying @aldanor's suggestion -- though to make it work, I changed it like so:

PYBIND11_NOINLINE inline void load_internals(internals *&internals_ptr) {
    ...
}

__attribute__((always_inline)) inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (!internals_ptr) load_internals(internals_ptr);
    return *internals_ptr;
}

Without the inline-forcing attribute (or with inline forced off using PYBIND11_NOINLINE) I get an .so size of 985816; with inline forced, I get a slight .so drop to 985688.

With clang++-3.8 on Debian, I see no difference in .so size at all.

@aldanor (Member, Author) commented Oct 31, 2016

That's curious indeed :)

Btw, I think you could simplify it a bit further, without the if check, no?

PYBIND11_NOINLINE internals* load_internals() {
    ...
}

__attribute__((always_inline)) inline internals &get_internals() {
    static auto ptr = load_internals();
    return *ptr;
}

@jagerman (Member):

That version slightly reduces the non-inlined case to 985624, but makes the inlined case considerably larger, at 989784. (And just for reference, the current master's single-function .so size is 985528.)

@aldanor force-pushed the feature/shared-dtypes branch from ee45775 to 0a53644 on October 31, 2016 21:42
@aldanor (Member, Author) commented Oct 31, 2016

Added an example to the docs for the shared_data API and changed get_numpy_internals() as in @jagerman's first example.

Re: get_internals(), what's the consensus? :)

@dean0x7d (Member):

Initializing a static variable to a compile-time constant eliminates most of the static machinery (guard variable + thread-safe initialization), which explains the size differences between the two implementations above. See the generated assembly for get_internals1 vs get_internals2: https://godbolt.org/g/peBY1L

As for splitting up get_internals() into always_inline and noinline: Is the goal better performance or smaller binary size? If it's for performance, it would be good to measure the runtime before complicating the code.
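
A self-contained sketch of the comparison behind the godbolt link (the names below are stand-ins, not pybind11's actual definitions); compile with -O2 and diff the assembly of the two functions:

struct internals { int dummy; };

internals *load_internals();   // slow path, defined elsewhere, called at most once

// get_internals1: the static has a dynamic initializer, so the compiler emits
// a hidden guard variable plus thread-safe initialization code here.
internals &get_internals1() {
    static internals *ptr = load_internals();
    return *ptr;
}

// get_internals2: the static is initialized to a compile-time constant
// (nullptr), so it lives in zero-initialized storage and no guard is needed;
// the fast path is just a load, a null check, and a return.
internals &get_internals2() {
    static internals *ptr = nullptr;
    if (!ptr) ptr = load_internals();
    return *ptr;
}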

@aldanor (Member, Author) commented Nov 1, 2016

Out of sheer curiosity, I tried running something like this:

#include <chrono>
#include "include/pybind11/pybind11.h"

using namespace std::chrono;
namespace py = pybind11;

PYBIND11_PLUGIN(perf) {
    py::module m("perf");
    m.def("get_internals", []() -> double {
        size_t p = 0;  // accumulate returned addresses so the calls can't be optimized away
        const size_t n = 100 * 1000 * 1000;
        auto t0 = high_resolution_clock::now();
        for (size_t i = 0; i < n; i++)
            p |= (size_t) &py::detail::get_internals();
        auto t1 = high_resolution_clock::now();
        p ^= (size_t) &py::detail::get_internals();  // p currently equals that address, so this zeroes it
        return p + duration_cast<nanoseconds>(t1 - t0).count() * 1. / n;  // p == 0, i.e. just the average ns per call
    });
    return m.ptr();
}

It looks like this version (#472 (comment)) is 5x faster than the current implementation. Given that it's only 0.5ns vs 2.5ns, you could say it's quite negligible, but still... :)

// I really hope the compiler didn't optimise something stupid away, but it doesn't look like it did. This is on OS X with -O3.

@aldanor (Member, Author) commented Nov 1, 2016

Also... by adding a branch prediction hint:

if (__builtin_expect(!internals_ptr, 0)) load_internals(internals_ptr);

it goes further down to ~0.25ns -- a free win.

🐼
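
For context, this is where the hint sits in @jagerman's variant above (__builtin_expect is a GCC/Clang builtin, so a portable build would have to guard or omit it; the declarations are stand-ins for the real ones):

struct internals;                                 // defined in pybind11's internals header
void load_internals(internals *&internals_ptr);   // the PYBIND11_NOINLINE slow path

inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (__builtin_expect(!internals_ptr, 0))      // hint: the branch is almost never taken
        load_internals(internals_ptr);
    return *internals_ptr;
}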

@dean0x7d (Member) commented Nov 1, 2016

If you don't mind experimenting a bit more, try the version from your comment above. I suspect it will be just as fast even without the expect hint (value semantics vs. mutable reference parameters).

@aldanor (Member, Author) commented Nov 1, 2016

@dean0x7d Just ran that version; it seems to yield the exact same timings as @jagerman's example: 0.5ns without the branch hint, 0.25ns with it.

@wjakob (Member) commented Nov 1, 2016

I think all of this is fast enough that we basically don't care -- fractions of a nanosecond don't matter much when the next Python C API call takes hundreds of them (plus, it's really tricky to measure things at that magnitude -- you'll generally want to view and understand the assembly to isolate the signal from noise, i.e. unrelated compiler passes). My main optimization goal for these things has always been to cut down on generated object code rather than shaving off a nanosecond somewhere. (After working with Boost.Python for many years, object code bloat was one of the things that really bothered me.) This has involved aggressively un-templating certain pieces of code and playing with inline/noinline annotations.

I think it's clear that calling the following function

__attribute__((always_inline)) inline internals &get_internals() {
    static internals *internals_ptr = nullptr;
    if (!internals_ptr) load_internals(internals_ptr);
    return *internals_ptr;
}

will generate more code (if referenced many times) than

PYBIND11_NOINLINE inline internals &get_internals() {
 /* anything */
}

In the first case, it's a load + conditional jump + function call. In the second case, it's a function call to a function that is instantiated just once.

@wjakob (Member) commented Nov 1, 2016

PS: There was a statement to the contrary above, which I admit I don't understand -- however, I think this general approach is sound.

@aldanor (Member, Author) commented Nov 1, 2016

Yep, that all makes sense, but it doesn't seem to explain #472 (comment), which appears to be both faster (because the non-inlined function call is never reached after the very first invocation) and to generate less (??) code -- looks like a win/win unless we've missed something.

I'd leave it for future consideration if we're not changing it now, since it's not directly part of this PR anyway (which is pretty much finished).

@@ -655,99 +689,99 @@ struct field_descriptor {
dtype descr;
};

template<typename F>
static PYBIND11_NOINLINE void register_structured_dtype(
Review comment (Member):

Probably doesn't need to be a template: F -> std::initializer_list<field_descriptor>.
Also static -> inline noinline

@aldanor (Member, Author) replied:

Agreed, fixed.

@aldanor force-pushed the feature/shared-dtypes branch 3 times, most recently from 81e90ea to e44f35b on November 2, 2016 08:55
@wjakob (Member) commented Nov 3, 2016

Ah, I was going to merge this now, but it has conflicts after merging the other PR.

@aldanor (Member, Author) commented Nov 3, 2016

I could rebase if you want

@wjakob (Member) commented Nov 3, 2016

Yes, please do -- thanks!

@aldanor force-pushed the feature/shared-dtypes branch from e44f35b to cc8ff16 on November 3, 2016 09:36
@aldanor (Member, Author) commented Nov 3, 2016

Rebased, should be good to go

@wjakob (Member) commented Nov 3, 2016

Great, thanks!

@wjakob merged commit 0a9ef9c into pybind:master on Nov 3, 2016