Optimize Once::doit when initialization is already completed #13349

Closed
stepancheg wants to merge 1 commit

Conversation

stepancheg
Contributor

* `load` is much cheaper than `fetch_add`, at least on x86_64.
* The common path of `doit` can be inlined.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;
loop {
    unsafe {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);

        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

The test was executed on a Mac (2GHz Intel Core i7): 700ps per iteration
with the patch applied, 17000ps per iteration without it.
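
(For readers skimming the thread: the shape of the optimization is a cheap atomic load that guards the fully synchronized slow path. The sketch below is a hypothetical illustration in modern Rust, not the 2014 libstd code or this patch; `DoOnce`, its `Mutex`-based slow path, and the chosen orderings are illustrative assumptions.)

```
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Hypothetical "do once" guard, illustrating only the fast-path idea.
struct DoOnce {
    done: AtomicBool,
    lock: Mutex<()>,
}

impl DoOnce {
    const fn new() -> DoOnce {
        DoOnce { done: AtomicBool::new(false), lock: Mutex::new(()) }
    }

    fn doit<F: FnOnce()>(&self, f: F) {
        // Fast path: a plain load is much cheaper than a read-modify-write.
        if self.done.load(Ordering::Acquire) {
            return;
        }
        self.doit_slow(f);
    }

    #[cold]
    fn doit_slow<F: FnOnce()>(&self, f: F) {
        let _guard = self.lock.lock().unwrap();
        // Re-check under the lock: another thread may have finished first.
        if !self.done.load(Ordering::Relaxed) {
            f();
            // Release pairs with the Acquire load above, so writes made by
            // `f` are visible to threads that observe `done == true`.
            self.done.store(true, Ordering::Release);
        }
    }
}
```

In the actual patch the fast path is a `SeqCst` load of `self.cnt` (shown later in this thread); the structure is the same, checking cheaply and falling through to the slow path only while initialization might still be pending.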
@lilyball
Contributor

lilyball commented Apr 5, 2014

I think you mean µs, not ps.

@lilyball
Contributor

lilyball commented Apr 5, 2014

I'm also worried about correctness. Is a relaxed load here sufficient?

@lilyball
Contributor

lilyball commented Apr 6, 2014

A different change that's more likely to be safe is changing the `self.cnt.store(int::MIN, atomics::SeqCst)` to use `atomics::Relaxed`, since there's no ordering requirement there; it only exists to prevent an infinite loop of `doit`s from behaving wrong.
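
(Expressed as a diff, the conservative change proposed above would look roughly like this; the store being modified is the one quoted in the previous sentence, and the comment simply restates that reasoning.)

```
 // After the initialization closure has run:
-self.cnt.store(int::MIN, atomics::SeqCst);
+// No ordering requirement here; per the comment above, the store only
+// prevents an infinite loop of doit's from behaving wrong.
+self.cnt.store(int::MIN, atomics::Relaxed);
```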

@lilyball
Contributor

lilyball commented Apr 6, 2014

BTW, your benchmarks are off. If you compiled your version of once in the same crate as your benchmark, optimization will screw with your actual timing.

In my tests, using a separate crate (and no LTO) to avoid the unwanted optimization, on a 3.4GHz Intel Core i7 iMac, I get approximately 13200µs for the libstd version, 3100µs for your version, and 9100µs for the version with my proposed conservative change.

@stepancheg
Contributor Author

@kballard

> I think you mean µs, not ps.

I did mean ps. One call to `doit` takes less than 1ns, so I multiplied the result by 1000.

> I'm also worried about correctness. Is a relaxed load here sufficient?

I'm not an expert. I thought about it some more, and I believe "relaxed" guarantees that `doit` will be called exactly once, but it does not guarantee that the result of `doit` will be visible to another thread. So I'll change it to SeqCst in a minute.
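
(A minimal sketch of the visibility concern, using modern Rust atomics and hypothetical types; it illustrates the reasoning in the previous paragraph and is not code from this PR.)

```
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// `value` stands in for whatever the once-closure initializes;
// `done` is the flag checked on the fast path.
struct Cache {
    done: AtomicBool,
    value: UnsafeCell<u64>,
}

unsafe impl Sync for Cache {}

impl Cache {
    fn get(&self) -> Option<u64> {
        // With Ordering::Relaxed this load could observe `done == true`
        // without a happens-before edge to the write of `value`, so the
        // read below might see a stale value. An Acquire load (paired with
        // a Release store after the write) or SeqCst closes that gap.
        if self.done.load(Ordering::Acquire) {
            Some(unsafe { *self.value.get() })
        } else {
            None
        }
    }
}
```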

@lilyball
Contributor

lilyball commented Apr 6, 2014

@stepancheg Oops, you're right. I misread the timing code. And actually, thinking about it, µs is obviously wrong because that would be too slow.

@stepancheg
Contributor Author

@kballard What is the "unwanted optimization", and why would you turn off LTO?

@lilyball
Contributor

lilyball commented Apr 6, 2014

@stepancheg You're trying to time a delicate threading thing. In my tests, when the once code is in the same crate as the benchmarking code, I get approximately 600ps from your code. But if I insert a single extra call to doit outside of your loop, the time per iter jumps to 3100ps. This is clearly nonsense, which means that the extra call is preventing optimizations that rustc was doing. I'm assuming it was inlining the call to doit() (because it only had one call site), but the extra call made it reconsider.

And turning off LTO is because LTO could enable it to start making those unwanted optimizations again.

Taking optimization to its logical conclusion, if rustc could figure out not only that doit() was only called from one spot, but that this one spot was always on the same thread, it could ditch all the sequential consistency and memory barriers and see a massive speedup with no observable change to your code. But that's obviously terrible because we're explicitly trying to measure it with the memory barriers in place.
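
(A hypothetical modern harness illustrating the same precaution. `std::hint::black_box` and `#[inline(never)]` are today's tools for keeping the optimizer from seeing through the timed call; they are my substitution for illustration, whereas the approach described above was a separate crate with LTO disabled.)

```
use std::hint::black_box;
use std::sync::Once;
use std::time::Instant;

// Keep the call boundary intact so the loop measures a real call.
#[inline(never)]
fn call_doit(once: &Once) {
    once.call_once(|| println!("once!"));
}

fn main() {
    static O: Once = Once::new();
    let iters: u64 = 50_000_000;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box hides the origin of &O from the optimizer (best
        // effort), so the loop body cannot be specialized away.
        call_doit(black_box(&O));
    }
    let ps_per_iter = start.elapsed().as_nanos() * 1000 / u128::from(iters);
    println!("{} ps per iter", ps_per_iter);
}
```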

@stepancheg
Contributor Author

@kballard That's an interesting observation. Probably `doit` should be split into two functions:

```
#[inline(always)]
fn doit(...) {
    if self.cnt.load(...) < 0 {
        return;
    }
    self.doit_slow(...);
}

fn doit_slow(...) { ... }
```

@lilyball
Contributor

lilyball commented Apr 6, 2014

I've come up with a slightly different algorithm for doit() that I believe is safe, and has the same performance as your code. Among other changes, it ditches SeqCst for proper acquire/release. However, being an atomic building block, it really needs a stringent review. I'll submit it as a separate PR.

@stepancheg
Contributor Author

@kballard OK, waiting for your PR.

@stepancheg
Contributor Author

BTW, I've uploaded a new patch where `doit` is split into two parts.

@lilyball
Contributor

lilyball commented Apr 6, 2014

Filed as #13351.

@stepancheg
Contributor Author

@kballard I'm not sure that your version is correct, but it is definitely easier to understand.

@huonw
Member

huonw commented May 2, 2014

Ping. This is just waiting on a decision vs. #13351, right?

@lilyball
Contributor

lilyball commented May 2, 2014

@huonw That's my assumption. I'm (obviously) still in favor of #13351, because I think the algorithm is easier to reason about (and of course it uses correct memory orderings so the slow path is faster, if that makes a difference).

@alexcrichton
Member

I suppose I mentioned this on #13351 but not on this one: I have yet to be convinced that the `foo` and `foo_slow` functions are necessary.

@stepancheg
Contributor Author

@alexcrichton `doit_slow` is needed to avoid marking the whole of `doit` as `#[inline]`.

@alexcrichton
Member

I do not disagree that doit should be inlined and doit_slow should not be inlined. I don't believe that there is sufficient data to start this trend of splitting functions into foo and foo_slow.

@stepancheg
Contributor Author

@alexcrichton Well, for now I can simplify this patch by merging `doit` and `doit_slow` into one function and keep the common-path optimization. `doit` can be split later.

@lilyball
Contributor

lilyball commented May 2, 2014

What is the downside of keeping doit and doit_slow separate? The code has already been written.

@alexcrichton
Member

I view it as a premature optimization which hinders understanding what's going on. As I mentioned on the other PR, the optimization applies to many situations, none of which are doing this today.

Additionally, I don't remember seeing concrete data which shows definitively that a split version is better than a unified version.

@alexcrichton
Member

Closing due to inactivity, but feel free to reopen with the unification of `doit` and `doit_slow` back into one method.

@lilyball
Contributor

@alexcrichton I just tested, and the split actually helps. Specifically, in my bench harness, if I call once.doit() twice (as in, two separate calls in code), with the split both calls are inlined and the non-contested case takes 0.5ns. But without the split, the calls aren't inlined and it takes 3ns instead. Without the split a single call to once.doit() also inlines, but it's a shame that adding a second call will suddenly make the first less efficient.

I'm willing to resubmit my own version (#13351) without the split, but it seems like a flagrantly unnecessary 6x slowdown for no upside that I can see (you like to claim that it's more readable without the split, but that's entirely subjective and I do not agree).

@alexcrichton
Member

On my machine, a call+ret instruction takes about 2ns. If you're invoking this in a tight loop, then it may make sense to split the two, but this is not the kind of function used in a tight loop. Let's keep it as one function.

@lilyball
Contributor

@alexcrichton No, but it's the kind of call that should be usable indiscriminately in other methods, and those methods may be called in a tight loop.

I still don't understand why you want to keep it joined, though. You haven't provided any rationale at all aside from what amounts to 100% subjective personal preference.

@lilyball
Contributor

The entire purpose of Once::doit() is to be as low-overhead as possible. If overhead doesn't matter, then I may as well just use a Mutex + bool. Given that, it seems preposterous to intentionally take a 6x slowdown in the fast path with no objective upside. And when talking about subjective upside, I actually think the doit_slow version is more readable, because it makes it obvious that this is expected to be the slow path.
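
(For comparison, a minimal sketch of the "Mutex + bool" baseline mentioned above, in modern Rust; `SlowOnce` is a hypothetical name. It is correct, but it takes the lock on every call, even long after initialization has completed.)

```
use std::sync::Mutex;

struct SlowOnce {
    // true once the closure has run
    done: Mutex<bool>,
}

impl SlowOnce {
    const fn new() -> SlowOnce {
        SlowOnce { done: Mutex::new(false) }
    }

    fn doit<F: FnOnce()>(&self, f: F) {
        let mut done = self.done.lock().unwrap();
        if !*done {
            f();
            *done = true;
        }
    }
}
```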

@stepancheg
Contributor Author

@alexcrichton I was sure that this is exactly the kind of function to be used in a tight loop. I use it in a scenario like this:

```
struct Foo { ... }
struct Bar { foo: Option<Foo> }

impl Bar {
    // this function may be called frequently
    fn get_foo<'a>(&'a self) -> &'a Foo {
        if self.foo.is_some() {
            self.foo.get_ref()
        } else {
            static mut foo_default_instance: *Foo = 0 as ...;
            static once = ...;
            once.doit(|| { /* initialize foo_default_instance */ });
            cast::transmute(foo_default_instance)
        }
    }
}
```
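
(The same lazy-default pattern in today's Rust, for readers less familiar with the 2014 idioms above: `OnceLock` replaces the `static mut` + `Once` dance, and `Foo { id: 0 }` is a placeholder default rather than anything from the PR.)

```
use std::sync::OnceLock;

struct Foo {
    id: u32, // placeholder field
}

struct Bar {
    foo: Option<Foo>,
}

impl Bar {
    // May be called frequently; falls back to a lazily initialized,
    // process-wide default when `self.foo` is not set.
    fn get_foo(&self) -> &Foo {
        static FOO_DEFAULT: OnceLock<Foo> = OnceLock::new();
        self.foo
            .as_ref()
            .unwrap_or_else(|| FOO_DEFAULT.get_or_init(|| Foo { id: 0 }))
    }
}
```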

@lilyball
Contributor

@alexcrichton Here's a perfect example of needing Once in a tight loop. I just tested, and calling mach_timebase_info() takes about 131ns/iter on my machine. I'd like to tweak libtime to use Once to call it only once (because the timebase never changes). This would happen in precise_time_ns(). precise_time_ns() is something I can easily see being called multiple times in a loop, if you have any need to time only portions of your loop (e.g. you have setup and teardown that should be excluded from your benchmark).

Not only that, but even if you're not calling it in a loop, precise_time_ns() should avoid any unnecessary overhead. Without the split I can take the mach_timebase_info overhead from 131ns down to 3ns (after the first call, of course), but I'd rather take it down to 0.5ns.
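
(A sketch of the caching idea just described, using modern `OnceLock`; `Timebase`, `query_timebase`, and the `raw_ticks` parameter are placeholders, since the real code would call the platform's `mach_timebase_info()` via FFI.)

```
use std::sync::OnceLock;

#[derive(Clone, Copy)]
struct Timebase {
    numer: u32,
    denom: u32,
}

fn query_timebase() -> Timebase {
    // Imagine the ~131ns mach_timebase_info() FFI call here.
    Timebase { numer: 1, denom: 1 }
}

fn precise_time_ns(raw_ticks: u64) -> u64 {
    // The timebase never changes, so query it once and cache the result.
    static TIMEBASE: OnceLock<Timebase> = OnceLock::new();
    let tb = TIMEBASE.get_or_init(query_timebase);
    raw_ticks * u64::from(tb.numer) / u64::from(tb.denom)
}
```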

bors added a commit that referenced this pull request May 15, 2014
Submitting the PR again, because I cannot reopen #13349, and GitHub does not attach the new patch to that PR.

=======

Optimize `Once::doit`: perform an optimistic check that initialization is
already completed. `load` is much cheaper than `fetch_add`, at least
on x86_64.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;
unsafe {
    loop {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);

        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

Test executed on Mac, Intel Core i7 2GHz. Result is:
* 20ns per iteration without patch
*  4ns per iteration with this patch applied

`Once::doit` could be even faster (800ps per iteration) if the `doit`
function were split into a pair of `doit`/`doit_slow`, with `doit` marked
as `#[inline]` like this:

```
#[inline(always)]
pub fn doit(&self, f: ||) {
    if self.cnt.load(atomics::SeqCst) < 0 {
        return
    }

    self.doit_slow(f);
}

fn doit_slow(&self, f: ||) { ... }
```
flip1995 pushed a commit to flip1995/rust that referenced this pull request Nov 7, 2024

Fix allow_attributes when expanded from some macros

fixes rust-lang#13349

The issue here was that the start pattern being matched against the original source code was not specific enough. When using derive macros, or in the issue's case a `#[repr(C)]`, the `#` would match the start pattern, meaning that the expanded macro appeared to be unchanged and Clippy would lint it.

The change I made was to make the matching more specific by matching `#[ident` at the start. We still need the second string to match just the ident on its own because of things like `#[cfg_attr(panic = "unwind", allow(unused))]`.

I also noticed some typos with start and end; these code paths weren't being reached, so this doesn't fix anything.

changelog: FP: [`allow_attributes`]: don't trigger when expanded from some macros