Optimize Once::doit when initialization is already completed #13349

Closed
stepancheg wants to merge 1 commit

Conversation

stepancheg
Contributor

* `load` is much cheaper than `fetch_add`, at least on x86_64.
* The common path of `doit` can be inlined.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;
loop {
    unsafe {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);

        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

The test was executed on a Mac (2GHz Intel Core i7): 700ps per iteration
with the patch applied, 17000ps per iteration without it.
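
(For readers skimming the thread: the shape of the optimization is a cheap atomic load that guards the fully synchronized slow path. The sketch below is a hypothetical illustration in modern Rust, not the 2014 libstd code or this patch; `DoOnce`, its `Mutex`-based slow path, and the chosen orderings are illustrative assumptions.)

```
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

// Hypothetical "do once" guard, illustrating only the fast-path idea.
struct DoOnce {
    done: AtomicBool,
    lock: Mutex<()>,
}

impl DoOnce {
    const fn new() -> DoOnce {
        DoOnce { done: AtomicBool::new(false), lock: Mutex::new(()) }
    }

    fn doit<F: FnOnce()>(&self, f: F) {
        // Fast path: a plain load is much cheaper than a read-modify-write.
        if self.done.load(Ordering::Acquire) {
            return;
        }
        self.doit_slow(f);
    }

    #[cold]
    fn doit_slow<F: FnOnce()>(&self, f: F) {
        let _guard = self.lock.lock().unwrap();
        // Re-check under the lock: another thread may have finished first.
        if !self.done.load(Ordering::Relaxed) {
            f();
            // Release pairs with the Acquire load above, so writes made by
            // `f` are visible to threads that observe `done == true`.
            self.done.store(true, Ordering::Release);
        }
    }
}
```

In the actual patch the fast path is a `SeqCst` load of `self.cnt` (shown later in this thread); the structure is the same, checking cheaply and falling through to the slow path only while initialization might still be pending.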
@lilyball
Contributor

lilyball commented Apr 5, 2014

I think you mean µs, not ps.

@lilyball
Contributor

lilyball commented Apr 5, 2014

I'm also worried about correctness. Is a relaxed load here sufficient?

@lilyball
Contributor

lilyball commented Apr 6, 2014

A different change that's more likely to be safe is changing the `self.cnt.store(int::MIN, atomics::SeqCst)` to use `atomics::Relaxed`, since there's no ordering requirement there; it only exists to prevent an infinite loop of `doit`s from behaving wrong.
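
(Expressed as a diff, the conservative change proposed above would look roughly like this; the store being modified is the one quoted in the previous sentence, and the comment simply restates that reasoning.)

```
 // After the initialization closure has run:
-self.cnt.store(int::MIN, atomics::SeqCst);
+// No ordering requirement here; per the comment above, the store only
+// prevents an infinite loop of doit's from behaving wrong.
+self.cnt.store(int::MIN, atomics::Relaxed);
```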

@lilyball
Contributor

lilyball commented Apr 6, 2014

BTW, your benchmarks are off. If you compiled your version of once in the same crate as your benchmark, optimization will screw with your actual timing.

In my tests, using a separate crate (and no LTO) to avoid the unwanted optimization, on a 3.4GHz Intel Core i7 iMac, I get approximately 13200µs for the libstd version, 3100µs for your version, and 9100µs for the version with my proposed conservative change.

@stepancheg
Contributor Author

@kballard

> I think you mean µs, not ps.

I did mean ps. One call to `doit` takes less than 1ns, so I multiplied the result by 1000.

> I'm also worried about correctness. Is a relaxed load here sufficient?

I'm not an expert. I thought about it some more, and I believe "relaxed" guarantees that `doit` will be called exactly once, but it does not guarantee that the result of `doit` will be visible to another thread. So I'll change it to SeqCst in a minute.
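
(A minimal sketch of the visibility concern, using modern Rust atomics and hypothetical types; it illustrates the reasoning in the previous paragraph and is not code from this PR.)

```
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// `value` stands in for whatever the once-closure initializes;
// `done` is the flag checked on the fast path.
struct Cache {
    done: AtomicBool,
    value: UnsafeCell<u64>,
}

unsafe impl Sync for Cache {}

impl Cache {
    fn get(&self) -> Option<u64> {
        // With Ordering::Relaxed this load could observe `done == true`
        // without a happens-before edge to the write of `value`, so the
        // read below might see a stale value. An Acquire load (paired with
        // a Release store after the write) or SeqCst closes that gap.
        if self.done.load(Ordering::Acquire) {
            Some(unsafe { *self.value.get() })
        } else {
            None
        }
    }
}
```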

@lilyball
Contributor

lilyball commented Apr 6, 2014

@stepancheg Oops, you're right. I misread the timing code. And actually, thinking about it, µs is obviously wrong because that would be too slow.

@stepancheg
Contributor Author

@kballard What is the "unwanted optimization", and why would you turn off LTO?

@lilyball
Contributor

lilyball commented Apr 6, 2014

@stepancheg You're trying to time a delicate threading thing. In my tests, when the once code is in the same crate as the benchmarking code, I get approximately 600ps from your code. But if I insert a single extra call to doit outside of your loop, the time per iter jumps to 3100ps. This is clearly nonsense, which means that the extra call is preventing optimizations that rustc was doing. I'm assuming it was inlining the call to doit() (because it only had one call site), but the extra call made it reconsider.

And turning off LTO is because LTO could enable it to start making those unwanted optimizations again.

Taking optimization to its logical conclusion, if rustc could figure out not only that doit() was only called from one spot, but that this one spot was always on the same thread, it could ditch all the sequential consistency and memory barriers and see a massive speedup with no observable change to your code. But that's obviously terrible because we're explicitly trying to measure it with the memory barriers in place.
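
(A hypothetical modern harness illustrating the same precaution. `std::hint::black_box` and `#[inline(never)]` are today's tools for keeping the optimizer from seeing through the timed call; they are my substitution for illustration, whereas the approach described above was a separate crate with LTO disabled.)

```
use std::hint::black_box;
use std::sync::Once;
use std::time::Instant;

// Keep the call boundary intact so the loop measures a real call.
#[inline(never)]
fn call_doit(once: &Once) {
    once.call_once(|| println!("once!"));
}

fn main() {
    static O: Once = Once::new();
    let iters: u64 = 50_000_000;
    let start = Instant::now();
    for _ in 0..iters {
        // black_box hides the origin of &O from the optimizer (best
        // effort), so the loop body cannot be specialized away.
        call_doit(black_box(&O));
    }
    let ps_per_iter = start.elapsed().as_nanos() * 1000 / u128::from(iters);
    println!("{} ps per iter", ps_per_iter);
}
```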

@stepancheg
Contributor Author

@kballard That's an interesting observation. Probably `doit` should be split into two functions:

```
#[inline(always)]
fn doit(...) {
    if self.cnt.load(...) < 0 {
        return;
    }
    self.doit_slow(...);
}

fn doit_slow(...) { ... }
```

@lilyball
Contributor

lilyball commented Apr 6, 2014

I've come up with a slightly different algorithm for doit() that I believe is safe, and has the same performance as your code. Among other changes, it ditches SeqCst for proper acquire/release. However, being an atomic building block, it really needs a stringent review. I'll submit it as a separate PR.

@stepancheg
Contributor Author

@kballard OK, waiting for your PR.

@stepancheg
Contributor Author

BTW, I've uploaded a new patch where `doit` is split into two parts.

@lilyball
Contributor

lilyball commented Apr 6, 2014

Filed as #13351.

@stepancheg
Contributor Author

@kballard I'm not sure that your version is correct, but it is definitely easier to understand.

@huonw
Member

huonw commented May 2, 2014

Ping. This is just waiting on a decision vs. #13351, right?

@lilyball
Contributor

lilyball commented May 2, 2014

@huonw That's my assumption. I'm (obviously) still in favor of #13351, because I think the algorithm is easier to reason about (and of course it uses correct memory orderings so the slow path is faster, if that makes a difference).

@alexcrichton
Member

I suppose I mentioned this on #13351 but not on this one: I have yet to be convinced that the `foo` and `foo_slow` functions are necessary.

@stepancheg
Contributor Author

@alexcrichton `doit_slow` is needed to avoid marking the whole of `doit` as `#[inline]`.

@alexcrichton
Member

I do not disagree that doit should be inlined and doit_slow should not be inlined. I don't believe that there is sufficient data to start this trend of splitting functions into foo and foo_slow.

@stepancheg
Contributor Author

@alexcrichton Well, for now I can simplify this patch by merging `doit` and `doit_slow` into one function and keep the common-path optimization. `doit` can be split later.

@lilyball
Contributor

lilyball commented May 2, 2014

What is the downside of keeping doit and doit_slow separate? The code has already been written.

@alexcrichton
Member

I view it as a premature optimization which hinders understanding what's going on. As I mentioned on the other PR, the optimization applies to many situations, none of which are doing this today.

Additionally, I don't remember seeing concrete data which shows definitively that a split version is better than a unified version.

@alexcrichton
Member

Closing due to inactivity, but feel free to reopen with the unification of `doit` and `doit_slow` back into one method.

@lilyball
Contributor

@alexcrichton I just tested, and the split actually helps. Specifically, in my bench harness, if I call once.doit() twice (as in, two separate calls in code), with the split both calls are inlined and the non-contested case takes 0.5ns. But without the split, the calls aren't inlined and it takes 3ns instead. Without the split a single call to once.doit() also inlines, but it's a shame that adding a second call will suddenly make the first less efficient.

I'm willing to resubmit my own version (#13351) without the split, but it seems like a flagrantly unnecessary 6x slowdown for no upside that I can see (you like to claim that it's more readable without the split, but that's entirely subjective and I do not agree).

@alexcrichton
Member

On my machine, a call+ret instruction takes about 2ns. If you're invoking this in a tight loop, then it may make sense to split the two, but this is not the kind of function used in a tight loop. Let's keep it as one function.

@lilyball
Contributor

@alexcrichton No, but it's the kind of call that should be usable indiscriminately in other methods, and those methods may be called in a tight loop.

I still don't understand why you want to keep it joined, though. You haven't provided any rationale at all aside from what amounts to 100% subjective personal preference.

@lilyball
Contributor

The entire purpose of Once::doit() is to be as low-overhead as possible. If overhead doesn't matter, then I may as well just use a Mutex + bool. Given that, it seems preposterous to intentionally take a 6x slowdown in the fast path with no objective upside. And when talking about subjective upside, I actually think the doit_slow version is more readable, because it makes it obvious that this is expected to be the slow path.
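
(For comparison, a minimal sketch of the "Mutex + bool" baseline mentioned above, in modern Rust; `SlowOnce` is a hypothetical name. It is correct, but it takes the lock on every call, even long after initialization has completed.)

```
use std::sync::Mutex;

struct SlowOnce {
    // true once the closure has run
    done: Mutex<bool>,
}

impl SlowOnce {
    const fn new() -> SlowOnce {
        SlowOnce { done: Mutex::new(false) }
    }

    fn doit<F: FnOnce()>(&self, f: F) {
        let mut done = self.done.lock().unwrap();
        if !*done {
            f();
            *done = true;
        }
    }
}
```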

@stepancheg
Contributor Author

@alexcrichton I was sure that this is exactly the kind of function to be used in a tight loop. I use it in a scenario like this:

```
struct Foo { ... }
struct Bar { foo: Option<Foo> }

impl Bar {
    // this function may be called frequently
    fn get_foo<'a>(&'a self) -> &'a Foo {
        if self.foo.is_some() {
            self.foo.get_ref()
        } else {
            static mut foo_default_instance: *Foo = 0 as ...;
            static once = ...;
            once.doit(|| { /* initialize foo_default_instance */ });
            cast::transmute(foo_default_instance)
        }
    }
}
```
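
(The same lazy-default pattern in today's Rust, for readers less familiar with the 2014 idioms above: `OnceLock` replaces the `static mut` + `Once` dance, and `Foo { id: 0 }` is a placeholder default rather than anything from the PR.)

```
use std::sync::OnceLock;

struct Foo {
    id: u32, // placeholder field
}

struct Bar {
    foo: Option<Foo>,
}

impl Bar {
    // May be called frequently; falls back to a lazily initialized,
    // process-wide default when `self.foo` is not set.
    fn get_foo(&self) -> &Foo {
        static FOO_DEFAULT: OnceLock<Foo> = OnceLock::new();
        self.foo
            .as_ref()
            .unwrap_or_else(|| FOO_DEFAULT.get_or_init(|| Foo { id: 0 }))
    }
}
```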

@lilyball
Contributor

@alexcrichton Here's a perfect example of needing Once in a tight loop. I just tested, and calling mach_timebase_info() takes about 131ns/iter on my machine. I'd like to tweak libtime to use Once to call it only once (because the timebase never changes). This would happen in precise_time_ns(). precise_time_ns() is something I can easily see being called multiple times in a loop, if you have any need to time only portions of your loop (e.g. you have setup and teardown that should be excluded from your benchmark).

Not only that, but even if you're not calling it in a loop, precise_time_ns() should avoid any unnecessary overhead. Without the split I can take the mach_timebase_info overhead from 131ns down to 3ns (after the first call, of course), but I'd rather take it down to 0.5ns.
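
(A sketch of the caching idea just described, using modern `OnceLock`; `Timebase`, `query_timebase`, and the `raw_ticks` parameter are placeholders, since the real code would call the platform's `mach_timebase_info()` via FFI.)

```
use std::sync::OnceLock;

#[derive(Clone, Copy)]
struct Timebase {
    numer: u32,
    denom: u32,
}

fn query_timebase() -> Timebase {
    // Imagine the ~131ns mach_timebase_info() FFI call here.
    Timebase { numer: 1, denom: 1 }
}

fn precise_time_ns(raw_ticks: u64) -> u64 {
    // The timebase never changes, so query it once and cache the result.
    static TIMEBASE: OnceLock<Timebase> = OnceLock::new();
    let tb = TIMEBASE.get_or_init(query_timebase);
    raw_ticks * u64::from(tb.numer) / u64::from(tb.denom)
}
```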

bors added a commit that referenced this pull request May 15, 2014
Submitting the PR again, because I cannot reopen #13349, and GitHub does not attach the new patch to that PR.

=======

Optimize `Once::doit`: perform an optimistic check that initialization is
already completed. `load` is much cheaper than `fetch_add`, at least
on x86_64.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;
unsafe {
    loop {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);

        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

Test executed on Mac, Intel Core i7 2GHz. Result is:
* 20ns per iteration without patch
*  4ns per iteration with this patch applied

`Once::doit` could be even faster (800ps per iteration) if the `doit`
function were split into a pair of `doit`/`doit_slow`, with `doit` marked
as `#[inline]` like this:

```
#[inline(always)]
pub fn doit(&self, f: ||) {
    if self.cnt.load(atomics::SeqCst) < 0 {
        return
    }

    self.doit_slow(f);
}

fn doit_slow(&self, f: ||) { ... }
```
flip1995 pushed a commit to flip1995/rust that referenced this pull request Nov 7, 2024

Fix allow_attributes when expanded from some macros

fixes rust-lang#13349

The issue here was that the start pattern being matched against the original source code was not specific enough. When using derive macros, or in the issue's case a `#[repr(C)]`, the `#` would match the start pattern, meaning that the expanded macro appeared to be unchanged and Clippy would lint it.

The change I made was to make the matching more specific by matching `#[ident` at the start. We still need the second string to match just the ident on its own because of things like `#[cfg_attr(panic = "unwind", allow(unused))]`.

I also noticed some typos with start and end; these code paths weren't being reached, so this doesn't fix anything.

changelog: FP: [`allow_attributes`]: don't trigger when expanded from some macros