Optimize Once::doit when initialization is already completed #13349
Conversation
* Load is much cheaper than fetch_add, at least on x86_64.
* The common path of `doit` can be inlined.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;

loop {
    unsafe {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);
        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

Test executed on a Mac, Intel Core i7 2GHz. Result is 700ps per iteration with the patch applied, and 17000ps per iteration without it.
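In today's Rust syntax, the shape of the optimization looks roughly like this. This is a minimal sketch, not the libstd implementation of the time: the field names, the `new` constructor, and the mutex-based slow path are all illustrative.

```
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

pub struct Once {
    state: AtomicIsize, // becomes -1 once initialization has completed
    lock: Mutex<()>,    // serializes the slow path
}

impl Once {
    pub const fn new() -> Once {
        Once { state: AtomicIsize::new(0), lock: Mutex::new(()) }
    }

    pub fn doit<F: FnOnce()>(&self, f: F) {
        // Optimistic fast path: a plain load is much cheaper than a
        // read-modify-write operation like fetch_add.
        if self.state.load(Ordering::SeqCst) < 0 {
            return;
        }
        self.doit_slow(f);
    }

    fn doit_slow<F: FnOnce()>(&self, f: F) {
        let _guard = self.lock.lock().unwrap();
        // Re-check under the lock: another thread may have finished
        // initialization while we were waiting.
        if self.state.load(Ordering::SeqCst) >= 0 {
            f();
            self.state.store(-1, Ordering::SeqCst);
        }
    }
}
```

The re-check under the lock is what lets the fast path stay a bare load without ever letting two threads run `f`.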
I think you mean µs, not ps.
I'm also worried about correctness. Is a relaxed load here sufficient?
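For concreteness, the pairing the fast path needs, sketched with hypothetical names (`DATA` and `DONE` are illustrative, not from the patch): the initializer's writes must be published by a release store so that an acquire load on the fast path is guaranteed to see them; a fully relaxed load gives no such guarantee.

```
use std::sync::atomic::{AtomicBool, Ordering};

static mut DATA: u64 = 0;                         // written once by the initializer
static DONE: AtomicBool = AtomicBool::new(false);

fn init_once() {
    unsafe { DATA = 42; }                         // the initializer's writes...
    DONE.store(true, Ordering::Release);          // ...published by a Release store
}

fn fast_path() -> Option<u64> {
    // Acquire pairs with the Release store above, making the write to
    // DATA visible. With Ordering::Relaxed, reading DATA here would be
    // a data race.
    if DONE.load(Ordering::Acquire) {
        Some(unsafe { DATA })
    } else {
        None
    }
}
```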
A different change that's more likely to be safe is changing the …
BTW, your benchmarks are off. If you compile your version of `Once` in the same crate as the benchmark, the optimizer can cheat. In my tests, using a separate crate (and no LTO) to avoid the unwanted optimization, on a 3.4GHz Intel Core i7 iMac, I get approximately 13200µs for the libstd version, 3100µs for your version, and 9100µs for the version with my proposed conservative change.
I did mean ps. One call to `doit` takes less than a nanosecond once initialization has completed, so picoseconds are the natural unit.
I'm no expert. I did more thinking, and I think that "relaxed" guarantees that …
@stepancheg Oops, you're right. I misread the timing code. And actually, thinking about it, µs is obviously wrong because that would be too slow.
@kballard what is this "unwanted optimization", and why would you turn off LTO?
@stepancheg You're trying to time a delicate threading thing. In my tests, when the `Once` was compiled in the same crate as the benchmark, the compiler optimized in ways that skewed the numbers. And turning off LTO is because LTO could enable it to start making those unwanted optimizations again. Taking optimization to its logical conclusion, if rustc could figure out not only that …
@kballard that's an interesting observation. Probably, …
I've come up with a slightly different algorithm for `Once`. …
@kballard OK, waiting for your PR.
BTW, I've uploaded a new patch where `doit` is split into `doit` and `doit_slow`.
Filed as #13351.
@kballard I'm not sure that your version is correct, but it is definitely easier to understand.
Ping. This is just waiting on a decision vs. #13351, right?
I suppose I mentioned this on #13351 but not this one: I have yet to be convinced that the `doit`/`doit_slow` split is worth it.
@alexcrichton …
I do not disagree that …
@alexcrichton well, for now I can simplify this patch by merging `doit` and `doit_slow`.
What is the downside of keeping `doit` and `doit_slow` split?
I view it as a premature optimization which hinders understanding what's going on. As I mentioned on the other PR, the optimization applies to many situations, none of which are doing this today. Additionally, I don't remember seeing concrete data which shows definitively that a split version is better than a unified version.
Closing due to inactivity, but feel free to reopen with the unification of `doit` and `doit_slow`.
@alexcrichton I just tested, and the split actually helps. Specifically, in my bench harness, if I call `doit` in a tight loop, the unified version is roughly 6x slower than the split one.

I'm willing to resubmit my own version (#13351) without the split, but it seems like a flagrantly unnecessary 6x slowdown for no upside that I can see (you like to claim that it's more readable without the split, but that's entirely subjective and I do not agree).
On my machine, a call+ret instruction pair takes about 2ns. If you're invoking this in a tight loop, then it may make sense to split the two, but this is not the kind of function used in a tight loop. Let's keep it as one function.
@alexcrichton No, but it's the kind of function that should be usable indiscriminately in other methods, and those methods may be called in a tight loop. I still don't understand why you want to keep it joined, though. You haven't provided any rationale at all aside from what amounts to 100% subjective personal preference.
The entire purpose of …
@alexcrichton I was sure that it is exactly a function to be used in a tight loop. I use it in a scenario like this:
…
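A sketch of that kind of scenario, assuming today's `std::sync::Once` (whose `call_once` is the successor of `doit`); the names and the workload are illustrative:

```
use std::sync::Once;

static INIT: Once = Once::new();
static mut EXPENSIVE: u64 = 0;

fn get_expensive() -> u64 {
    unsafe {
        // After the first call this should cost little more than a load,
        // which is exactly why it ends up on hot paths.
        INIT.call_once(|| {
            EXPENSIVE = (1..=20u64).product(); // stand-in for real work
        });
        EXPENSIVE
    }
}

fn main() {
    let mut sum = 0u64;
    // The tight loop in question: every iteration goes through call_once.
    for _ in 0..1_000_000 {
        sum = sum.wrapping_add(get_expensive());
    }
    println!("{sum}");
}
```

If every call pays a `fetch_add` (or a call into a non-inlined function), every iteration of such a loop pays that cost even though initialization happened long ago.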
@alexcrichton Here's a perfect example of needing the fast path in a loop. Not only that, but even if you're not calling it in a loop, …
Submitting PR again, because I cannot reopen #13349, and github does not attach new patch to that PR.

=======

Optimize `Once::doit`: perform an optimistic check that initialization is already completed. `load` is much cheaper than `fetch_add`, at least on x86_64.

Verified with this test:

```
static mut o: one::Once = one::ONCE_INIT;

unsafe {
    loop {
        let start = time::precise_time_ns();
        let iters = 50000000u64;
        for _ in range(0, iters) {
            o.doit(|| { println!("once!"); });
        }
        let end = time::precise_time_ns();
        let ps_per_iter = 1000 * (end - start) / iters;
        println!("{} ps per iter", ps_per_iter);
        // confuse the optimizer
        o.doit(|| { println!("once!"); });
    }
}
```

Test executed on a Mac, Intel Core i7 2GHz. Result is:

* 20ns per iteration without the patch
* 4ns per iteration with this patch applied

`Once::doit` could be even faster (800ps per iteration) if the `doit` function were split into a pair of `doit`/`doit_slow`, with `doit` marked `#[inline]`, like this:

```
#[inline(always)]
pub fn doit(&self, f: ||) {
    if self.cnt.load(atomics::SeqCst) < 0 {
        return
    }

    self.doit_slow(f);
}

fn doit_slow(&self, f: ||) { ... }
```