Strange performance drops with const literals in closures #68632
Comments
OT hint - if you append a language name after the triple backticks, you get syntax highlighting.
Thanks, I was wondering what's wrong with the syntax highlighting 👍
I also just checked it inside WSL with stable and nightly x86_64-unknown-linux-gnu, and the results were a 33% performance difference on my machine.
It's not the optimizer: https://godbolt.org/z/LcLYaw You can try disabling incremental compilation and reducing codegen-units to 1 to see if you can still reproduce it.
Weird indeed: if I disable incremental and set codegen-units to 1, both versions run at the slower time, but if I leave the values at their defaults, the one with let runs 20% faster.
If I comment out the profile settings in Cargo.toml, the output is:
So the weird thing is that I found codegen-units=16 actually produces more effective code for the bench than codegen-units=1 (for the one with the let binding). BTW, P.S. it may also be a bug in criterion; I didn't double-check the runtime with a clock :)
It also might not be a bug at all; benchmarking such things is incredibly tricky, and you might get hit by code alignment issues or just sheer bad luck. When I benchmark the first function that you posted multiple times in a row, I sometimes get around 10% differences (by running the SAME code). And as @mati685 posted, it seems to generate exactly the same assembly, as it should. Really the only way to check what happens is for you to disassemble the two binaries produced on your system and check that the assembly is indeed the same. If it is, it's not a rustc issue :-) Try compiling with
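One classic way such microbenchmarks mislead is the optimizer constant-folding or hoisting the measured work away entirely. A minimal sketch of guarding against that with `std::hint::black_box` (my own example, not code from this thread; `sum_squares` is a hypothetical workload):

```rust
use std::hint::black_box;
use std::time::Instant;

// Hypothetical workload: sum of squares of 0..n.
fn sum_squares(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum()
}

fn main() {
    let start = Instant::now();
    let mut result = 0u64;
    for _ in 0..100 {
        // black_box prevents the optimizer from folding the call to a
        // constant or hoisting it out of the loop.
        result = black_box(sum_squares(black_box(1_000)));
    }
    println!("result = {}, elapsed = {:?}", result, start.elapsed());
}
```

Even with this, run-to-run noise of a few percent is normal, which is why disassembling is the only conclusive check.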
Btw, exactly the same issue happens if I throw away criterion and use the built-in test::Bencher (by the same I mean it responds to the codegen-units setting).
Well, it's past midnight in my TZ, so I'll dig into it tomorrow. P.S.
So as a last thing today, I made a simple binary crate with the following main.rs:

```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let zro: i32 = 0; // substitute let with const for slow version and vice versa
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 1000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}
```

Then I compiled two versions, with let and without, and ran:
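Independent of the timing question, it's worth confirming the function actually computes what it claims; a quick sanity check of the snippet above against the classic Kadane example (my own test values, not from the thread):

```rust
// Same algorithm as in the thread: scan tracks (start index, running sum),
// resetting whenever the running sum drops to zero or below.
pub fn max_subarray(arr: &[i32]) -> (usize, usize, i32) {
    let zro: i32 = 0;
    let prefixes = arr.iter().enumerate().scan((0, 0), |s, (i, v)| {
        if s.1 > zro {
            s.1 = s.1 + *v;
        } else {
            *s = (i, *v);
        }
        Some(*s)
    });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    // Best subarray is [4, -1, 2, 1] (indices 3..7) with sum 6.
    let v = [-2, 1, -3, 4, -1, 2, 1, -5, 4];
    assert_eq!(max_subarray(&v), (3, 7, 6));
    println!("ok");
}
```

If both the let and const variants pass the same check, any timing difference is purely a codegen matter.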
And for the variant with let:
As you can see, the difference is significant and reproducible all the way.
Good news: I found that with debug=true, opt-level=3, and overflow-checks=false the issue is reproducible in my last code. If overflow-checks=true, the issue isn't reproducible. So I'm glad that I'll be able to disassemble with debug info tomorrow. Edit: I was shocked that codegen-units affects it.
For codegen-units=1:
A 10x performance drop!
I tried it too on my Linux notebook. If I use With the constant it tests directly against zero; with the variable it compares the value with a register which holds the variable (this is unlikely to cause the difference, but it probably triggered different codegen for something else). However, since it produces equivalent code with
@Kobzol try this snippet of main.rs:

```rust
pub fn max_subarray_bad(arr: &[i32]) -> (usize, usize, i32) {
    let zro: i32 = 0;
    let prefixes = arr
        .iter()
        .enumerate()
        .scan((0, 0), |s, (i, v)| {
            if s.1 > zro {
                s.1 = s.1 + *v;
            } else {
                *s = (i, *v);
            }
            Some(*s)
        });
    let (right_idx, (left_idx, sum)) = prefixes
        .enumerate()
        .max_by_key(|&(_, (_, sum))| sum)
        .unwrap();
    (left_idx, right_idx + 1, sum)
}

fn main() {
    const N: usize = 1000000;
    let v = vec![0; N];
    const LOOPS: usize = 2000;
    let mut total_sum = 0;
    for _ in 0..LOOPS {
        let (_, _, sum) = max_subarray_bad(&v);
        total_sum += sum;
    }
    println!("{}", total_sum);
}
```

With:
and then set
Run the binary with the time command and you'll see how drastically the execution time changes without editing the source code. That's clearly a bug.
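If the external time command isn't convenient (e.g. on Windows), the same comparison can be sketched in-process with `std::time::Instant` (my own helper, not code from the thread; `time_it` is a hypothetical name):

```rust
use std::time::{Duration, Instant};

// Hypothetical helper: runs `f` for `iters` iterations and returns the
// total elapsed wall-clock time.
fn time_it<F: FnMut()>(iters: usize, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed()
}

fn main() {
    let v = vec![1i64; 1_000_000];
    // Accumulate into a sink so the work can't be optimized away.
    let mut sink = 0i64;
    let elapsed = time_it(10, || {
        sink += v.iter().sum::<i64>();
    });
    println!("sink = {}, took {:?}", sink, elapsed);
}
```

Timing both compiled variants inside the same harness removes process-startup noise from the comparison.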
Well, yes: when you change the number of codegen units, you are affecting the optimizer (more units = more parallelization, but possibly worse code). So if you increase the unit count (or use the default), the generated code can be slower; that is to be expected. For maximum performance you should use
@Kobzol codegen-units=16 results in code that runs in 2-3 seconds; codegen-units=1 results in code that runs in 20+ seconds. How can this be maximum performance?
On my system I get the following numbers (in
Ok, so what about
Or can no guarantees be asserted for debug=true performance?
If you just want to get debug information in release, use:

```toml
[profile.release]
debug = true
codegen-units = 1
```

Can you fill in and send the table I sent, measured on your computer, in
I can, but it's not worth the effort; I just checked, and my results have the same ratios as yours, just a different magnitude. So you're right, it's not a bug at all; it's just me as a newbie that didn't know about
I would leave it open for now - I'm not from the rustc team, I just wanted to find out what causes the difference so that we have a better idea of the real problem. Maybe some of the team members might see it differently and indeed consider this a bug :)
Ha, in any case thank you; I learned some new things about Rust, which is good, at least for me :) Well, I'll agree with you and let the rustc team judge what to do with this issue further.
Today I realized I was sidetracked yesterday and missed a few important things. I returned to the setup repo for this issue, and some things still bother me.
And for
But things start working as intended if I try to isolate things in a release binary. So the question for today is whether bench isn't as performance-reliable as intended, or whether these things can happen in a best-optimized release too. P.S. @Kobzol what do you think about the inverse performance dependency on codegen-units for bench?
So on my PC codegen units here don't really affect anything for the What I think happens is that when the code was in the binary, with What I don't understand though is why the code is the same on godbolt but different on my PC (using the same compiler version and hopefully the same flags). If I put
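One way to make such comparisons less sensitive to the call site and to codegen-unit partitioning is to pin the function out-of-line with `#[inline(never)]` (a sketch of that approach with a made-up function, not code from the thread):

```rust
// Forcing the function to stay out-of-line makes the code generated for it
// independent of any particular caller, which helps when diffing the
// disassembly of two builds.
#[inline(never)]
fn sum_positive(arr: &[i32]) -> i64 {
    arr.iter().filter(|&&x| x > 0).map(|&x| x as i64).sum()
}

fn main() {
    let v = [3, -1, 4, -1, 5];
    println!("{}", sum_positive(&v));
}
```

If the slowdown disappears with the attribute applied, that points at inlining decisions rather than the body of the function itself.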
This is quite unexpected since more codegen units often lead to worse assembly: #47745
Godbolt is using
Yes, I think you're right. I also saw a forum post a while ago where someone was surprised that with
I created a workspace setup where the binary depends on the library (not sure what kind of linkage actually happens there, but it's the default one for workspaces). So the results are similar to
Triage: There was a "maybe we should close this issue?" discussion, and I think we should go ahead and close this now. Especially #68632 (comment) indicates we can close this. Keeping it open is unlikely to lead to anything.
I asked about it on users.rust-lang.org and one user suggested that it may indeed be a bug in the optimizer. So I should state that the things below are reproducible in both nightly and stable latest Rust, on the x86_64-pc-windows-msvc triple and on x86_64-unknown-linux-gnu (tested inside WSL).
So our test subject will be the following snippet, which solves the Max subarray problem.
If we benchmark it with the criterion crate with this benchmark code:
Then the output of `cargo bench` on my machine would be
But the slight change of moving the 0 in the expression `s.1 > 0` out into a let binding outside of the closure can make a great difference. So the function is now this:
But the `cargo bench` output indicates an almost 20% performance gain! You can check that changing the function back and forth, replacing 0 and zro in that expression, indeed results in a 20% performance change.
By the way, if we change `let zro = 0` into `const zro: i32 = 0`, it results in a performance drop too.
It looks like a bug in the optimizer to me. Could someone verify it?
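For reference, the two spellings under discussion differ only in how the zero is bound; a minimal sketch (my own reduction of the snippet above, using fold instead of scan) confirming both closures compute the same result:

```rust
// Zero bound as a captured local variable.
fn scan_with_let(arr: &[i32]) -> i32 {
    let zro: i32 = 0;
    arr.iter().fold(0, |acc, &v| if acc > zro { acc + v } else { v })
}

// Zero bound as a const item: not captured, inlined into the closure body.
fn scan_with_const(arr: &[i32]) -> i32 {
    const ZRO: i32 = 0;
    arr.iter().fold(0, |acc, &v| if acc > ZRO { acc + v } else { v })
}

fn main() {
    let v = [-2, 1, -3, 4, -1, 2, 1, -5, 4];
    assert_eq!(scan_with_let(&v), scan_with_const(&v));
    println!("both variants agree");
}
```

Semantically the variants are identical, so any timing gap between them comes down to codegen, not the program's meaning.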