Special-case deriving PartialOrd for enums with dataless variants #103659
Conversation
(rustbot has picked a reviewer for you, use r? to override)

@bors try @rust-timer queue

Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf

⌛ Trying commit 4440b9954cd728433ce6232af319c79393984e9a with merge 7d4ed6a1e86edcf57e09c298a239018705744284...

☀️ Try build successful - checks-actions

Queued 7d4ed6a1e86edcf57e09c298a239018705744284 with parent 0da281b, future comparison URL.
Can't this fairly easily (at the risk of uglier code) be generalized to enums with one dataful variant and an arbitrary number of dataless variants? The rest of the variants would be accounted for by comparing the tags just as you've done.
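To make the suggestion concrete, here is a hedged sketch of what that generalized expansion might look like. The enum `E` and the `tag` helper are hypothetical illustrations, not code from this PR; the real derive would use the `discriminant_value` intrinsic rather than a hand-written tag function.

```rust
use std::cmp::Ordering;

// Hypothetical enum: two dataless variants plus one dataful variant.
#[derive(PartialEq)]
enum E {
    A,
    B,
    C(u32),
}

// Stand-in for the discriminant; the actual derive reads the tag
// directly via an intrinsic instead of matching.
fn tag(e: &E) -> u8 {
    match e {
        E::A => 0,
        E::B => 1,
        E::C(_) => 2,
    }
}

impl PartialOrd for E {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match (self, other) {
            // Only the single dataful variant needs a field comparison...
            (E::C(l), E::C(r)) => l.partial_cmp(r),
            // ...every other pairing is decided by the tags alone.
            _ => tag(self).partial_cmp(&tag(other)),
        }
    }
}
```

Because only one variant carries data, the match has exactly one field-comparison arm, and the catch-all tag comparison covers all dataless pairings at once.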
Finished benchmarking commit (7d4ed6a1e86edcf57e09c298a239018705744284): comparison URL. Overall result: ✅ improvements - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
On further testing, I'm actually able to get better codegen by applying this to all enums with any dataless variant (Old, New - output of ...)
Perf run requested in Discord. @bors try @rust-timer queue

Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf

⌛ Trying commit 4c1bc43887286d919814c24ec5d7048351fcf78a with merge f58bb57191d1311a4ed7e7a0a77b057d32c27550...

☀️ Try build successful - checks-actions

Queued f58bb57191d1311a4ed7e7a0a77b057d32c27550 with parent a9ef100, future comparison URL.
Finished benchmarking commit (f58bb57191d1311a4ed7e7a0a77b057d32c27550): comparison URL. Overall result: ✅ improvements - no action needed.

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf. @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This benchmark run did not return any relevant results for this metric.
Cycles: This benchmark run did not return any relevant results for this metric.
The title was changed from "Special-case deriving PartialOrd for two-variant enums" to "Special-case deriving PartialOrd for enums with dataless variants".
sorry for not getting around to reviewing this, gonna re-roll r? compiler
Unfortunately, I don't know what the ideal cut-over point may be, but the characteristics of the two approaches will always have different tradeoffs depending on what data is passed in. Are these just different variants? In that case the tag check first should ~always be faster. Are the data in the same variant different? A match may be faster then, since the tag check is always going to pass anyway. No padding? This is all going to be further complicated by LLVM having its own heuristics for what kind of algorithm is appropriate for a switch – a table, a jump tree or something else. With that in mind, a demonstration that the improvement is ~consistent over a broader set of benchmarks would help.
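The two shapes being weighed against each other can be sketched side by side. Everything below is a hedged illustration with a hypothetical `Pair` enum and a hand-written `tag` helper, not the derive's actual output:

```rust
use std::cmp::Ordering;

#[derive(PartialEq)]
enum Pair {
    A(u32),
    B(u32),
}

// Stand-in for the discriminant read.
fn tag(p: &Pair) -> u8 {
    match p {
        Pair::A(_) => 0,
        Pair::B(_) => 1,
    }
}

// Match-first: favorable when both sides usually hold the same variant,
// since control flows straight to the field comparison.
fn cmp_match(l: &Pair, r: &Pair) -> Option<Ordering> {
    match (l, r) {
        (Pair::A(a), Pair::A(b)) => a.partial_cmp(b),
        (Pair::B(a), Pair::B(b)) => a.partial_cmp(b),
        (Pair::A(_), Pair::B(_)) => Some(Ordering::Less),
        (Pair::B(_), Pair::A(_)) => Some(Ordering::Greater),
    }
}

// Tag-first: favorable when the variants usually differ, since the
// field comparison is skipped entirely.
fn cmp_tag(l: &Pair, r: &Pair) -> Option<Ordering> {
    match tag(l).cmp(&tag(r)) {
        Ordering::Equal => match (l, r) {
            (Pair::A(a), Pair::A(b)) => a.partial_cmp(b),
            (Pair::B(a), Pair::B(b)) => a.partial_cmp(b),
            _ => unreachable!("equal tags imply equal variants"),
        },
        ord => Some(ord),
    }
}
```

Both functions compute the same ordering; the question in this thread is purely which shape optimizes and branch-predicts better for typical inputs.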
I will try some more comprehensive benchmarks so we can get some more data on what gives us the best speed.

@rustbot author
I wrote a script to generate some code and some fairly rough benchmarks, but the numbers look a bit more promising. I tested many permutations, for each one comparing the derived implementation against the manual tag-first one.

Times are in ms, and are given as 0 if the test was not applicable to the type of enum.

(Results table omitted.)
Thanks for preparing those. A couple of caveats: the times are actually nanoseconds, and thus running just 50 iterations of each is probably not going to be enough to gain any confidence in the measurement. I went on to modify the generator a little: to increase the iteration count, and to add some ad-hoc test cases with a large number of variants at once.

Modified generator script:

```python
import string

variant_names = [f"V{i}" for i in range(1024)]

def make_name(pat):
    return ''.join(["F" if x else "E" for x in pat])

def make_enum(pat):
    name = make_name(pat)
    buf = "#[cfg_attr(derived, derive(PartialOrd))]\n"
    buf += "#[derive(PartialEq)]\n"
    buf += f"pub enum {name}" + " {\n"
    for var, x in zip(variant_names, pat):
        buf += '    ' + var
        if x:
            buf += "(usize)"
        buf += ",\n"
    buf += "}\n"
    buf += "#[cfg(not(derived))]\n"
    buf += f"impl PartialOrd for {name} {{\n"
    buf += "    #[inline]\n"
    buf += "    fn partial_cmp(&self, other: &Self) -> Option<core::cmp::Ordering> {\n"
    buf += "        let l_tag = ::core::intrinsics::discriminant_value(self);\n"
    buf += "        let r_tag = ::core::intrinsics::discriminant_value(other);\n"
    buf += "        match (self, other) {\n"
    for var, x in zip(variant_names, pat):
        if x:
            buf += f"            (Self::{var}(l), Self::{var}(r)) => PartialOrd::partial_cmp(l, r),\n"
    buf += "            _ => PartialOrd::partial_cmp(&l_tag, &r_tag),\n"
    buf += "        }\n"
    buf += "    }\n"
    buf += "}\n"
    return buf

out = """#![feature(core_intrinsics, bench_black_box)]
#![allow(non_snake_case, unreachable_patterns)]
use std::hint::black_box;
extern crate core;
"""

# Test cases: Two equal dataless
#             Two different dataless
#             Two equal dataful with same data
#             Two equal dataful with different data
def make_tests(pat):
    name = make_name(pat)
    buf = f"fn test_{name}() -> (u128, u128, u128, u128)" + " {\n"
    buf += "    let one = {"
    dataless = [name for name, x in zip(variant_names, pat) if not x]
    dataful = [name for name, x in zip(variant_names, pat) if x]
    if not dataless:
        buf += "0\n"
    else:
        variant = dataless[0]
        buf += f"let (l, r) = ({name}::{variant}, {name}::{variant});" + '\n'
        buf += "let now = std::time::Instant::now();\n"
        buf += "for _ in 0..500000 {black_box(PartialOrd::partial_cmp(black_box(&l), black_box(&r)));}\n"
        buf += "now.elapsed().as_nanos()\n"
    buf += "    };\n"
    buf += "    let two = {"
    if len(dataless) < 2:
        buf += "0\n"
    else:
        var_l, var_r = dataless[:2]
        buf += f"let (l, r) = ({name}::{var_l}, {name}::{var_r});" + '\n'
        buf += "let now = std::time::Instant::now();\n"
        buf += "for _ in 0..500000 {black_box(PartialOrd::partial_cmp(black_box(&l), black_box(&r)));}\n"
        buf += "now.elapsed().as_nanos()\n"
    buf += "    };\n"
    buf += "    let three = {"
    if not dataful:
        buf += "0\n"
    else:
        variant = dataful[0]
        buf += f"let (l, r) = ({name}::{variant}(10), {name}::{variant}(10));" + '\n'
        buf += "let now = std::time::Instant::now();\n"
        buf += "for _ in 0..500000 {black_box(PartialOrd::partial_cmp(black_box(&l), black_box(&r)));}\n"
        buf += "now.elapsed().as_nanos()\n"
    buf += "    };\n"
    buf += "    let four = {"
    if len(dataful) < 2:
        buf += "0\n"
    else:
        var_l, var_r = dataful[:2]
        buf += f"let (l, r) = ({name}::{var_l}(10), {name}::{var_r}(10));" + '\n'
        buf += "let now = std::time::Instant::now();\n"
        buf += "for _ in 0..500000 {black_box(PartialOrd::partial_cmp(black_box(&l), black_box(&r)));}\n"
        buf += "now.elapsed().as_nanos()\n"
    buf += "    };\n"
    buf += "    (one, two, three, four)\n"
    buf += "}\n"
    return buf

names = []
for i in range(50):
    pat = [int(x) for x in f"{i:b}"]
    names.append(make_name(pat))
    out += make_enum(pat) + '\n'
    out += make_tests(pat)

for pat in [0b10101010101010101010101, 1 << 250 | 1 << 100, (1 << 500) - 1, ((1 << 500) - 1) ^ (1 << 100) ^ (1 << 200)]:
    pat = [int(x) for x in f"{pat:b}"]
    name = make_name(pat)
    names.append(name)
    out += make_enum(pat) + '\n'
    out += make_tests(pat)

out += "fn main() {\n"
out += 'println!("name,equal_dataless,inequal_dataless,equal_dataful,inequal_dataful,total_ns");\n'
for name in names:
    out += f"let (a, b, c, d) = test_{name}();\n"
    out += f'println!("{name},{{a}},{{b}},{{c}},{{d}},{{}}", a+b+c+d);\n'
out += "}\n"

print(out)
```

The results that I got on my somewhat noisy server can be seen in this spreadsheet (sorry; non-free software T_T).

Looking at that, my conclusion is roughly that an actual difference, at least on my machine, is rare, but where there is a difference, it is roughly positive. (I also checked the results without optimizations, and they are similarly largely neutral.) I'm still worried about the fact that these benchmarks are the best-case scenario for the branch predictor, but I don't think it's going to be easy to write a benchmark that exercises it in a meaningfully different way. With that in mind, I'm quite happy to merge the algorithm as benchmarked.
Force-pushed from 515ec74 to a8fb1e1 (Compare)
Squashed and rebased onto master, as well as fixing tests. @rustbot ready
I had these queued since my last review but for some reason never got to submitting them…
Force-pushed from a8fb1e1 to 750ea4f (Compare)
Applied the formatting suggestions, as well as adding a link back to the benchmarks for future reference.
Force-pushed from 750ea4f to 2883148 (Compare)
@bors r+

Sorry for taking a long time to get back to this!

☀️ Test successful - checks-actions

Finished benchmarking commit (9f82651): comparison URL. Overall result: ✅ improvements - no action needed. @rustbot label: -perf-regression

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
@clubby789 I see you've made multiple improvements to deriving code. I did a bunch of work last year on that code and would be happy to be CC'd on any future changes you make. Thanks!
I was able to get slightly better codegen by flipping the derived PartialOrd logic for two-variant enums. I also tried to document the implementation of the derive macro to make the special-case logic a little clearer.

Godbolt: Current, New

I'm not sure how common a case comparing two enums like this (such as Option) is, and whether it's worth the slowdown of adding a special case to the derive. If it causes overall regressions, it might be worth just manually implementing this for Option.
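Whatever shape the generated code takes, it has to preserve the derived ordering contract for Option-like enums: variants compare by declaration order, so the dataless variant sorts before the dataful one regardless of payload. A small sketch with a hypothetical `Maybe` type (not the PR's generated code) makes the contract explicit:

```rust
use std::cmp::Ordering;

// Hypothetical Option-like two-variant enum. The derived PartialOrd
// orders variants by declaration order, so Nothing < Just(_) always.
#[derive(PartialEq, PartialOrd)]
enum Maybe {
    Nothing,
    Just(i32),
}

// Any rewritten derive must agree with plain Option on the same shape.
fn check() {
    assert_eq!(
        Maybe::Nothing.partial_cmp(&Maybe::Just(0)),
        Some(Ordering::Less)
    );
    assert_eq!(None::<i32>.partial_cmp(&Some(0)), Some(Ordering::Less));
    assert_eq!(
        Maybe::Just(1).partial_cmp(&Maybe::Just(2)),
        Some(Ordering::Less)
    );
}
```

Note that `Nothing` sorts below `Just(n)` even for negative `n`: the discriminant decides first, and the payload is only consulted when both sides hold the dataful variant.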