Optimize core::ptr::align_offset #68616
Conversation
(rust_highfive has picked a reviewer for you, use r? to override)
cc @rkruppe
I have no idea what this API is nor any idea what the implementation is doing. @nagisa I believe you wrote the original implementation, would you be able to review?
r? @nagisa, yes I’m effectively the owner of this code. @amosonn please include some benchmarks proving the improvement. This PR adds a couple of … Here are some latencies/reciprocal throughputs of …
And then there’s this kind of lowering on architectures that don’t have a native instruction for this op:
addiu $1, $4, -1
not $2, $4
and $1, $2, $1
srl $2, $1, 1
lui $3, 21845
ori $3, $3, 21845
and $2, $2, $3
subu $1, $1, $2
lui $2, 13107
ori $2, $2, 13107
and $3, $1, $2
srl $1, $1, 2
lui $4, 3855
and $1, $1, $2
ori $2, $4, 3855
addu $1, $3, $1
srl $3, $1, 4
addu $1, $1, $3
and $1, $1, $2
sll $2, $1, 8
addu $2, $2, $1
sll $3, $1, 16
addu $2, $3, $2
sll $1, $1, 24
addu $1, $1, $2
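For reference, the expansion above is the usual software fallback for counting trailing zeros: build the mask of bits below the lowest set bit, `(x - 1) & !x`, and take its population count with the SWAR bit-trick. A Rust rendering of that reading (my own illustration based on the assembly, not code from this PR):

```rust
// Count trailing zeros as popcount((x - 1) & !x), mirroring the MIPS lowering above.
fn ctz_via_popcount(x: u32) -> u32 {
    // Mask of the bits strictly below the lowest set bit (all ones when x == 0).
    let mut v = x.wrapping_sub(1) & !x;
    // Classic 32-bit SWAR population count.
    v = v - ((v >> 1) & 0x5555_5555);
    v = (v & 0x3333_3333) + ((v >> 2) & 0x3333_3333);
    v = (v + (v >> 4)) & 0x0f0f_0f0f;
    v.wrapping_mul(0x0101_0101) >> 24
}

fn main() {
    for &x in &[1u32, 2, 3, 8, 48, 0x8000_0000, 0] {
        assert_eq!(ctz_via_popcount(x), x.trailing_zeros());
    }
}
```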
@nagisa I'm not exactly sure how to write and add benchmarks, can you point me to some relevant code I could model? I was also unsure about these instructions; however, even if the decision is to avoid them, the first commit is clearly just optimizing, as it saves an AND operation on each iteration, and perhaps shaves off one (or more) entire iterations, without any introduced cost. Now that I think about it, I can consolidate the …
src/libcore/ptr/mod.rs
Outdated
/// Multiplicative modular inverse table modulo 2⁴ = 16.
///
/// Note, that this table does not contain values where inverse does not exist (i.e., for
/// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
/// Modulo for which the `INV_TABLE_MOD_16` is intended.
const INV_TABLE_MOD: usize = 16;
/// INV_TABLE_MOD²
const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;
/// $s$ such that INV_TABLE_MOD = $2^s$.
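As a quick sanity check of the table quoted above (my own test, not part of the PR): entry `i` is the multiplicative inverse of the odd number `2*i + 1` modulo 16.

```rust
fn main() {
    const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
    for (i, &inv) in INV_TABLE_MOD_16.iter().enumerate() {
        let odd = 2 * i + 1;
        // e.g. 3 * 11 = 33 ≡ 1 (mod 16), 5 * 13 = 65 ≡ 1 (mod 16), ...
        assert_eq!((odd * inv as usize) % 16, 1, "{odd}^-1 mod 16 should be {inv}");
    }
}
```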
I prefer using normal backticks to document code. LaTeX seems nice and expressive, but it doesn't look good in plain code.
/// $s$ such that INV_TABLE_MOD = $2^s$.
/// `s` such that `INV_TABLE_MOD = 2^s`. |
Again, this loop is
You can read about benchmarks a little here: https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html. You could also use the https://docs.rs/criterion/0.3.1/criterion/ library to implement the benchmark. What I am expecting here is a stand-alone bench-test comparing the new and old code with the same kinds of inputs. Copy the implementations of these functions as necessary to make a stand-alone benchmark so you don’t need to depend on the implementation inside libstd.
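A minimal criterion skeleton along those lines might look as follows (my own sketch: it benchmarks the stable `<*const T>::align_offset` method; for the comparison nagisa asks for, the body of the closure would instead call copies of the old and new implementations, and the file would live in `benches/` with `harness = false` in `Cargo.toml`):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_align_offset(c: &mut Criterion) {
    let buf = [0u8; 1024];
    c.bench_function("align_offset u16 -> 64", |b| {
        b.iter(|| {
            // black_box keeps LLVM from constant-folding the pointer and alignment away.
            let p = black_box(buf.as_ptr().wrapping_add(1)) as *const u16;
            black_box(p.align_offset(black_box(64)))
        })
    });
}

criterion_group!(benches, bench_align_offset);
criterion_main!(benches);
```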
I agree, the removal of … Anyway, I'll try to write a benchmark, and we'll see what happens.
Here is a repo for running the benchmark: … A print of a quick run on my machine:
Note that I ran this benchmark starting after my first commit to the repo. With the intrinsic for … The main effect in this optimization is clearly the reduced iterations (2nd commit), as seen from the jumps between where the old impl has one more iteration (e.g. 128, 2048 or 131072) to where they have the same number of iterations (e.g. 256, 4096, 1048576). However, even in those last ones, where they have the same number of iterations, the saving of one multiplication each iteration (3rd commit) is also substantial.
Edit: all of the strange results were due to …
Force-pushed from faa82f3 to fef6176.
Update on the last commit:
So, the currently included 4th commit here is the one which has the best performance (at least on my machine), but it isn't clear to me why. @nagisa, do you think we should include it anyway?
@amosonn I’ll see about running these benchmarks on more varied machines and architectures (mips? ARM? PPC? that kind of stuff) over the weekend and we’ll see. FWIW I took a brief look at the benchmark code and I think it is lacking one critical test case – when the alignment is constant[¹]. Constant alignment is the most common use-case in practice and is also what improves the runtime properties of this function the most. With it constant, the compiler optimises out the majority of the code in this function. [¹]: I suspect that criterion will be helpful and will blackbox the argument that you are varying right now.
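To make the distinction concrete, here is a small sketch of my own (using today's `std::hint::black_box` merely for illustration): in the "constant" case the alignment is a value the optimiser can see through, so most of `align_offset` folds away at compile time; in the "dynamic" case it is hidden behind `black_box`, so the generic code path has to run.

```rust
use std::hint::black_box;

fn main() {
    let buf = [0u8; 256];
    let p = buf.as_ptr().wrapping_add(3); // *const u8, stride 1

    // Constant alignment: align_of::<u64>() is the literal 8 as far as LLVM is concerned.
    let off_const = p.align_offset(std::mem::align_of::<u64>());

    // Dynamic alignment: the value is opaque, so the full implementation is exercised.
    let off_dyn = p.align_offset(black_box(8));

    assert_eq!(off_const, off_dyn);
}
```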
Final benches:
@nagisa I'm not sure I understand what you mean by constant alignment. Each bench run already checks only one specific alignment. Or did you mean going over varying offsets with the same alignment? As for the choice of struct size and "ptr address", I chose the ones which highlight the optimization benefit of reducing the modulo (2nd commit). All the other optimizations should be just as present with other choices, though. Edit: Ok, I understood now. I will add this to the code, and also attempt to keep the stride constant and see what happens there (it seems an even more standard use-case, as this is type-parametrized). Currently I checked with constant alignment: the difference between v0 and v1 is still huge (probably …
Thank you for preparing the benchmarks. I had to adjust them a little to ensure LLVM doesn’t just call the same code regardless of whether the arguments are constant or not. In order to save my time I also reduced the number of test cases somewhat to the more interesting ones and instead added a corner case that #[packed] structures would exercise (p = 3, stride = 5).
Here are some of the results I could gather for more platforms and they do indeed look great.
Please fix up the comments and other nits I pointed out inline and we can merge this. Thanks for working on this!
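One way to hit the (p = 3, stride = 5) corner case mentioned above is with a `#[repr(packed)]` struct of size 5 (my own construction, not nagisa's adjusted benchmark):

```rust
use std::mem;

#[repr(packed)]
#[allow(dead_code)]
struct Packed {
    a: u32,
    b: u8,
}

fn main() {
    // Packed layout: size 5, alignment 1, so the stride between elements is 5 bytes.
    assert_eq!(mem::size_of::<Packed>(), 5);

    let buf = [0u8; 64];
    // A pointer 3 bytes past the start of the buffer, viewed as stride-5 elements.
    let p = buf.as_ptr().wrapping_add(3) as *const Packed;
    // Requesting 8-byte alignment: gcd(5, 8) = 1, so the congruence always has a solution.
    let o = p.align_offset(8);
    println!("offset in elements: {o}");
}
```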
src/libcore/ptr/mod.rs
Outdated
/// INV_TABLE_MOD²
const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;
const INV_TABLE_MOD: usize = 1 << INV_TABLE_MOD_POW;
/// `s` such that `INV_TABLE_MOD == 2^s`. |
Please describe what this constant below means in relation to `INV_TABLE_MOD_16`. You can remove documentation for other constants, as they effectively end up duplicating their implementation in documentation and are just noise.
Done.
src/libcore/ptr/mod.rs
Outdated
///
/// This implementation is tailored for align_offset and has following preconditions:
///
/// * `m` is a power-of-two;
/// * The requested modulu `m` is a power-of-two, so `mpow` can be an argument;
s/modulu/modulo/
Done.
src/libcore/ptr/mod.rs
Outdated
/// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
///
/// It also leaves reducing the result modulu `m` to the caller, so the result may be larger
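A hedged reading of those preconditions as runnable Rust (the function name mirrors the one under review, but the body is only an illustration, not the PR's implementation): the caller promises the modulo is `1 << mpow` and that `x` is already reduced and odd, and reduces the result itself.

```rust
fn mod_pow_2_inv(x: usize, mpow: usize) -> usize {
    let m = 1usize << mpow;
    debug_assert!(x % 2 == 1, "even numbers have no inverse modulo a power of two");
    debug_assert!(x < m, "caller is expected to pass x % m");
    // Brute force keeps the sketch short; the real code uses the lookup table plus lifting.
    (1..m).find(|&y| (x * y) % m == 1).unwrap()
}

fn main() {
    let (x, mpow) = (7usize, 4); // 7⁻¹ mod 16 == 7
    let inv = mod_pow_2_inv(x, mpow);
    assert_eq!((x * inv) & ((1 << mpow) - 1), 1);
}
```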
s/modulu/modulo/
Done.
src/libcore/ptr/mod.rs
Outdated
// This branch solves for the following linear congruence equation:
//
// $$ p + so ≡ 0 mod a $$
//
// $p$ here is the pointer value, $s$ – stride of `T`, $o$ offset in `T`s, and $a$ – the
// requested alignment.
//
// g = gcd(a, s)
// o = (a - (p mod a))/g * ((s/g)⁻¹ mod a)
// With $g = gcd(a, s)$, and the above asserting that $p$ is also divisible by $g$, we can
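To make the congruence concrete, here is a small worked example of my own, using the p = 3, stride = 5, align = 8 corner case from earlier in the thread (illustration only, not the libcore code); the modular inverse is brute-forced since gcd(a, s) = 1:

```rust
fn mod_inv(x: usize, m: usize) -> usize {
    (1..m).find(|&y| (x * y) % m == 1).expect("x must be invertible mod m")
}

fn main() {
    let (p, s, a) = (3usize, 5usize, 8usize);
    // g = gcd(a, s) = 1, so the formula reduces to o = (a - p mod a) * (s⁻¹ mod a) mod a.
    let o = ((a - p % a) * mod_inv(s, a)) % a;
    assert_eq!((p + s * o) % a, 0);
    println!("o = {o}"); // o = 1: 3 + 5*1 = 8, which is 0 mod 8
}
```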
Let's get rid of the `$`s.
Did so in all comments. In the meantime, I replaced all the non-ASCII characters; they broke at least my editor :).
I also split the commits a little finer, so it's easier to see the progression. I guess this will all be squashed later anyway?
src/libcore/ptr/mod.rs
Outdated
} else {
// We iterate "up" using the following formula:
//
// $$ xy ≡ 1 (mod 2ⁿ) → xy (2 - xy) ≡ 1 (mod 2²ⁿ) $$
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^2n) `
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^2n) `
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^(2n)) ` |
Done.
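For what it's worth, here is a self-contained sketch of that doubling step (my own code, only illustrating the identity discussed above, not the libcore implementation): starting from an inverse that is correct modulo 2³ (every odd number is its own inverse there), each `y ← y·(2 − x·y)` step doubles the number of correct low bits, and wrapping arithmetic makes everything implicitly mod 2⁶⁴.

```rust
fn inv_mod_2_64(x: u64) -> u64 {
    debug_assert!(x % 2 == 1);
    let mut y = x; // correct modulo 2^3: x*x ≡ 1 (mod 8) for every odd x
    for _ in 0..5 {
        // Valid low bits per step: 3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64.
        y = y.wrapping_mul(2u64.wrapping_sub(x.wrapping_mul(y)));
    }
    y
}

fn main() {
    for &x in &[1u64, 3, 5, 0xdead_beef, u64::MAX] {
        assert_eq!(x.wrapping_mul(inv_mod_2_64(x)), 1);
    }
}
```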
Thanks for looking at all of this! Regarding your benchmarking: I think it would be interesting to do some more comparisons, at least for a few cases such as v2 vs v4. As previously noted, the difference between v0 and v1 of removing the … Going on, the story becomes different. From v2 to v3, we transform our loop to perform a … This seems surprising at best and brittle at worst, so I would at least like to get more convinced that other platforms behave similarly, which should mean that there is a "reason" why this is the best implementation, even if we don't understand it. If they do not, i.e. v4 performs worse than v2 on some platforms, this means that this part of the optimization is too "magical", and we should probably stick to v2 (currently the first three commits). If you are willing to put more time into running those benchmarks, I can write a comprehensive comparison of the relevant parts. Otherwise, I would suggest merging the first 3 commits for now, and leaving the rest for a later PR. Edit: BTW, FWIW, I think the case of the …
src/libcore/ptr/mod.rs
Outdated
let a2minus1 = a2.wrapping_sub(1);
let s2 = smoda >> gcdpow;
let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
// mod_pow_2_inv returns a result which may be out of $a'$-s range, but it's fine to
Nitpick here and line below
// mod_pow_2_inv returns a result which may be out of $a'$-s range, but it's fine to
// mod_pow_2_inv returns a result which may be out of `a'`-s range, but it's fine to |
Done :)
I did explicitly account for that in my adaptation of the benchmarks (where I made the stride a const generic).
Yeah, let's land the simple and most impactful changes first (the first three commits) and consider the other ones in isolation. I did some assembly spelunking while benchmarks were running (…
This makes me wonder if the complexity in reasoning that … EDIT: I can confirm your own results that there is a significant improvement in observed throughput between …
- Stopping condition inside mod_inv can be >= instead of >
- Remove intrinsics::unchecked_rem, we are working modulo powers-of-2 so we can simply mask
- When calculating the inverse, it's enough to work `mod a/g` instead of `mod a`.
- As explained in the comment inside mod_inv, it is valid to work mod `usize::max_value()` right until the end.
- Instead of squaring the modulo until it is larger than the required one, we double log2 of it. This means a shift instead of mul each iteration.
- Pass mask as arg to mod_pow_2_inv instead of calculating it again.
- Remove redundant masking from mod_pow_2_inv, caller site already takes care of this (after mul): according to benchmarking, the best performance is achieved when this masking is present in the fast branch (for small modulos), but absent from the slow branch.
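The unchecked_rem point in the first commit message is the usual power-of-two identity; a trivial illustration of my own (not the PR diff):

```rust
// For m a power of two, x % m == x & (m - 1), so the remainder intrinsic can be
// replaced by a plain mask.
fn main() {
    let m: usize = 16;
    let mask = m - 1;
    for x in 0..1000usize {
        assert_eq!(x % m, x & mask);
    }
}
```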
Ok, in attempting to trim this down to the first three commits, I got into a fight with GitHub and closed this :) Here's the new PR: #68787
Optimize core::ptr::align_offset (part 1)
r? @nagisa See rust-lang#68616 for the main discussion.
- It is valid to work mod `usize::max_value()` right until the end.
- Instead of squaring the modulo until it is larger than the required one, we calculate the number of required iterations.