
Optimize core::ptr::align_offset #68616

Closed · wants to merge 6 commits

Conversation

amosonn
Contributor

@amosonn amosonn commented Jan 28, 2020

  • As explained in the comment inside mod_inv, it is valid to work mod
    usize::max_value() right until the end.
  • When calculating the inverse, it's enough to work $mod a/g$ instead
    of $mod a$.
  • Instead of squaring the modulo until it is larger than the required
    one, we calculate the number of required iterations (see the sketch
    below).
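
A minimal stand-alone sketch of the lifting idea (a hypothetical illustration, not the exact libcore code; only the inverse table below is the one the implementation actually uses):

const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];

/// Inverse of an odd `x` modulo `2^mpow`, for 1 <= mpow <= usize::BITS.
fn mod_pow_2_inv(x: usize, mpow: u32) -> usize {
    debug_assert!(x % 2 == 1 && mpow >= 1 && mpow <= usize::BITS);
    // The table already gives the inverse mod 2^4.
    let mut inv = INV_TABLE_MOD_16[(x >> 1) & 7] as usize;
    // Each step doubles the number of correct low bits, so the number of
    // iterations is known up front; intermediate values can stay in plain
    // wrapping usize arithmetic until the very end.
    let mut good_bits = 4u32;
    while good_bits < mpow {
        // If x*inv == 1 (mod 2^n), then x*inv*(2 - x*inv) == 1 (mod 2^(2n)).
        inv = inv.wrapping_mul(2usize.wrapping_sub(x.wrapping_mul(inv)));
        good_bits *= 2;
    }
    // Only the final result is reduced mod 2^mpow.
    inv & (usize::MAX >> (usize::BITS - mpow))
}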

@rust-highfive
Contributor

r? @alexcrichton

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.) label Jan 28, 2020
@Centril
Contributor

Centril commented Jan 28, 2020

cc @rkruppe

@rust-highfive
Contributor

The job x86_64-gnu-llvm-7 of your PR failed (pretty log, raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.
2020-01-28T21:55:54.7082886Z ========================== Starting Command Output ===========================
2020-01-28T21:55:54.7084880Z [command]/bin/bash --noprofile --norc /home/vsts/work/_temp/d9b21cd8-feae-4e1f-bada-6990b98c3a4a.sh
2020-01-28T21:55:54.7084906Z 
2020-01-28T21:55:54.7087019Z ##[section]Finishing: Disable git automatic line ending conversion
2020-01-28T21:55:54.7091093Z ##[section]Starting: Checkout rust-lang/rust@refs/pull/68616/merge to s
2020-01-28T21:55:54.7092221Z Task         : Get sources
2020-01-28T21:55:54.7092247Z Description  : Get sources from a repository. Supports Git, TfsVC, and SVN repositories.
2020-01-28T21:55:54.7092305Z Version      : 1.0.0
2020-01-28T21:55:54.7092331Z Author       : Microsoft
---
2020-01-28T21:55:55.5260508Z ##[command]git remote add origin https://github.com/rust-lang/rust
2020-01-28T21:55:55.5270378Z ##[command]git config gc.auto 0
2020-01-28T21:55:55.5272927Z ##[command]git config --get-all http.https://github.com/rust-lang/rust.extraheader
2020-01-28T21:55:55.5274216Z ##[command]git config --get-all http.proxy
2020-01-28T21:55:55.5446480Z ##[command]git -c http.extraheader="AUTHORIZATION: basic ***" fetch --force --tags --prune --progress --no-recurse-submodules --depth=2 origin +refs/heads/*:refs/remotes/origin/* +refs/pull/68616/merge:refs/remotes/pull/68616/merge

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@alexcrichton
Member

I have no idea what this API is nor any idea what the implementation is doing. @nagisa I believe you wrote the original implementation, would you be able to review?

@nagisa
Member

nagisa commented Jan 29, 2020

r? @nagisa, yes I’m effectively the owner of this code.

@amosonn please include some benchmarks proving the improvement. This PR adds a couple of bsf/cttz instructions, which aren’t exactly the cheapest instructions around, especially as you move away from modern designs. And there are two of them, added just to replace a loop that will in practice only iterate once or twice for the common inputs, if it isn’t already const-evaluated. This loop is otherwise log(log(n)) too…

Here are some latencies/reciprocal throughputs of cttz for common modern architectures:

        µops | latency | recip.throughput
ryzen     6  |    3    |        3
haswell   1  |    3    |        1
skylake   1  |    3    |        1
power9    1  |    3    |       0.25
apple A13 2  |  1 + 1  |   0.25 + 0.25

And then there’s this kind of lowering on architectures that don’t have a native instruction for this op:

        addiu   $1, $4, -1
        not     $2, $4
        and     $1, $2, $1
        srl     $2, $1, 1
        lui     $3, 21845
        ori     $3, $3, 21845
        and     $2, $2, $3
        subu    $1, $1, $2
        lui     $2, 13107
        ori     $2, $2, 13107
        and     $3, $1, $2
        srl     $1, $1, 2
        lui     $4, 3855
        and     $1, $1, $2
        ori     $2, $4, 3855
        addu    $1, $3, $1
        srl     $3, $1, 4
        addu    $1, $1, $3
        and     $1, $1, $2
        sll     $2, $1, 8
        addu    $2, $2, $1
        sll     $3, $1, 16
        addu    $2, $3, $2
        sll     $1, $1, 24
        addu    $1, $1, $2

@rust-highfive rust-highfive assigned nagisa and unassigned alexcrichton Jan 29, 2020
@amosonn
Contributor Author

amosonn commented Jan 29, 2020

@nagisa I'm not exactly sure how to write and add benchmarks, can you point me to some relevant code I could model?

I was also unsure about these instructions; however, even if the decision is to avoid them, the first commit is clearly just optimizing, as it saves an AND operation on each iteration, and perhaps shaves one (or more) entire iterations, without any introduced cost.

Now that I think about it, I can consolidate the cttz which happens in the main function with the one in the mod_inv (passing the power instead of the modulo itself), and then lower the squaring to multiplication by 2, which is good enough. I'll draft this later.
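
(For illustration, a hypothetical helper for that consolidation: since `a` is a power of two, `gcd(a, stride)` is also a power of two, and a single cttz of `stride | a` gives its exponent, which can double as the bit-count handed to the inverse routine.)

/// Exponent of gcd(a, stride), assuming `a` is a power of two: one cttz,
/// because trailing_zeros(stride | a) == min(trailing_zeros(stride),
/// trailing_zeros(a)).
fn gcd_pow2_exponent(stride: usize, a: usize) -> u32 {
    debug_assert!(a.is_power_of_two());
    (stride | a).trailing_zeros()
}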

/// Multiplicative modular inverse table modulo 2⁴ = 16.
///
/// Note, that this table does not contain values where inverse does not exist (i.e., for
/// `0⁻¹ mod 16`, `2⁻¹ mod 16`, etc.)
const INV_TABLE_MOD_16: [u8; 8] = [1, 11, 13, 7, 9, 3, 5, 15];
/// Modulo for which the `INV_TABLE_MOD_16` is intended.
const INV_TABLE_MOD: usize = 16;
/// INV_TABLE_MOD²
const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;
/// $s$ such that INV_TABLE_MOD = $2^s$.
Contributor

I prefer using normal backticks to document code. LaTeX seems nice and expressive, but it doesn't look good in plain code.

Suggested change
/// $s$ such that INV_TABLE_MOD = $2^s$.
/// `s` such that `INV_TABLE_MOD = 2^s`.

@nagisa
Member

nagisa commented Jan 29, 2020

the first commit is clearly just optimizing, as it saves an AND operation on each iteration

Again, this loop is O(log log n) and for all interesting values in practice will execute only one iteration (any alignment up to 256) or two (alignments up to 64k), if it isn't already const-evaluated away entirely in the first place. Considering that and is one of the cheapest instructions around, I'm really curious to see the actual impact of this change.
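
To make that concrete (a worked example, not from the thread): the lookup table supplies 4 correct low bits and each squaring step doubles them, so:

4 -> 8 bits    handles any alignment up to 2^8  = 256    (1 iteration)
4 -> 16 bits   handles any alignment up to 2^16 = 64k    (2 iterations)
4 -> 64 bits   handles the full usize range              (at most 4 iterations)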


I'm not exactly sure how to write and add benchmarks, can you point me to some relevant code I could model?

You can read about benchmarks a little here: https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html. You could also use the https://docs.rs/criterion/0.3.1/criterion/ library to implement the benchmark.

What I am expecting here is a stand-alone bench-test comparing the new and old code with same kinds of inputs. Copy the implementations of these functions as necessary to make a stand-alone benchmark so you don’t need to depend on the implementation inside libstd.
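
A minimal criterion harness along those lines might look like this (a hypothetical sketch: align_offset_old / align_offset_new stand in for copies of the current and proposed implementations, with placeholder bodies so the sketch compiles):

use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};

// Stand-alone copies of the two implementations under test would go here;
// these placeholder bodies only keep the sketch compilable.
fn align_offset_old(p: usize, stride: usize, align: usize) -> usize {
    (align.wrapping_sub(p) & (align - 1)) / stride
}
fn align_offset_new(p: usize, stride: usize, align: usize) -> usize {
    (align.wrapping_sub(p) & (align - 1)) / stride
}

fn bench_align_offset(c: &mut Criterion) {
    let mut group = c.benchmark_group("args");
    for &align in &[16usize, 256, 4096, 1 << 20] {
        group.bench_function(BenchmarkId::new("old", align), |b| {
            b.iter(|| align_offset_old(black_box(8), black_box(24), black_box(align)))
        });
        group.bench_function(BenchmarkId::new("new", align), |b| {
            b.iter(|| align_offset_new(black_box(8), black_box(24), black_box(align)))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_align_offset);
criterion_main!(benches);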

@amosonn
Contributor Author

amosonn commented Jan 30, 2020

I agree, the removal of the and is probably not much of a saving. The replacement of the rem intrinsic with an and should be, however; the possible saving of an iteration (which has a few multiplications) as well; and also the lowering of the "counter" arithmetic, which replaces a mul with a shr.

Anyway, I'll try to write a benchmark, and we'll see what happens.

@amosonn
Contributor Author

amosonn commented Jan 30, 2020

Here is a repo for running the benchmark:
https://github.com/amosonn/bench_align_offset

A print of a quick run on my machine:

# (ptr, stride, align)
args/old/(8, 24, 16)      time:   [9.4774 ns 9.4901 ns 9.5044 ns]
args/new/(8, 24, 16)      time:   [7.2251 ns 7.2354 ns 7.2470 ns]
args/old/(8, 24, 32)      time:   [11.833 ns 11.858 ns 11.887 ns]
args/new/(8, 24, 32)      time:   [7.2260 ns 7.2360 ns 7.2477 ns]
args/old/(8, 24, 64)      time:   [11.843 ns 11.881 ns 11.922 ns]
args/new/(8, 24, 64)      time:   [7.2254 ns 7.2451 ns 7.2686 ns]
args/old/(8, 24, 128)     time:   [11.795 ns 11.806 ns 11.818 ns]
args/new/(8, 24, 128)     time:   [7.2463 ns 7.2587 ns 7.2722 ns]
args/old/(8, 24, 256)     time:   [11.812 ns 11.827 ns 11.844 ns]
args/new/(8, 24, 256)     time:   [9.1618 ns 9.1766 ns 9.1940 ns]
args/old/(8, 24, 512)     time:   [14.072 ns 14.091 ns 14.112 ns]
args/new/(8, 24, 512)     time:   [9.1743 ns 9.1964 ns 9.2242 ns]
args/old/(8, 24, 1024)    time:   [14.073 ns 14.089 ns 14.108 ns]
args/new/(8, 24, 1024)    time:   [9.1550 ns 9.1652 ns 9.1768 ns]
args/old/(8, 24, 2048)    time:   [14.074 ns 14.086 ns 14.099 ns]
args/new/(8, 24, 2048)    time:   [9.1643 ns 9.1789 ns 9.1954 ns]
args/old/(8, 24, 4096)    time:   [14.092 ns 14.110 ns 14.131 ns]
args/new/(8, 24, 4096)    time:   [11.289 ns 11.309 ns 11.333 ns]
args/old/(8, 24, 131072)  time:   [16.521 ns 16.545 ns 16.572 ns]
args/new/(8, 24, 131072)  time:   [11.276 ns 11.286 ns 11.297 ns]
args/old/(8, 24, 1048576) time:   [16.527 ns 16.560 ns 16.596 ns]
args/new/(8, 24, 1048576) time:   [13.310 ns 13.330 ns 13.352 ns]

Note that I ran this benchmark starting after my first commit to the repo. With the intrinsic for rem it took around an extra 8 ns for each action, which dwarfed any other optimization; and the possible extra iteration due to incorrect bounds checking also added noise.

The main effect in this optimization is clearly the reduced iterations (2nd commit), as seen from the jumps between alignments where the old impl has one more iteration (e.g. 128, 2048 or 131072) and those where both have the same number of iterations (e.g. 256, 4096, 1048576). However, even in those last ones, where they have the same number of iterations, the saving of one multiplication per iteration (3rd commit) is also substantial.

However, comparing before and after the 4th commit (which removed redundant masking inside mod_inv), it seemed to actually hurt the performance:

args/old/(8, 24, 16)      time:   [4.9234 ns 4.9342 ns 4.9463 ns]
args/new/(8, 24, 16)      time:   [7.2250 ns 7.2363 ns 7.2501 ns]
args/old/(8, 24, 32)      time:   [5.0661 ns 5.1355 ns 5.2158 ns]
args/new/(8, 24, 32)      time:   [7.2724 ns 7.2971 ns 7.3299 ns]
args/old/(8, 24, 64)      time:   [4.9248 ns 4.9422 ns 4.9637 ns]
args/new/(8, 24, 64)      time:   [7.2340 ns 7.2462 ns 7.2603 ns]
args/old/(8, 24, 128)     time:   [4.9219 ns 4.9440 ns 4.9707 ns]
args/new/(8, 24, 128)     time:   [7.2273 ns 7.2357 ns 7.2453 ns]
args/old/(8, 24, 256)     time:   [7.5179 ns 7.5548 ns 7.5983 ns]
args/new/(8, 24, 256)     time:   [9.1065 ns 9.1332 ns 9.1719 ns]
args/old/(8, 24, 512)     time:   [7.4731 ns 7.4870 ns 7.5025 ns]
args/new/(8, 24, 512)     time:   [9.1042 ns 9.1192 ns 9.1368 ns]
args/old/(8, 24, 1024)    time:   [7.4763 ns 7.4915 ns 7.5085 ns]
args/new/(8, 24, 1024)    time:   [9.1384 ns 9.1587 ns 9.1813 ns]
args/old/(8, 24, 2048)    time:   [7.4971 ns 7.5227 ns 7.5514 ns]
args/new/(8, 24, 2048)    time:   [9.0994 ns 9.1097 ns 9.1220 ns]
args/old/(8, 24, 4096)    time:   [9.6015 ns 9.6170 ns 9.6342 ns]
args/new/(8, 24, 4096)    time:   [11.212 ns 11.247 ns 11.289 ns]
args/old/(8, 24, 131072)  time:   [9.5954 ns 9.6229 ns 9.6548 ns]
args/new/(8, 24, 131072)  time:   [11.192 ns 11.210 ns 11.231 ns]
args/old/(8, 24, 1048576) time:   [11.864 ns 11.904 ns 11.947 ns]
args/new/(8, 24, 1048576) time:   [13.331 ns 13.365 ns 13.405 ns]

So I suspected that multiplying small numbers is quicker than multiplying large ones. However, if I re-added some masking on each iteration (not with the original running_mod, which we don't know, but with the final mask), this hurt performance; and if I removed the masking from the return value in the fast case (mpow <= 4), this boosted the slow case but hurt the fast one slightly! There's some optimizer magic going on; I'll try to delve into it. In the meantime, though, the first 3 commits seem solid to merge to me.

Edit: all of the strange results were due to a2 - 1 instead of a2.unchecked_sub(1) in the end, which the compiler sometimes optimized away and sometimes didn't. With this fixed, the 4th commit (removing the and-s) hurts performance in the fast case and improves it in the slow case, both insubstantially but consistently. I'll investigate this further, but it'll probably be fine to drop this change.

@amosonn amosonn force-pushed the patch-1 branch 2 times, most recently from faa82f3 to fef6176 on January 31, 2020 00:55
@amosonn
Contributor Author

amosonn commented Jan 31, 2020

Update on the last commit:

  • Removing the and from the slow path (the return inside the iteration) improves performance. This is not surprising.
  • Removing the and from the fast path (m <= 16) hurts performance; looking at the assembly, it seems that this and causes the compiler to arrange the conditional branch in a way which is favorable to the fast path. This doesn't seem like something we should rely on; however, removing this and just for the sake of it also seems wrong.
  • If this and stays, fetching the mask from outside instead of re-computing it also improves performance; however, if it stays in both cases, fetching the mask from outside hurts performance on the fast path while improving it on the slow path; I could not explain this even after looking at the assembly.

So, the currently included 4th commit here is the one which has the best performance (at least on my machine), but it isn't clear to me why. @nagisa , do you think we should include it anyway?

@nagisa
Member

nagisa commented Jan 31, 2020

@amosonn I’ll see about running these benchmarks on more varied machines and architectures (mips? ARM? PPC? that kind of stuff) over the weekend and we’ll see.

FWIW I took a brief look at the benchmark code and I think it is lacking one critical test case – when the alignment is constant[¹]. Constant alignment is the most common use-case in practice and is also what improves the runtime properties of this function the most. With it constant, the compiler optimises out the majority of the code in this function.

[¹]: I suspect that criterion will be helpful and will blackbox the argument that you are varying right now.
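
A constant-alignment case could be added to the hypothetical harness above by keeping the alignment a literal and only black_box-ing the runtime arguments, e.g.:

fn bench_const_align(c: &mut Criterion) {
    // The alignment is a plain literal, so LLVM is free to const-fold it,
    // just as it would at a typical call site.
    c.bench_function("args/new/(8, 24, const 64)", |b| {
        b.iter(|| align_offset_new(black_box(8), black_box(24), 64))
    });
}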

@amosonn
Contributor Author

amosonn commented Jan 31, 2020

Final benches:

args/align_offset_v0/(8, 24,      16) time:   [17.704 ns 17.714 ns 17.726 ns]
args/align_offset_v1/(8, 24,      16) time:   [9.4928 ns 9.5079 ns 9.5266 ns]
args/align_offset_v2/(8, 24,      16) time:   [9.8614 ns 9.8679 ns 9.8755 ns]
args/align_offset_v3/(8, 24,      16) time:   [4.8711 ns 4.8765 ns 4.8843 ns]
args/align_offset_v4/(8, 24,      16) time:   [4.3562 ns 4.3628 ns 4.3716 ns]

args/align_offset_v0/(8, 24,     128) time:   [20.516 ns 20.529 ns 20.545 ns]
args/align_offset_v1/(8, 24,     128) time:   [11.731 ns 11.738 ns 11.747 ns]
args/align_offset_v2/(8, 24,     128) time:   [9.8589 ns 9.8675 ns 9.8788 ns]
args/align_offset_v3/(8, 24,     128) time:   [4.8710 ns 4.8760 ns 4.8834 ns]
args/align_offset_v4/(8, 24,     128) time:   [4.3631 ns 4.3687 ns 4.3751 ns]

args/align_offset_v0/(8, 24,     256) time:   [23.108 ns 23.146 ns 23.193 ns]
args/align_offset_v1/(8, 24,     256) time:   [11.728 ns 11.732 ns 11.738 ns]
args/align_offset_v2/(8, 24,     256) time:   [11.909 ns 11.926 ns 11.949 ns]
args/align_offset_v3/(8, 24,     256) time:   [5.7815 ns 5.7864 ns 5.7935 ns]
args/align_offset_v4/(8, 24,     256) time:   [5.4964 ns 5.5050 ns 5.5154 ns]

args/align_offset_v0/(8, 24,     512) time:   [23.093 ns 23.106 ns 23.121 ns]
args/align_offset_v1/(8, 24,     512) time:   [14.113 ns 14.123 ns 14.135 ns]
args/align_offset_v2/(8, 24,     512) time:   [11.922 ns 11.945 ns 11.971 ns]
args/align_offset_v3/(8, 24,     512) time:   [5.7889 ns 5.8045 ns 5.8232 ns]
args/align_offset_v4/(8, 24,     512) time:   [5.5012 ns 5.5142 ns 5.5301 ns]

args/align_offset_v0/(8, 24,    2048) time:   [23.094 ns 23.131 ns 23.177 ns]
args/align_offset_v1/(8, 24,    2048) time:   [14.110 ns 14.131 ns 14.157 ns]
args/align_offset_v2/(8, 24,    2048) time:   [11.915 ns 11.937 ns 11.963 ns]
args/align_offset_v3/(8, 24,    2048) time:   [5.7853 ns 5.7980 ns 5.8180 ns]
args/align_offset_v4/(8, 24,    2048) time:   [5.5066 ns 5.5185 ns 5.5321 ns]

args/align_offset_v0/(8, 24,    4096) time:   [23.095 ns 23.111 ns 23.132 ns]
args/align_offset_v1/(8, 24,    4096) time:   [14.113 ns 14.124 ns 14.137 ns]
args/align_offset_v2/(8, 24,    4096) time:   [13.861 ns 13.870 ns 13.880 ns]
args/align_offset_v3/(8, 24,    4096) time:   [6.7730 ns 6.7797 ns 6.7895 ns]
args/align_offset_v4/(8, 24,    4096) time:   [6.3811 ns 6.3965 ns 6.4140 ns]

args/align_offset_v0/(8, 24,  131072) time:   [25.976 ns 26.018 ns 26.081 ns]
args/align_offset_v1/(8, 24,  131072) time:   [16.490 ns 16.519 ns 16.557 ns]
args/align_offset_v2/(8, 24,  131072) time:   [13.859 ns 13.866 ns 13.875 ns]
args/align_offset_v3/(8, 24,  131072) time:   [6.7823 ns 6.7983 ns 6.8178 ns]
args/align_offset_v4/(8, 24,  131072) time:   [6.3731 ns 6.3844 ns 6.3965 ns]

args/align_offset_v0/(8, 24, 1048576) time:   [26.292 ns 26.337 ns 26.392 ns]
args/align_offset_v1/(8, 24, 1048576) time:   [16.491 ns 16.501 ns 16.513 ns]
args/align_offset_v2/(8, 24, 1048576) time:   [16.052 ns 16.060 ns 16.069 ns]
args/align_offset_v3/(8, 24, 1048576) time:   [7.9321 ns 7.9426 ns 7.9573 ns]
args/align_offset_v4/(8, 24, 1048576) time:   [7.5099 ns 7.5171 ns 7.5253 ns]

@amosonn
Contributor Author

amosonn commented Jan 31, 2020

@nagisa I'm not sure I understand what you mean by constant alignment. Each bench run already checks only one specific alignment. Or did you mean going over varying offsets with the same alignment?

As for the choice of struct size and "ptr address", I chose the ones which highlight the optimization benefit of reducing the modulo (2nd commit). All the other optimizations should be just as present with other choices, though.

Edit: Ok, I understand now. I will add this to the code, and also attempt to keep the stride constant and see what happens there (that seems an even more standard use-case, as this is type-parametrized). For now I checked with constant alignment: the difference between v0 and v1 is still huge (probably intrinsics::rem doesn't get optimized away). The removal of the iteration by reducing the modulo (v1 to v2, when the struct size contains a power of 2) is still noticeable. The rest becomes quite weak, though there are still slight improvements. v3 is, however, sometimes worse than v1 (when there's no modulo reduction), so it's preferable to stay with v4.

Member

@nagisa nagisa left a comment

Thank you for preparing the benchmarks. I had to adjust them a little to ensure LLVM doesn’t just call the same code regardless of whether the arguments are constant or not. In order to save my time I also reduced the number of test cases somewhat to the more interesting ones and instead added a corner case that #[packed] structures would exercise (p = 3, stride = 5).

Here are some of the results I could gather for more platforms and they do indeed look great.

Please fix up the comments and other nits I pointed out inline and we can merge this. Thanks for working on this!

/// INV_TABLE_MOD²
const INV_TABLE_MOD_SQUARED: usize = INV_TABLE_MOD * INV_TABLE_MOD;
const INV_TABLE_MOD: usize = 1 << INV_TABLE_MOD_POW;
/// `s` such that `INV_TABLE_MOD == 2^s`.
Member

Please describe what this constant below means in relation to INV_TABLE_MOD_16. You can remove documentation for other constants as they effectively end up duplicating their implementation in documentation and are just noise.

Contributor Author

Done.

///
/// This implementation is tailored for align_offset and has following preconditions:
///
/// * `m` is a power-of-two;
/// * The requested modulu `m` is a power-of-two, so `mpow` can be an argument;
Member

s/modulu/modulo/

Contributor Author

Done.

/// * `x < m`; (if `x ≥ m`, pass in `x % m` instead)
///
/// It also leaves reducing the result modulu `m` to the caller, so the result may be larger
Member

s/modulu/modulo/

Contributor Author

Done.

// This branch solves for the following linear congruence equation:
//
// $$ p + so ≡ 0 mod a $$
//
// $p$ here is the pointer value, $s$ – stride of `T`, $o$ offset in `T`s, and $a$ – the
// requested alignment.
//
// g = gcd(a, s)
// o = (a - (p mod a))/g * ((s/g)⁻¹ mod a)
// With $g = gcd(a, s)$$, and the above asserting that $p$ is also divisible by $g$, we can
Member

Let's get rid of the $s.

Contributor Author

Did so in all comments. In the meantime, I replaced all the non-ASCII characters, as they broke at least my editor :).

@amosonn
Contributor Author

amosonn commented Feb 2, 2020

I also split the commits a little finer, so it's easier to see the progression. I guess this will all be squashed later anyway?

} else {
// We iterate "up" using the following formula:
//
// $$ xy 1 (mod 2ⁿ) → xy (2 - xy) 1 (mod 2²ⁿ) $$
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^2n) `
Contributor

Suggested change
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^2n) `
// ` xy = 1 (mod 2^n) -> xy (2 - xy) = 1 (mod 2^(2n)) `

Contributor Author

Done.

@amosonn
Contributor Author

amosonn commented Feb 2, 2020

Thanks for looking at all of this!

Regarding your benchmarking: I think it would be interesting to do some more comparisons, at least for a few cases such as v2 vs v4.

As previously noted, the difference between v0 and v1 of removing the intrinsics::rem is overwhelming, especially since the compiler might not optimize it away, and the question is what do the other optimizations bring (we also change the end condition comparison, saving an iteration on edge-cases). And from v1 to v2, we reduce the modulo, thereby reducing the amount of iterations required (for struct sizes which are divisible by a power of 2, which is the common case).

Going on, the story becomes different. From v2 to v3, we transform our loop to perform a shl operation instead of a mul for the "counter progression", which should have improved performance, at least for large enough alignments (such that a second iteration of the loop is required). This should have been true, regardless of other "cheaper" computations added as a result, such as (1 << mpow) - 1 to calculate the mask, or apow - gcdpow to calculate the end-condition of the loop. It seems, however, that in some cases it actually hurts performance - as in, v3 performs consistently worse than v2, across almost all benchmarks, and only the "correct" implementation v4 actually performs better. I chose this implementation (where the mask is passed from outside and not recomputed, and where it is applied in the small modulo case and not in the large modulo case) purely by benchmarking, but I cannot explain why this one should perform better and the other ones worse.

This seems surprising at best and brittle at worst, so I would at least like to get more convinced that other platforms behave similarly, which should mean that there is a "reason" why this is the best implementation, even if we don't understand it. If they do not, i.e. v4 performs worse than v2 on some platforms, this means that this part of the optimization is too "magical", and we should probably stick to v2 (currently the first three commits).

If you are willing to put more time into running those benchmarks, I can write a comprehensive comparison of the relevant parts. Otherwise, I would suggest merging the first 3 commits for now, and leaving the rest for a later PR.

Edit: BTW, FWIW, I think the case of the stride being constant will be even more common than that of the alignment being constant; after all, that's the size of a (specific) struct. It's just that stride is an implicit argument to the original method, however I'd expect it to be derived as a constant during monomorphization. Actually, this might mean that any benchmarks which don't consider stride to be constant are moot.
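
One hypothetical way to model that in a benchmark is a const-generic wrapper around the stand-in from the earlier sketch, so the stride behaves like the compile-time constant it becomes after monomorphization:

// Sketch only; `align_offset_new` is the placeholder from the earlier harness.
fn align_offset_const_stride<const STRIDE: usize>(p: usize, align: usize) -> usize {
    align_offset_new(p, STRIDE, align)
}

// In a bench body, only the pointer and alignment stay opaque:
//     b.iter(|| align_offset_const_stride::<24>(black_box(8), black_box(align)))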

let a2minus1 = a2.wrapping_sub(1);
let s2 = smoda >> gcdpow;
let minusp2 = a2.wrapping_sub(pmoda >> gcdpow);
// mod_pow_2_inv returns a result which may be out of $a'$-s range, but it's fine to
Contributor

Nitpick here and line below

Suggested change
// mod_pow_2_inv returns a result which may be out of $a'$-s range, but it's fine to
// mod_pow_2_inv returns a result which may be out of `a'`-s range, but it's fine to

Contributor Author

Done :)

@nagisa
Member

nagisa commented Feb 2, 2020

It's just that stride is an implicit argument to the original method, however I'd expect it to be derived as a constant during monomorphization. Actually, this might mean that any benchmarks which don't consider stride to be constant are moot.

I did explicitly account for that in my adaptation of the benchmarks (where I made the stride a const generic).


…and we should probably stick to v2…

Yeah, let's land the simple and most impactful changes first (the first three commits) and consider the other ones in isolation.

I did some assembly spelunking while benchmarks were running (const stride here is presumed, I didn’t even look at the cases where stride is not constant, as they are not relevant):

  • When const align, at any version (v0, v2 or v4), the generated assembly is exactly the same. The assembly for this scenario was already optimal, and I doubt any changes will be able to improve it much.
    • Remember, const align is the common use-case of this function, which is why I didn’t consider rem to be much of a problem when initially writing this code.
  • When !const align, the story is different: the most costly operation in v0, the rem, surfaces into machine code. Since v2 removes rem, that’s by itself a huge improvement to the throughput of this function. v4 appears to slim down the assembly somewhat over v2, but the overall cycle cost difference between the two is minuscule.

This makes me wonder if the complexity in reasoning that v4 adds is worth it. I’d rather have code be readable than squeeze out a single-digit improvement in cycles.

EDIT: I can confirm your own results that there is a significant improvement in observed throughput between v2 and v4… I didn’t compile the results across all the machines yet, though.

- Stopping condition inside mod_inv can be >= instead of >
- Remove intrinsics::unchecked_rem, we are working modulo powers-of-2 so
  we can simply mask
- When calculating the inverse, it's enough to work `mod a/g` instead
  of `mod a`.
- As explained in the comment inside mod_inv, it is valid to work mod
  `usize::max_value()` right until the end.
- Instead of squaring the modulo until it is larger than the required
  one, we double log2 of it. This means a shift instead of a mul each
  iteration.
- Pass mask as arg to mod_pow_2_inv instead of calculating it again.
- Remove redundant masking from mod_pow_2_inv, caller site already takes
  care of this (after mul): according to benchmarking, the best
  performance is achieved when this masking is present in the fast
  branch (for small modulos), but absent from the slow branch.
@amosonn amosonn closed this Feb 3, 2020
@amosonn amosonn deleted the patch-1 branch February 3, 2020 00:04
@amosonn
Contributor Author

amosonn commented Feb 3, 2020

Ok, in attempting to trim this down to the first three commits, I got into a fight with GitHub and closed this :) Here's the new PR: #68787
But I suggest we keep the discussion here. Once that one's merged, I'll open a second one for the rest of the PR.

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Feb 3, 2020
Optimize core::ptr::align_offset (part 1)

r? @nagisa
See rust-lang#68616 for main discussion.