Speedup int log10 branchless #88788
Conversation
This is achieved with a branchless bit-twiddling implementation of the case x < 100_000, which is then used as a building block for the wider integer types.

Benchmark on an Intel i7-8700K (Coffee Lake):

name                                    old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
num::int_log::u8_log10_predictable              165          169             4    2.42%   x 0.98
num::int_log::u8_log10_random                   438          423           -15   -3.42%   x 1.04
num::int_log::u8_log10_random_small             438          423           -15   -3.42%   x 1.04
num::int_log::u16_log10_predictable             633          417          -216  -34.12%   x 1.52
num::int_log::u16_log10_random                  908          471          -437  -48.13%   x 1.93
num::int_log::u16_log10_random_small            945          471          -474  -50.16%   x 2.01
num::int_log::u32_log10_predictable           1,496        1,340          -156  -10.43%   x 1.12
num::int_log::u32_log10_random                1,076          873          -203  -18.87%   x 1.23
num::int_log::u32_log10_random_small          1,145          874          -271  -23.67%   x 1.31
num::int_log::u64_log10_predictable           4,005        3,171          -834  -20.82%   x 1.26
num::int_log::u64_log10_random                1,247        1,021          -226  -18.12%   x 1.22
num::int_log::u64_log10_random_small          1,265          921          -344  -27.19%   x 1.37
num::int_log::u128_log10_predictable         39,667       39,579           -88   -0.22%   x 1.00
num::int_log::u128_log10_random               6,456        6,696           240    3.72%   x 0.96
num::int_log::u128_log10_random_small         4,108        3,903          -205   -4.99%   x 1.05

Benchmark on an M1 Mac Mini:

name                                    old ns/iter  new ns/iter  diff ns/iter   diff %  speedup
num::int_log::u8_log10_predictable              143          130           -13   -9.09%   x 1.10
num::int_log::u8_log10_random                   375          325           -50  -13.33%   x 1.15
num::int_log::u8_log10_random_small             376          325           -51  -13.56%   x 1.16
num::int_log::u16_log10_predictable             500          322          -178  -35.60%   x 1.55
num::int_log::u16_log10_random                  794          405          -389  -48.99%   x 1.96
num::int_log::u16_log10_random_small          1,035          405          -630  -60.87%   x 2.56
num::int_log::u32_log10_predictable           1,144          894          -250  -21.85%   x 1.28
num::int_log::u32_log10_random                  832          786           -46   -5.53%   x 1.06
num::int_log::u32_log10_random_small            832          787           -45   -5.41%   x 1.06
num::int_log::u64_log10_predictable           2,681        2,057          -624  -23.27%   x 1.30
num::int_log::u64_log10_random                1,015          806          -209  -20.59%   x 1.26
num::int_log::u64_log10_random_small          1,004          795          -209  -20.82%   x 1.26
num::int_log::u128_log10_predictable         56,825       56,526          -299   -0.53%   x 1.01
num::int_log::u128_log10_random               9,056        8,861          -195   -2.15%   x 1.02
num::int_log::u128_log10_random_small         1,528        1,527            -1   -0.07%   x 1.00

The 128-bit case remains ridiculously slow, because LLVM fails to optimize division by a constant 128-bit value into multiplications. This could be worked around, but it seems preferable to fix this in LLVM.

From u32 up, a table lookup (as suggested in rust-lang#70887 (comment)) is still faster, but it requires a hardware leading_zeros to be viable and might clog up the cache.
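To make the approach concrete, here is a minimal sketch of the building-block idea, not the PR's actual bit-twiddling constants: for 1 <= x < 100_000 the base-10 logarithm can be computed as a sum of comparison results, which compile to flag-setting instructions rather than branches, and the full u32 range can reduce into that window by dividing by 100_000. The function names below are hypothetical.

```rust
// Illustrative sketch only; the PR's real implementation uses different
// bit tricks, but the structure (branchless block below 100_000, reused
// as a building block for wider ranges) is the same idea.

/// Base-10 logarithm for 1 <= x < 100_000, computed without branches:
/// each comparison yields 0 or 1, and their sum is the digit count minus one.
fn log10_lt_100_000(x: u32) -> u32 {
    debug_assert!(x >= 1 && x < 100_000);
    (x >= 10) as u32 + (x >= 100) as u32 + (x >= 1_000) as u32 + (x >= 10_000) as u32
}

/// Building-block usage for the full u32 range: values of 100_000 or more are
/// reduced by dividing by 100_000 (the quotient is at most 42_949, so it fits
/// the block again), then the 5 digits that were divided out are added back.
fn u32_log10(x: u32) -> u32 {
    if x >= 100_000 {
        5 + log10_lt_100_000(x / 100_000)
    } else {
        log10_lt_100_000(x)
    }
}

fn main() {
    assert_eq!(u32_log10(1), 0);
    assert_eq!(u32_log10(9), 0);
    assert_eq!(u32_log10(10), 1);
    assert_eq!(u32_log10(99_999), 4);
    assert_eq!(u32_log10(100_000), 5);
    assert_eq!(u32_log10(u32::MAX), 9);
}
```

Because the result of the comparison sum does not depend on branch prediction, this shape is presumably why the *_random benchmarks above show the largest improvements.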
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @yaahc (or someone else) soon. Please see the contribution instructions for more information.
Yes, please avoid adding code to the standard library to work around poor i128 division performance. Codegen is indeed the right place to fix that, and there have been repeated attempts at this in the past.
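For context, a sketch of the strength reduction being discussed, shown for a width where LLVM already performs it (the function name is hypothetical, and the code is not from this PR): division by a constant is rewritten as a widening multiply by a precomputed reciprocal followed by a shift, e.g. 0xCCCC_CCCD with a 35-bit shift for u32 division by 10. The missing piece is the analogous lowering for 128-bit constants.

```rust
// Illustrative sketch, not code from this PR: the multiply-and-shift form
// that the div-by-constant lowering produces for native widths.
// 0xCCCC_CCCD is the standard reciprocal magic constant for dividing a
// 32-bit value by 10; the result is exact for every u32 input.
fn div10_via_reciprocal(n: u32) -> u32 {
    ((n as u64 * 0xCCCC_CCCD) >> 35) as u32
}

fn main() {
    // Spot-check the identity against ordinary division.
    for n in [0u32, 1, 9, 10, 12_345, 100_000, u32::MAX] {
        assert_eq!(div10_via_reciprocal(n), n / 10);
    }
}
```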
@bors r+
📌 Commit 57c6235 has been approved by |
☀️ Test successful - checks-actions |
Finished benchmarking commit (ffdf18d): comparison url. Summary: This benchmark run did not return any relevant changes. If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression |