Chacha: performance improvements #1192

kazcw · 2021-10-14T05:40:19Z

Improve AVX2 vectorizability of copying results to buffer

Also use a faster method of incrementing the pos counter on LE.
Total performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

CPU: E5-2620 v3 (avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     501,639 ns/iter (+/- 411) = 2041 MB/s
test gen_bytes_chacha12      ... bench:     637,970 ns/iter (+/- 485) = 1605 MB/s
test gen_bytes_chacha20      ... bench:     927,819 ns/iter (+/- 6,716) = 1103 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     367,249 ns/iter (+/- 205) = 2788 MB/s (+37%)
test gen_bytes_chacha12      ... bench:     512,407 ns/iter (+/- 1,121) = 1998 MB/s (+24%)
test gen_bytes_chacha20      ... bench:     803,428 ns/iter (+/- 1,592) = 1274 MB/s (+16%)

CPU: X5640L (no avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     873,669 ns/iter (+/- 682) = 1172 MB/s
test gen_bytes_chacha12      ... bench:   1,179,485 ns/iter (+/- 2,889) = 868 MB/s
test gen_bytes_chacha20      ... bench:   1,768,024 ns/iter (+/- 3,030) = 579 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     826,541 ns/iter (+/- 1,599) = 1238 MB/s (+6%)
test gen_bytes_chacha12      ... bench:   1,146,041 ns/iter (+/- 3,173) = 893 MB/s (+3%)
test gen_bytes_chacha20      ... bench:   1,756,848 ns/iter (+/- 4,348) = 582 MB/s (+1%)

kazcw · 2021-10-14T05:47:16Z

I noticed travis tests failing to build the ppv-lite update, so I yanked that version. The problem relates to non-x86_64 platforms. I'll take this out of draft when I have that working.

On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.

Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

dhardy · 2021-10-14T10:00:32Z

Also for comparison on Zen3 5800X:

# BEFORE
test gen_bytes_chacha8       ... bench:     228,002 ns/iter (+/- 2,042) = 4491 MB/s
test gen_bytes_chacha12      ... bench:     297,280 ns/iter (+/- 2,171) = 3444 MB/s
test gen_bytes_chacha20      ... bench:     435,253 ns/iter (+/- 3,133) = 2352 MB/s
# AFTER
test gen_bytes_chacha8       ... bench:     153,607 ns/iter (+/- 1,184) = 6666 MB/s (+48%)
test gen_bytes_chacha12      ... bench:     223,379 ns/iter (+/- 8,808) = 4584 MB/s (+33%)
test gen_bytes_chacha20      ... bench:     361,063 ns/iter (+/- 3,144) = 2836 MB/s (+21%)

vks · 2021-10-14T10:46:59Z

Those are some very impressive improvements! Could you please also update the rand_chacha changelog?

dhardy · 2021-10-14T16:48:44Z

I'm not sure if calling it a 48% improvement in the changelog is honest. These are my numbers for the 0.8.0 release:

test gen_bytes_chacha8       ... bench:     164,178 ns/iter (+/- 5,078) = 6237 MB/s
test gen_bytes_chacha12      ... bench:     233,660 ns/iter (+/- 5,813) = 4382 MB/s
test gen_bytes_chacha20      ... bench:     374,233 ns/iter (+/- 6,502) = 2736 MB/s

Compared to that, this PR's performance is +7%, +5%, +4% (almost margin of error).

#1181 is what decreased performance; this is just recovering it.

dhardy

Anyway, thanks for the fix @kazcw! Other than the misleading changelog I approve the PR.

kazcw · 2021-10-14T16:59:33Z

Oh, that's disappointing. I didn't realize that PR was never reverted.

dhardy · 2021-10-15T10:03:20Z

That PR removes an unsafe and ideally wants more looking into. Sorry, I assumed this was an attempt to optimise the result.

I can't cleanly revert #1181 on top of this PR (for testing purposes).

kazcw · 2021-10-15T15:42:25Z

My thinking was that no incremental improvement to #1181 was possible--the compiler couldn't see through that choice of abstraction for AVX2, so that was it. The goal was still valid, but that attempt had failed. I expected it would be reverted and didn't realize that it hadn't yet. (I should have communicated more there and made sure we were all on the same page.)

I started this as a new attempt at improving outputting, since after #1181 I realized that the optimal approach for AVX2 would need an explicit 4x4(128b)-transpose. Because I made a wrong assumption about which outputting code I was replacing, I was mistaken about what the baseline was. Sorry for the confusion, it made this small win a lot less exciting.

But anyway, progress is progress.

dhardy · 2021-10-18T09:28:50Z

Okay — so to go from here, shall I merge this, then you can optionally create a new PR to revert/adjust #1181?

Also, it would be useful to have a changelog entry of some kind.

kazcw · 2021-10-22T16:21:46Z

Updated changelog.

Nothing more need be done about #1181.

dhardy · 2021-10-23T10:40:36Z

Thanks @kazcw.

kazcw marked this pull request as draft October 14, 2021 05:45

kazcw marked this pull request as ready for review October 14, 2021 06:05

kazcw force-pushed the chacha-faster branch from f92197f to ad45758 Compare October 14, 2021 06:06

kazcw marked this pull request as draft October 14, 2021 06:09

kazcw added 2 commits October 13, 2021 23:20

rand_chacha: improve little-endian performance incrementing pos counter

6e6cc35

On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.

rand_chacha: optimize outputting

6455dc4

Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

kazcw force-pushed the chacha-faster branch from ad45758 to 6455dc4 Compare October 14, 2021 06:20

kazcw marked this pull request as ready for review October 14, 2021 06:26

dhardy approved these changes Oct 14, 2021

View reviewed changes

kazcw force-pushed the chacha-faster branch from f76c1f0 to 6455dc4 Compare October 14, 2021 16:59

rand_chacha: update changelog

fb7af73

dhardy merged commit 0f4fc6b into rust-random:master Oct 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Chacha: performance improvements #1192

Chacha: performance improvements #1192

kazcw commented Oct 14, 2021

kazcw commented Oct 14, 2021

dhardy commented Oct 14, 2021 •

edited

Loading

vks commented Oct 14, 2021

dhardy commented Oct 14, 2021 •

edited

Loading

dhardy left a comment

kazcw commented Oct 14, 2021

dhardy commented Oct 15, 2021 •

edited

Loading

kazcw commented Oct 15, 2021

dhardy commented Oct 18, 2021

kazcw commented Oct 22, 2021

dhardy commented Oct 23, 2021

Chacha: performance improvements #1192

Chacha: performance improvements #1192

Conversation

kazcw commented Oct 14, 2021

Improve AVX2 vectorizability of copying results to buffer

CPU: E5-2620 v3 (avx2)

CPU: X5640L (no avx2)

kazcw commented Oct 14, 2021

dhardy commented Oct 14, 2021 • edited Loading

vks commented Oct 14, 2021

dhardy commented Oct 14, 2021 • edited Loading

dhardy left a comment

Choose a reason for hiding this comment

kazcw commented Oct 14, 2021

dhardy commented Oct 15, 2021 • edited Loading

kazcw commented Oct 15, 2021

dhardy commented Oct 18, 2021

kazcw commented Oct 22, 2021

dhardy commented Oct 23, 2021

dhardy commented Oct 14, 2021 •

edited

Loading

dhardy commented Oct 14, 2021 •

edited

Loading

dhardy commented Oct 15, 2021 •

edited

Loading