
Chacha: performance improvements #1192


Merged: 3 commits into rust-random:master, Oct 23, 2021

Conversation

@kazcw (Contributor) commented Oct 14, 2021

Improve AVX2 vectorizability of copying results to buffer

Also use a faster method of incrementing the pos counter on LE.
Total performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).

CPU: E5-2620 v3 (avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     501,639 ns/iter (+/- 411) = 2041 MB/s
test gen_bytes_chacha12      ... bench:     637,970 ns/iter (+/- 485) = 1605 MB/s
test gen_bytes_chacha20      ... bench:     927,819 ns/iter (+/- 6,716) = 1103 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     367,249 ns/iter (+/- 205) = 2788 MB/s (+37%)
test gen_bytes_chacha12      ... bench:     512,407 ns/iter (+/- 1,121) = 1998 MB/s (+24%)
test gen_bytes_chacha20      ... bench:     803,428 ns/iter (+/- 1,592) = 1274 MB/s (+16%)

CPU: X5640L (no avx2)

BEFORE

test gen_bytes_chacha8       ... bench:     873,669 ns/iter (+/- 682) = 1172 MB/s
test gen_bytes_chacha12      ... bench:   1,179,485 ns/iter (+/- 2,889) = 868 MB/s
test gen_bytes_chacha20      ... bench:   1,768,024 ns/iter (+/- 3,030) = 579 MB/s

AFTER

test gen_bytes_chacha8       ... bench:     826,541 ns/iter (+/- 1,599) = 1238 MB/s (+6%)
test gen_bytes_chacha12      ... bench:   1,146,041 ns/iter (+/- 3,173) = 893 MB/s (+3%)
test gen_bytes_chacha20      ... bench:   1,756,848 ns/iter (+/- 4,348) = 582 MB/s (+1%)
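
(Numbers like these come from the repository's benchmark suite; something along the lines of cargo +nightly bench gen_bytes_chacha should reproduce the gen_bytes_chacha* measurements on a given machine, though the exact invocation is an assumption here, not something stated in this thread.)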

@kazcw marked this pull request as draft, October 14, 2021 05:45
@kazcw (Contributor, author) commented Oct 14, 2021

I noticed the Travis tests failing to build the ppv-lite update, so I yanked that version. The problem relates to non-x86_64 platforms; I'll take this out of draft when I have that working.

@kazcw marked this pull request as ready for review, October 14, 2021 06:05
@kazcw marked this pull request as draft, October 14, 2021 06:09
kazcw added 2 commits on October 13, 2021 at 23:20:
1. On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.
2. Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).
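
As a rough sketch of the counter trick in the first commit (illustration only, with an assumed state layout rather than the crate's actual code): when the 64-bit pos counter is stored as two adjacent u32 words, low word first, a little-endian host can treat the pair as one u64 in memory and bump it with a single 64-bit add (one vector add on u64 lanes in SIMD form) instead of a manual add-with-carry.

// Sketch only: assumes a [u32; 16] state whose words 12 and 13 hold the
// 64-bit pos counter, low word first.
#[cfg(target_endian = "little")]
fn increment_pos(state: &mut [u32; 16], blocks: u64) {
    // On little-endian, the two packed words are exactly the in-memory
    // representation of a u64, so one 64-bit add replaces the carry chain.
    let p = state[12..].as_mut_ptr().cast::<u64>();
    unsafe {
        let pos = p.read_unaligned().wrapping_add(blocks);
        p.write_unaligned(pos);
    }
}

#[cfg(not(target_endian = "little"))]
fn increment_pos(state: &mut [u32; 16], blocks: u64) {
    // Portable fallback: assemble, add, and split, keeping the
    // low-word-first packing.
    let pos = (((state[13] as u64) << 32) | (state[12] as u64)).wrapping_add(blocks);
    state[12] = pos as u32;
    state[13] = (pos >> 32) as u32;
}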
@kazcw marked this pull request as ready for review, October 14, 2021 06:26
@dhardy (Member) commented Oct 14, 2021

Also for comparison on Zen3 5800X:

# BEFORE
test gen_bytes_chacha8       ... bench:     228,002 ns/iter (+/- 2,042) = 4491 MB/s
test gen_bytes_chacha12      ... bench:     297,280 ns/iter (+/- 2,171) = 3444 MB/s
test gen_bytes_chacha20      ... bench:     435,253 ns/iter (+/- 3,133) = 2352 MB/s
# AFTER
test gen_bytes_chacha8       ... bench:     153,607 ns/iter (+/- 1,184) = 6666 MB/s (+48%)
test gen_bytes_chacha12      ... bench:     223,379 ns/iter (+/- 8,808) = 4584 MB/s (+33%)
test gen_bytes_chacha20      ... bench:     361,063 ns/iter (+/- 3,144) = 2836 MB/s (+21%)

@vks (Collaborator) commented Oct 14, 2021

Those are some very impressive improvements! Could you please also update the rand_chacha changelog?

@dhardy (Member) commented Oct 14, 2021

I'm not sure if calling it a 48% improvement in the changelog is honest. These are my numbers for the 0.8.0 release:

test gen_bytes_chacha8       ... bench:     164,178 ns/iter (+/- 5,078) = 6237 MB/s
test gen_bytes_chacha12      ... bench:     233,660 ns/iter (+/- 5,813) = 4382 MB/s
test gen_bytes_chacha20      ... bench:     374,233 ns/iter (+/- 6,502) = 2736 MB/s

Compared to that, this PR's performance is +7%, +5%, +4% (almost within the margin of error).
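
(Those figures are simply the ns/iter ratios of the two runs: 164,178 / 153,607 ≈ 1.07, 233,660 / 223,379 ≈ 1.05, and 374,233 / 361,063 ≈ 1.04.)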

#1181 is what decreased performance; this is just recovering it.

@dhardy (Member) left a review comment

Anyway, thanks for the fix @kazcw! Other than the misleading changelog, I approve the PR.

@kazcw (Contributor, author) commented Oct 14, 2021

Oh, that's disappointing. I didn't realize that PR was never reverted.

@dhardy (Member) commented Oct 15, 2021

That PR removes an unsafe block and ideally deserves a closer look. Sorry, I assumed this was an attempt to optimise the result.

I can't cleanly revert #1181 on top of this PR (for testing purposes).

@kazcw (Contributor, author) commented Oct 15, 2021

My thinking was that no incremental improvement to #1181 was possible: the compiler couldn't see through that choice of abstraction for AVX2, so that was it. The goal was still valid, but that attempt had failed. I expected it would be reverted and didn't realize that it hadn't been yet. (I should have communicated more there and made sure we were all on the same page.)

I started this as a new attempt at improving the output path, since after #1181 I realized that the optimal approach for AVX2 would need an explicit 4x4 (128-bit) transpose. Because I made a wrong assumption about which output code I was replacing, I was mistaken about what the baseline was. Sorry for the confusion; it made this small win a lot less exciting.
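
To illustrate what that transpose buys (a plain-array sketch under an assumed layout, not the crate's SIMD code): with four blocks computed in parallel, each register typically holds the same 128-bit row of all four blocks, while the output buffer wants each block's 64 bytes laid out contiguously. Transposing the 4x4 grid of 128-bit chunks turns the scattered copy into straight-line contiguous stores, which with AVX2 amounts to a handful of 128-bit lane shuffles.

type Row = [u32; 4]; // one 128-bit row of one ChaCha block

// rows[r][b] holds row r of block b (row-major, as produced in parallel);
// the result groups the four rows of each block together so each block can
// be copied to the output buffer as one contiguous 64-byte run.
fn transpose_4x4(rows: [[Row; 4]; 4]) -> [[Row; 4]; 4] {
    let mut blocks = [[[0u32; 4]; 4]; 4];
    for r in 0..4 {
        for b in 0..4 {
            blocks[b][r] = rows[r][b];
        }
    }
    blocks
}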

But anyway, progress is progress.

@dhardy (Member) commented Oct 18, 2021

Okay, so to move on from here: shall I merge this, and then you can optionally create a new PR to revert/adjust #1181?

Also, it would be useful to have a changelog entry of some kind.

@kazcw (Contributor, author) commented Oct 22, 2021

Updated changelog.

Nothing more need be done about #1181.

@dhardy (Member) commented Oct 23, 2021

Thanks @kazcw.

@dhardy merged commit 0f4fc6b into rust-random:master on Oct 23, 2021