Chacha: performance improvements #1192
I noticed Travis tests failing to build the ppv-lite update, so I yanked that version. The problem relates to non-x86_64 platforms. I'll take this out of draft when I have that working.
On little-endian platforms we can use native vector operations to increment the pos counter, because it is packed little-endian into the state vector.
Improve AVX2 vectorizability of copying results to buffer. Performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).
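The counter increment described above can be illustrated with a minimal sketch. It assumes the 64-bit pos counter occupies the first two 32-bit words of the last state row, low word first; the function names `add_pos_portable` and `add_pos_le` are hypothetical and are not taken from this PR.

```rust
/// Portable version: rebuild the 64-bit counter from its two 32-bit halves,
/// add, then split it back into words. (Assumed layout: low word first.)
fn add_pos_portable(d: [u32; 4], i: u64) -> [u32; 4] {
    let pos = (u64::from(d[1]) << 32) | u64::from(d[0]);
    let pos = pos.wrapping_add(i);
    [pos as u32, (pos >> 32) as u32, d[2], d[3]]
}

/// Little-endian shortcut: because the counter is packed little-endian,
/// the first two words already form a native u64 in memory, so the row can
/// be viewed as two 64-bit lanes and the add applied to lane 0 directly.
#[cfg(target_endian = "little")]
fn add_pos_le(d: [u32; 4], i: u64) -> [u32; 4] {
    // Both types are 16 bytes and every bit pattern is valid for each,
    // so this by-value reinterpretation is sound.
    let mut lanes: [u64; 2] = unsafe { core::mem::transmute(d) };
    lanes[0] = lanes[0].wrapping_add(i);
    unsafe { core::mem::transmute(lanes) }
}
```

On a little-endian target both functions return the same row; the second maps to a single 64-bit lane add, which is the kind of native vector operation the description refers to.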
Also for comparison on Zen3 5800X:
Those are some very impressive improvements! Could you please also update the |
I'm not sure if calling it a 48% improvement in the changelog is honest. These are my numbers for the 0.8.0 release:
Compared to that, this PR's performance is +7%, +5%, +4% (almost margin of error). #1181 is what decreased performance; this is just recovering it.
Anyway, thanks for the fix @kazcw! Other than the misleading changelog I approve the PR.
Oh, that's disappointing. I didn't realize that PR was never reverted.
That PR removes an I can't cleanly revert #1181 on top of this PR (for testing purposes). |
My thinking was that no incremental improvement to #1181 was possible: the compiler couldn't see through that choice of abstraction for AVX2, so that was it. The goal was still valid, but that attempt had failed. I expected it would be reverted and didn't realize that it hadn't been yet. (I should have communicated more there and made sure we were all on the same page.) I started this as a new attempt at improving the output path, since after #1181 I realized that the optimal approach for AVX2 would need an explicit 4x4 transpose of 128-bit lanes. Because I made a wrong assumption about which output code I was replacing, I was mistaken about what the baseline was. Sorry for the confusion; it made this small win a lot less exciting. But anyway, progress is progress.
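For readers unfamiliar with the transpose mentioned above, here is a scalar model of the idea, not the vectorized code from this PR. It assumes four ChaCha blocks are computed in parallel, with each state row (a, b, c, d) held as four 128-bit lanes, one per block; the names `Lane`, `transpose4x4` and `write_blocks` are hypothetical.

```rust
/// One 128-bit lane: four 32-bit words of a single row of a single block.
type Lane = [u32; 4];

/// Input is indexed [row][block]; output is indexed [block][row].
/// This is the 4x4 transpose of 128-bit elements.
fn transpose4x4(rows: [[Lane; 4]; 4]) -> [[Lane; 4]; 4] {
    let mut out = [[[0u32; 4]; 4]; 4];
    for (r, row) in rows.iter().enumerate() {
        for (blk, lane) in row.iter().enumerate() {
            out[blk][r] = *lane;
        }
    }
    out
}

/// After the transpose, each block's 16 words are contiguous, so the copy
/// into the output buffer is a straight sequential write.
fn write_blocks(rows: [[Lane; 4]; 4], buf: &mut [u32; 64]) {
    let blocks = transpose4x4(rows);
    for (blk, block) in blocks.iter().enumerate() {
        for (r, lane) in block.iter().enumerate() {
            let start = blk * 16 + r * 4;
            buf[start..start + 4].copy_from_slice(lane);
        }
    }
}
```

In an AVX2 implementation the same reordering would be done with 128-bit lane permutes on 256-bit registers rather than with loops; the model only shows which lane ends up where.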
Okay, so to go on from here: shall I merge this, and then you can optionally create a new PR to revert or adjust #1181? Also, it would be useful to have a changelog entry of some kind.
Updated changelog. Nothing more need be done about #1181.
Thanks @kazcw.
Improve AVX2 vectorizability of copying results to buffer
Also use a faster method of incrementing the pos counter on LE.
Total performance gain measured at 15% (ChaCha20) to 37% (ChaCha8).
Benchmarked before and after on two CPUs: E5-2620 v3 (avx2) and X5640L (no avx2).