Optimise .fill() throughput #43
Merged
Looking at the fill method, I noticed there was some room for optimisation, including removing an unnecessary `unwrap()` from the hot loop. Also, the remainder portion was generating a new entropy block for every `u8`, when any remaining slice would always be shorter than 8 bytes, or one `u64` block (which is what `WyRand` generates per new state). Therefore, a maximum of seven `u8` calls can be reduced to just one `u64` block.

Afterwards, I simplified the copying of entropy to make use of `copy_from_slice`, since we know that either the slices will match (and need no `try_into` conversions), or that the length of the input will always be smaller than the target.

I then benchmarked the resulting code on my AMD Ryzen 5 Pro 2500U (4C/8T, 2.0GHz base, 3.6GHz max) with 16GB RAM.
Before:
After:
I was getting 68-70ns before, and 64-65ns afterwards. I then added the `#[inline]` annotation out of curiosity and tested it, and got even more performance as a result:
After with `#[inline]`:
So it might be advantageous to include it. The end result with all the changes here is that we've gone from 68-70ns to 64-65ns, and then to 56ns with `#[inline]`.
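
For anyone skimming, here is a minimal sketch of the shape of the change to the fill method described above. It is not the actual diff: `SketchRng`, `next_block`, and the wyrand-style mixing step are hypothetical stand-ins for illustration, and the real code in this PR may be structured differently.

```rust
/// Hypothetical stand-in generator, used only to illustrate the fill logic.
struct SketchRng {
    state: u64,
}

impl SketchRng {
    /// Advances the state once and returns one 8-byte block of output.
    /// The mixing step is a wyrand-style stand-in (an assumption).
    fn next_block(&mut self) -> [u8; 8] {
        self.state = self.state.wrapping_add(0xa076_1d64_78bd_642f);
        let t = u128::from(self.state) * u128::from(self.state ^ 0xe703_7ed1_a0b4_28db);
        (((t >> 64) as u64) ^ (t as u64)).to_le_bytes()
    }

    #[inline]
    fn fill(&mut self, buffer: &mut [u8]) {
        let mut chunks = buffer.chunks_exact_mut(8);
        for chunk in &mut chunks {
            // Each chunk is exactly 8 bytes, so the lengths always match
            // and no try_into or unwrap is needed in the hot loop.
            chunk.copy_from_slice(&self.next_block());
        }
        let remainder = chunks.into_remainder();
        if !remainder.is_empty() {
            // At most 7 bytes are left, so a single block covers all of
            // them instead of one new block per remaining byte.
            let block = self.next_block();
            remainder.copy_from_slice(&block[..remainder.len()]);
        }
    }
}
```

With this shape, filling a 19-byte buffer advances the state three times (two full chunks plus one block for the 3-byte remainder) rather than five.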