-
Notifications
You must be signed in to change notification settings - Fork 523
Conversation
{ | ||
return GetAsciiStringStack(start.Block.Array, start.Index, length); | ||
return default(string); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null
f993941
to
3643f03
Compare
{ | ||
output[i + outputOffset] = (char)input[i + inputOffset]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that this PR removes more code than it adds, but this part now seems slightly more complicated. Are we sure that manual unrolling is faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep: #399 (comment)
Non-unrolled (SequentialPointer)
Incrementing value on each unroll step (UnrolledPointer)
Adding to value on each step; and incrementing it at end (UnrolledParallelPointer)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that now 😄 It's interesting how UnrolledParallelPointer is so consistently better from 1 character all the way to 16384 character strings.
It might be interesting to see how something unaligned like a 33 character string performs, since I don't think any of the test cases hit both loops in GetAsciiStringImplementationB
(UnrolledParallelPointer).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah good point, will get that in new tests when doing looking for new numbers.
The ParallelPointer works because the cpu can pipeline the instructions as they are independent of each other (i.e. the effects of one line don't effect the results of the next); incrementing the pointer at each step means the next step can't run until that pointer addition is completed as its dependent on the result. Is also one of the problems with i++
in a regular loop; the next i++
can't be run until that has completed as it is dependent of its result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also If I didn't unroll I think this would be a perf regression as the string is initialized; rather than being operations on initialized stack memory; then a memcopy to an uninitialized string; however I think the code being less weird makes up for that...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely like the code being less weird. Interesting point about the instruction pipelining. That explains the somewhat unintuitive results. It's amazing the stuff you have to consider at this level of optimization. Someday maybe the compiler will be smart enough to do this kind of thing for us.
I'm going to keep this PR open pending the new numbers.
3643f03
to
ea823bf
Compare
Updated with unrolling and improved instruction ordering from #519 |
{ | ||
output[i + outputOffset] = (char)input[i + inputOffset]; | ||
fixed (byte* blockStart = block.Array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a further combining change that will this fixed block (as buffers are already fixed) - but this stands alone as is.
1d26e56
to
94fb53e
Compare
Thanks for all research/testing you put into this. LGTM. |
94fb53e
to
cd502ac
Compare
cd502ac
to
ea3e64a
Compare
Rebased |
Actually, before I merge this, do you have the test script and results that show this is the fasted way to create a string from ASCII bytes? #399 (comment) doesn't cover the current implementation. |
Sure :D |
Still need to do the benchmarking with various sizes; though taking a step as the bracketed chunk where 119 being 12 steps (120 dropping back to 10 steps as is multiple of 12)
Working for a 14 stage pipelined cpu so Is different for the non-byte conversion copy as int and longs (and vectors) can be used. |
Running benchmarks now |
Benchmark using Actually creating the strings to test properly. BenchmarkDotNet-Dev=v0.8.0.0+
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=AsciiCreate Mode=Throughput Platform=X64 Jit=RyuJit .NET=HostFramework toolchain=Classic Runtime=Clr Warmup=5 Target=10
Though |
Thanks for redoing the tests. |
+1% RPS - also less complicated; with a single code path rather than three,
Before
After