-
Notifications
You must be signed in to change notification settings - Fork 523
Much faster Seek perf + non vector path #524
Conversation
private static readonly Vector<byte> _vectorColons = new Vector<byte>((byte)':'); | ||
private static readonly Vector<byte> _vectorSpaces = new Vector<byte>((byte)' '); | ||
private static readonly Vector<byte> _vectorQuestionMarks = new Vector<byte>((byte)'?'); | ||
private static readonly Vector<byte> _vectorPercentages = new Vector<byte>((byte)'%'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is because readonly can't be passed by ref :(
There is a further combining change that will remove all the fixed blocks (as buffers are already fixed) - but this stands alone as is. |
3a501ec
to
0578baf
Compare
0578baf
to
429c04f
Compare
I want to see the individual perf impact of the new |
Ok with using what's here for that? |
{ | ||
private readonly static int _vectorSpan = Vector<byte>.Count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this just a style change, or is there a perf impact to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not trusting it to be an intrinsic const when its not hardware accelerated as it is going through a static property to a non-readonly static. Whereas know that the readonly static int
will be a jitted const.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
429c04f
to
8171887
Compare
|
||
if (byte0Equals.Equals(Vector<byte>.Zero)) | ||
#endif | ||
if (following >= _vectorSpan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In coreclr x64 release version this code is generated (changed line 213 to compare the generated code).
213: if (following >= _vectorSpan && following >= Vector<byte>.Count)
000007FE16EDBE44 mov rcx,7FE16F59858h
000007FE16EDBE4E mov edx,21h
000007FE16EDBE53 call 000007FE76BFBC50
000007FE16EDBE58 cmp r14d,dword ptr [7FE16F5992Ch]
000007FE16EDBE5F jl 000007FE16EDBF10
000007FE16EDBE65 cmp r14d,10h
000007FE16EDBE69 jl 000007FE16EDBF10
the first 3 lines is the lazy initialize of _vectorSpan. Because this is a struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm (line number down one as commented out the Vector.IsHardwareAccelerated
check)
x86
215: if (following >= _vectorSpan)
0569545A cmp dword ptr [ebp-10h],10h
0569545E jl 0569554D
216: {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But as you say x64 coreclr
213: if (following >= _vectorSpan)
00007FFEA7EFD24F mov rcx,7FFEA7DF0B28h
00007FFEA7EFD259 mov edx,21h
00007FFEA7EFD25E call 00007FFF079EBC50
00007FFEA7EFD263 mov rdx,7FFEA7DF0BFCh
00007FFEA7EFD26D cmp r14d,dword ptr [rdx]
00007FFEA7EFD270 jl 00007FFEA7EFD316
00007FFEA7EFD276 mov edx,dword ptr [r13+8]
00007FFEA7EFD27A cmp ebp,edx
00007FFEA7EFD27C jae 00007FFEA7EFD388
00007FFEA7EFD282 lea ecx,[rbp+0Fh]
00007FFEA7EFD285 cmp ecx,edx
00007FFEA7EFD287 jae 00007FFEA7EFD388
00007FFEA7EFD28D movups xmm0,xmmword ptr [r13+rbp+10h]
214: {
215: var byte0Equals = Vector.Equals(new Vector<byte>(array, index), byte0Vector);
00007FFEA7EFD293 movups xmm1,xmmword ptr [rsi]
00007FFEA7EFD296 pcmpeqb xmm0,xmm1
00007FFEA7EFD29A movaps xmmword ptr [rsp+30h],xmm0
216:
217: if (byte0Equals.Equals(Vector<byte>.Zero))
00007FFEA7EFD29F movaps xmm0,xmmword ptr [rsp+30h]
00007FFEA7EFD2A4 pxor xmm1,xmm1
00007FFEA7EFD2A8 movaps xmm2,xmm0
00007FFEA7EFD2AB pcmpeqd xmm2,xmm1
00007FFEA7EFD2AF pshufd xmm3,xmm2,4Eh
00007FFEA7EFD2B4 andps xmm2,xmm3
00007FFEA7EFD2B7 pshufd xmm3,xmm2,1
00007FFEA7EFD2BC pand xmm2,xmm3
00007FFEA7EFD2C0 movd edx,xmm2
00007FFEA7EFD2C4 cmp edx,0FFFFFFFFh
00007FFEA7EFD2CA sete dl
00007FFEA7EFD2CD movzx edx,dl
00007FFEA7EFD2D0 test edx,edx
00007FFEA7EFD2D2 je 00007FFEA7EFD2E8
218: {
219: following -= _vectorSpan;
00007FFEA7EFD2D4 mov rdx,7FFEA7DF0BFCh
00007FFEA7EFD2DE sub r14d,dword ptr [rdx]
220: index += _vectorSpan;
00007FFEA7EFD2E1 add ebp,dword ptr [rdx]
00007FFEA7EFD2E3 jmp 00007FFEA7EFD37A
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benaadams After initialize is done, later compiled functions will have hardcoded 10h.
Because you already check on IsHardwareAccelerated you can safely remove _vectorSpan and replace it with Vector.Count
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the section "Statics" in http://joeduffyblog.com/2015/12/19/safe-native-code/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/dotnet/coreclr/issues/1193#issuecomment-118573955
They should be still treated as JIT time constants in methods that are JITed after the static constructor has run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, not sure I understand what the optimal approach is dotnet/roslyn#4448
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MemoryPoolIterator2Extensions.GetKnownString
also has this problem with BitConverter.IsLittleEndian
285: // This optimization only works on little endian environments (for now).
286: if (!BitConverter.IsLittleEndian)
00007FFEA7EFEBB9 mov rcx,7FFEA7B85380h
00007FFEA7EFEBC3 mov edx,3
00007FFEA7EFEBC8 call 00007FFF079EBC50
00007FFEA7EFEBCD mov rax,7FFEA7B853D2h
00007FFEA7EFEBD7 movzx eax,byte ptr [rax]
00007FFEA7EFEBDA test eax,eax
00007FFEA7EFEBDC jne 00007FFEA7EFEBE8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
K, new change seems to resolve it VectorSpan
is a const 10h
211: if (following >= Constants.VectorSpan)
00007FFE8A62C137 cmp ebx,10h
00007FFE8A62C13A jl 00007FFE8A62C1D4
00007FFE8A62C140 mov ecx,dword ptr [r15+8]
00007FFE8A62C144 cmp edi,ecx
00007FFE8A62C146 jae 00007FFE8A62C23D
00007FFE8A62C14C lea r8d,[rdi+0Fh]
00007FFE8A62C150 cmp r8d,ecx
00007FFE8A62C153 jae 00007FFE8A62C23D
00007FFE8A62C159 movups xmm0,xmmword ptr [r15+rdi+10h]
Also the jit now removes the BitConverter.IsLittleEndian
test and static initialization gone with it.
283: knownString = null;
00007FFE8A62D6E2 push rbx
00007FFE8A62D6E3 sub rsp,20h
00007FFE8A62D6E7 mov rsi,rcx
00007FFE8A62D6EA mov rdi,rdx
00007FFE8A62D6ED mov rbx,r8
00007FFE8A62D6F0 mov rcx,7FFE8A720B28h
00007FFE8A62D6FA mov edx,22h
00007FFE8A62D6FF call 00007FFEEA30BC50
00007FFE8A62D704 xor ecx,ecx
00007FFE8A62D706 mov qword ptr [rbx],rcx
289: }
290:
291: var inputLength = begin.GetLength(end);
00007FFE8A62D709 mov rcx,rsi
00007FFE8A62D70C mov rdx,rdi
00007FFE8A62D70F call 00007FFE8A5F95D0
00007FFE8A62D714 mov edi,eax
Same issue with Any So I've moved the bytes to Constants also and added a pre-access init in Constants.
|
@@ -20,5 +23,18 @@ internal class Constants | |||
/// for info on the format. | |||
/// </summary> | |||
public const string RFC1123DateFormat = "r"; | |||
|
|||
public readonly static int VectorSpan = Vector<byte>.Count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector<byte>.Count
should Jit to const when intrinsic; not sure when Vector.IsHardwareAccelerated == false
; as the backing static currently isn't readonly
(https://github.com/dotnet/corefx/issues/5152) however we know readonly static int
does Jit to const; otherwise so intermediate readonly static
for that.
Numbers Dev:
benaadams/memorypool-seek-merged:
benaadams/memorypool-seek-partial-merged (first 2 commits):
|
So +4% and +5% The last commit is working round a bug in the jit; added coreclr issue https://github.com/dotnet/coreclr/issues/2526 and comment on the |
cfa19d5
to
8e3bf56
Compare
Again all of the the |
8e3bf56
to
9a3af36
Compare
9a3af36
to
1664ea3
Compare
Removed the other ones from this |
From #519
Updated #514
Static vectors by ref rather than by copy
Faster path when length <
Vector<byte>.Count
Skip Vector path when
Vector.IsHardwareAccelerated != true
unless in debug/test (e.g. linux and x86)Linux hopefully resolved in RC2? https://github.com/dotnet/coreclr/issues/983
Common headers have common offsets
Also has Faster output header handling #528 in it :-/ due to the generated code file.
Resolves #512
Resolves #515