This repository was archived by the owner on Dec 18, 2018. It is now read-only.

Much faster Seek perf + non vector path #524

Closed
wants to merge 3 commits into from

benaadams
Contributor

From #519
Updated #514

Static vectors by ref rather than by copy
Faster path when length < Vector<byte>.Count
Skip Vector path when Vector.IsHardwareAccelerated != true unless in debug/test (e.g. linux and x86)
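The second and third points above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `SeekSketch` and `IndexOf` are hypothetical names, and the sketch searches a plain `byte[]` rather than the memory pool blocks.

```csharp
using System;
using System.Numerics;

static class SeekSketch
{
    // Returns the index of the first occurrence of `value` in `array`, or -1.
    public static int IndexOf(byte[] array, byte value)
    {
        var index = 0;
        var remaining = array.Length;

        // Only take the vector path when the JIT emits real SIMD instructions.
        if (Vector.IsHardwareAccelerated)
        {
            var needle = new Vector<byte>(value);
            while (remaining >= Vector<byte>.Count)
            {
                var matches = Vector.Equals(new Vector<byte>(array, index), needle);
                if (!matches.Equals(Vector<byte>.Zero))
                {
                    break; // A match is somewhere in this block; locate it below.
                }
                index += Vector<byte>.Count;
                remaining -= Vector<byte>.Count;
            }
        }

        // Scalar path: short inputs, the tail, or the block containing the match.
        for (; index < array.Length; index++)
        {
            if (array[index] == value)
            {
                return index;
            }
        }
        return -1;
    }
}
```

Inputs shorter than `Vector<byte>.Count` never enter the vector loop, so they pay only for the scalar scan.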

Linux hopefully resolved in RC2? https://github.com/dotnet/coreclr/issues/983

Common headers have common offsets

This also contains Faster output header handling #528 :-/ because of the generated code file.

Resolves #512
Resolves #515

private static readonly Vector<byte> _vectorColons = new Vector<byte>((byte)':');
private static readonly Vector<byte> _vectorSpaces = new Vector<byte>((byte)' ');
private static readonly Vector<byte> _vectorQuestionMarks = new Vector<byte>((byte)'?');
private static readonly Vector<byte> _vectorPercentages = new Vector<byte>((byte)'%');
Contributor Author

This change is needed because a readonly field can't be passed by ref :(
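As a rough sketch of the constraint described here: a static vector that is handed to a helper by `ref` has to be a non-readonly field. `VectorByRef`, `ContainsAny` and `BlockHasColon` are hypothetical names for illustration only.

```csharp
using System.Numerics;

static class VectorByRef
{
    // Not readonly: per the comment above, a readonly static could not be passed by ref.
    private static Vector<byte> _vectorColons = new Vector<byte>((byte)':');

    // Hypothetical helper: takes the needle by ref so the vector is not copied per call.
    private static bool ContainsAny(byte[] block, int offset, ref Vector<byte> needle)
    {
        var matches = Vector.Equals(new Vector<byte>(block, offset), needle);
        return !matches.Equals(Vector<byte>.Zero);
    }

    // `block` must hold at least Vector<byte>.Count bytes starting at `offset`.
    public static bool BlockHasColon(byte[] block, int offset)
    {
        return ContainsAny(block, offset, ref _vectorColons);
    }
}
```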

@benaadams
Contributor Author

There is a further combining change that will remove all the fixed blocks (as buffers are already fixed) - but this stands alone as is.

@benaadams benaadams force-pushed the memorypool.seek branch 2 times, most recently from 3a501ec to 0578baf Compare December 29, 2015 16:14
@benaadams benaadams changed the title Seek perf + faster non vector path Much faster Seek perf + non vector path Dec 29, 2015
@halter73
Member

I want to see the individual perf impact of the new SeekCommonHeader code. Really cool, but a little complicated if there isn't a measurable difference in perf.

@benaadams
Contributor Author

Are you OK with using what's here for that?
E.g. you can git reset --hard HEAD~2 to back out the SeekCommonHeader change for a baseline.

{
private readonly static int _vectorSpan = Vector<byte>.Count;
Member

Is this just a style change, or is there a perf impact to this?

Contributor Author

I don't trust it to be an intrinsic const when it's not hardware accelerated, as it goes through a static property to a non-readonly static. Whereas I know that the readonly static int will be a jitted const.

if (byte0Equals.Equals(Vector<byte>.Zero))
#endif
if (following >= _vectorSpan)
Contributor

In coreclr x64 release version this code is generated (changed line 213 to compare the generated code).

   213:                         if (following >= _vectorSpan && following >= Vector<byte>.Count)
000007FE16EDBE44  mov         rcx,7FE16F59858h  
000007FE16EDBE4E  mov         edx,21h  
000007FE16EDBE53  call        000007FE76BFBC50  
000007FE16EDBE58  cmp         r14d,dword ptr [7FE16F5992Ch]  
000007FE16EDBE5F  jl          000007FE16EDBF10  
000007FE16EDBE65  cmp         r14d,10h  
000007FE16EDBE69  jl          000007FE16EDBF10  

The first three instructions are the lazy initialization of _vectorSpan. Is that because this is a struct?

Contributor Author

Hmmm (line number down one as I commented out the Vector.IsHardwareAccelerated check)

x86

215:                         if (following >= _vectorSpan)
0569545A  cmp         dword ptr [ebp-10h],10h 
0569545E  jl          0569554D 
   216:                         {

Contributor Author

But, as you say, on x64 coreclr:

213:                         if (following >= _vectorSpan)
00007FFEA7EFD24F  mov         rcx,7FFEA7DF0B28h 
00007FFEA7EFD259  mov         edx,21h 
00007FFEA7EFD25E  call        00007FFF079EBC50 
00007FFEA7EFD263  mov         rdx,7FFEA7DF0BFCh 
00007FFEA7EFD26D  cmp         r14d,dword ptr [rdx] 
00007FFEA7EFD270  jl          00007FFEA7EFD316 
00007FFEA7EFD276  mov         edx,dword ptr [r13+8] 
00007FFEA7EFD27A  cmp         ebp,edx 
00007FFEA7EFD27C  jae         00007FFEA7EFD388 
00007FFEA7EFD282  lea         ecx,[rbp+0Fh] 
00007FFEA7EFD285  cmp         ecx,edx 
00007FFEA7EFD287  jae         00007FFEA7EFD388 
00007FFEA7EFD28D  movups      xmm0,xmmword ptr [r13+rbp+10h] 
   214:                         {
   215:                             var byte0Equals = Vector.Equals(new Vector<byte>(array, index), byte0Vector);
00007FFEA7EFD293  movups      xmm1,xmmword ptr [rsi] 
00007FFEA7EFD296  pcmpeqb     xmm0,xmm1 
00007FFEA7EFD29A  movaps      xmmword ptr [rsp+30h],xmm0 
   216: 
   217:                             if (byte0Equals.Equals(Vector<byte>.Zero))
00007FFEA7EFD29F  movaps      xmm0,xmmword ptr [rsp+30h] 
00007FFEA7EFD2A4  pxor        xmm1,xmm1 
00007FFEA7EFD2A8  movaps      xmm2,xmm0 
00007FFEA7EFD2AB  pcmpeqd     xmm2,xmm1 
00007FFEA7EFD2AF  pshufd      xmm3,xmm2,4Eh 
00007FFEA7EFD2B4  andps       xmm2,xmm3 
00007FFEA7EFD2B7  pshufd      xmm3,xmm2,1 
00007FFEA7EFD2BC  pand        xmm2,xmm3 
00007FFEA7EFD2C0  movd        edx,xmm2 
00007FFEA7EFD2C4  cmp         edx,0FFFFFFFFh 
00007FFEA7EFD2CA  sete        dl 
00007FFEA7EFD2CD  movzx       edx,dl 
00007FFEA7EFD2D0  test        edx,edx 
00007FFEA7EFD2D2  je          00007FFEA7EFD2E8 
   218:                             {
   219:                                 following -= _vectorSpan;
00007FFEA7EFD2D4  mov         rdx,7FFEA7DF0BFCh 
00007FFEA7EFD2DE  sub         r14d,dword ptr [rdx] 
   220:                                 index += _vectorSpan;
00007FFEA7EFD2E1  add         ebp,dword ptr [rdx] 
00007FFEA7EFD2E3  jmp         00007FFEA7EFD37A 

Contributor

@benaadams After the initialization is done, functions compiled later will have a hardcoded 10h.
Because you already check IsHardwareAccelerated, you can safely remove _vectorSpan and replace it with Vector<byte>.Count.

Contributor Author

https://github.com/dotnet/coreclr/issues/1193#issuecomment-118573955

They should be still treated as JIT time constants in methods that are JITed after the static constructor has run.

Contributor Author

Hmm, yeah, I'm not sure I understand what the optimal approach is; see dotnet/roslyn#4448.

Contributor Author

MemoryPoolIterator2Extensions.GetKnownString also has this problem with BitConverter.IsLittleEndian:

   285:             // This optimization only works on little endian environments (for now).
   286:             if (!BitConverter.IsLittleEndian)
00007FFEA7EFEBB9  mov         rcx,7FFEA7B85380h  
00007FFEA7EFEBC3  mov         edx,3  
00007FFEA7EFEBC8  call        00007FFF079EBC50  
00007FFEA7EFEBCD  mov         rax,7FFEA7B853D2h  
00007FFEA7EFEBD7  movzx       eax,byte ptr [rax]  
00007FFEA7EFEBDA  test        eax,eax  
00007FFEA7EFEBDC  jne         00007FFEA7EFEBE8  

Contributor Author

OK, the new change seems to resolve it: VectorSpan is now a constant 10h.

   211:                         if (following >= Constants.VectorSpan)
00007FFE8A62C137  cmp         ebx,10h  
00007FFE8A62C13A  jl          00007FFE8A62C1D4  
00007FFE8A62C140  mov         ecx,dword ptr [r15+8]  
00007FFE8A62C144  cmp         edi,ecx  
00007FFE8A62C146  jae         00007FFE8A62C23D  
00007FFE8A62C14C  lea         r8d,[rdi+0Fh]  
00007FFE8A62C150  cmp         r8d,ecx  
00007FFE8A62C153  jae         00007FFE8A62C23D  
00007FFE8A62C159  movups      xmm0,xmmword ptr [r15+rdi+10h]  

The JIT now also removes the BitConverter.IsLittleEndian test, and the static initialization is gone with it.

283:             knownString = null;
00007FFE8A62D6E2  push        rbx  
00007FFE8A62D6E3  sub         rsp,20h  
00007FFE8A62D6E7  mov         rsi,rcx  
00007FFE8A62D6EA  mov         rdi,rdx  
00007FFE8A62D6ED  mov         rbx,r8  
00007FFE8A62D6F0  mov         rcx,7FFE8A720B28h  
00007FFE8A62D6FA  mov         edx,22h  
00007FFE8A62D6FF  call        00007FFEEA30BC50  
00007FFE8A62D704  xor         ecx,ecx  
00007FFE8A62D706  mov         qword ptr [rbx],rcx  
   289:             }
   290: 
   291:             var inputLength = begin.GetLength(end);
00007FFE8A62D709  mov         rcx,rsi  
00007FFE8A62D70C  mov         rdx,rdi  
00007FFE8A62D70F  call        00007FFE8A5F95D0  
00007FFE8A62D714  mov         edi,eax  

@benaadams
Contributor Author

Same issue with ReasonPhrases.ToStatusBytes and Frame header bytes.

Any static readonly field that is initialised by a function call needs to be accessed before the method that uses it is jitted; otherwise the JIT emits a lazy-init check before each use.

So I've also moved the bytes to Constants and added a pre-access init in Constants.
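A minimal sketch of the pattern described above, assuming a hypothetical `ConstantsSketch` class and `Touch` method (the byte contents are illustrative, not the values used in this PR). Whether the lazy-init calls actually disappear depends on the runtime and JIT version, so treat this as an illustration of the idea only.

```csharp
using System.Numerics;
using System.Text;

internal static class ConstantsSketch
{
    // Readonly statics initialised by function calls.
    public static readonly int VectorSpan = Vector<byte>.Count;
    public static readonly byte[] HeaderBytesStatus200 = Encoding.ASCII.GetBytes("HTTP/1.1 200 OK");

    // Hypothetical pre-access hook: reading the static fields forces the static
    // initializers to run. Call it once at startup, before hot methods are jitted.
    public static int Touch()
    {
        return VectorSpan + HeaderBytesStatus200.Length;
    }
}
```

Calling `ConstantsSketch.Touch()` early means the fields are already initialised by the time the methods that read them are compiled.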

ReasonPhrases.ToStatusBytes before:

    83:                     case 102:
    84:                         return _bytesStatus102;
00007FFEA7A8DF42  mov         rcx,7FFEA7970B28h  
00007FFEA7A8DF4C  mov         edx,3Fh  
00007FFEA7A8DF51  call        00007FFF0757BC50  
00007FFEA7A8DF56  mov         rcx,20AED0D7BB0h  
00007FFEA7A8DF60  mov         rax,qword ptr [rcx]  
00007FFEA7A8DF63  jmp         00007FFEA7A8E749  
    85:                     case 200:
    86:                         return _bytesStatus200;
00007FFEA7A8DF68  mov         rcx,7FFEA7970B28h  
00007FFEA7A8DF72  mov         edx,3Fh  
00007FFEA7A8DF77  call        00007FFF0757BC50  
00007FFEA7A8DF7C  mov         rcx,20AED0D7BB8h  
00007FFEA7A8DF86  mov         rax,qword ptr [rcx]  
00007FFEA7A8DF89  jmp         00007FFEA7A8E749  
    87:                     case 201:
    88:                         return _bytesStatus201;
00007FFEA7A8DF8E  mov         rcx,7FFEA7970B28h  
00007FFEA7A8DF98  mov         edx,3Fh  
00007FFEA7A8DF9D  call        00007FFF0757BC50  
00007FFEA7A8DFA2  mov         rcx,20AED0D7BC0h  
00007FFEA7A8DFAC  mov         rax,qword ptr [rcx]  
00007FFEA7A8DFAF  jmp         00007FFEA7A8E749  

ReasonPhrases.ToStatusBytes after:

    31:                     case 102:
    32:                         return Constants.HeaderBytesStatus102;
00007FFE9ED7E21A  mov         rax,1A2D68F7A48h  
00007FFE9ED7E224  mov         rax,qword ptr [rax]  
00007FFE9ED7E227  jmp         00007FFE9ED7E639  
    33:                     case 200:
    34:                         return Constants.HeaderBytesStatus200;
00007FFE9ED7E22C  mov         rax,1A2D68F7A50h  
00007FFE9ED7E236  mov         rax,qword ptr [rax]  
00007FFE9ED7E239  jmp         00007FFE9ED7E639  
    35:                     case 201:
    36:                         return Constants.HeaderBytesStatus201;
00007FFE9ED7E23E  mov         rax,1A2D68F7A58h  
00007FFE9ED7E248  mov         rax,qword ptr [rax]  
00007FFE9ED7E24B  jmp         00007FFE9ED7E639  

@@ -20,5 +23,18 @@ internal class Constants
/// for info on the format.
/// </summary>
public const string RFC1123DateFormat = "r";

public readonly static int VectorSpan = Vector<byte>.Count;
Contributor Author

Vector<byte>.Count should JIT to a const when it is an intrinsic. I'm not sure it does when Vector.IsHardwareAccelerated == false, as the backing static currently isn't readonly (https://github.com/dotnet/corefx/issues/5152). However, we know a readonly static int does JIT to a const, so this is an intermediate readonly static for that case.

@halter73
Member

halter73 commented Jan 5, 2016

Numbers

Dev:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.92ms    6.29ms 267.92ms   98.66%
    Req/Sec    33.40k     2.38k   49.29k    73.11%
  10729980 requests in 10.10s, 1.32GB read
Requests/sec: 1062409.92
Transfer/sec:    133.74MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.05ms    8.17ms 287.47ms   97.08%
    Req/Sec    33.43k     2.35k   44.71k    69.68%
  10742183 requests in 10.10s, 1.32GB read
Requests/sec: 1063577.53
Transfer/sec:    133.89MB


Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.74ms    4.83ms 163.03ms   94.53%
    Req/Sec    33.48k     2.26k   53.69k    70.99%
  10748985 requests in 10.10s, 1.32GB read
Requests/sec: 1064344.57
Transfer/sec:    133.99MB

benaadams/memorypool-seek-merged:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.55ms    9.60ms 161.03ms   96.67%
    Req/Sec    35.19k     2.79k   64.81k    94.34%
  11268259 requests in 10.10s, 1.39GB read
Requests/sec: 1115674.29
Transfer/sec:    140.45MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.79ms   11.78ms 314.81ms   95.17%
    Req/Sec    35.27k     2.05k   51.67k    84.46%
  11314668 requests in 10.10s, 1.39GB read
Requests/sec: 1120369.65
Transfer/sec:    141.04MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.99ms    7.21ms 162.04ms   94.54%
    Req/Sec    35.18k     1.62k   48.48k    83.29%
  11296570 requests in 10.10s, 1.39GB read
Requests/sec: 1118491.44
Transfer/sec:    140.80MB

benaadams/memorypool-seek-partial-merged (first 2 commits):

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.17ms   21.54ms 373.42ms   96.87%
    Req/Sec    34.64k     1.97k   49.03k    76.42%
  11114611 requests in 10.10s, 1.37GB read
Requests/sec: 1100531.28
Transfer/sec:    138.54MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.48ms   14.79ms 413.11ms   96.52%
    Req/Sec    34.74k     1.83k   57.77k    81.08%
  11151647 requests in 10.09s, 1.37GB read
Requests/sec: 1104677.01
Transfer/sec:    139.06MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     3.79ms   12.38ms 266.26ms   95.56%
    Req/Sec    34.65k     1.95k   51.56k    82.17%
  11118899 requests in 10.10s, 1.37GB read
Requests/sec: 1100971.02
Transfer/sec:    138.60MB

@benaadams
Contributor Author

So roughly +4% for the first two commits and +5% with all three, compared with dev.
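For reference, the percentages can be reproduced by averaging the three Requests/sec figures per branch from the wrk runs above. `BenchmarkGain` and its members are hypothetical names used only for this calculation.

```csharp
using System;
using System.Linq;

static class BenchmarkGain
{
    // Mean requests/sec over a set of runs.
    public static double Mean(params double[] runs) => runs.Average();

    // Relative gain of `candidate` over `baseline`, in percent.
    public static double GainPercent(double baseline, double candidate)
        => (candidate / baseline - 1.0) * 100.0;

    // Requests/sec values copied from the wrk output above.
    public static readonly double Dev = Mean(1062409.92, 1063577.53, 1064344.57);
    public static readonly double Merged = Mean(1115674.29, 1120369.65, 1118491.44);
    public static readonly double PartialMerged = Mean(1100531.28, 1104677.01, 1100971.02);
}
```

`GainPercent(Dev, PartialMerged)` comes out at about 3.6 and `GainPercent(Dev, Merged)` at about 5.1.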

The last commit works around a bug in the JIT; I've added coreclr issue https://github.com/dotnet/coreclr/issues/2526 and a comment on the InitalizeJitConstants function. When that is resolved the function can be removed.

@benaadams
Contributor Author

Again, all of the fixed blocks in this can go away after #525, plus some changes to memory blocks (to expose the pointer).

@halter73
Member

halter73 commented Jan 5, 2016

Can you move 1eda517 4c39800 and a5d1d06 into their own PR? I think these are ready to merge.

That or do the inverse. Thanks!

@benaadams
Contributor Author

Removed the other ones from this

@halter73 halter73 closed this Jan 5, 2016
@benaadams benaadams deleted the memorypool.seek branch May 10, 2016 10:59