-
Notifications
You must be signed in to change notification settings - Fork 2.6k
String.StartsWith Ordinal optimization pt 2 #2667
Conversation
The overhead of calling FCall is same as calling another managed method. It is not where your improvements are coming from. Your improvements are coming from:
|
{ | ||
var byteCount = startsWith.Length << 1; | ||
// value.Length verified to be less than or equal to this.Length by calling function | ||
if (byteCount > 512) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am reading your data correctly, the fcall is always slower. Why to call it for larger strings?
This change will need to be ported to CoreRT. You may want to take a look what has been done there, so that the two are not diverging. |
} | ||
|
||
fixed (char* cpString = this) | ||
fixed (char* cpStartsWith = startsWith) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above two lines can be:
fixed (char* cpString = &m_firstChar)
fixed (char* cpStartsWith = &startsWith.m_firstChar)
See #2636
Was assuming there was a cost in the prolog and epilog in storing and restoring the registers from the Will rework the benchmark to include the extra validation/preamble that happens in |
Unrolling to 64 chars in a startswith comparison feels like more of an edge case than what would be common. I can only really think of url compares that would really benefit from that. I'm curious if a specialised compare would be better for that that worked backwards from the end where it is more likely to be different. A match would probably be slower though. |
Need to look into this further. |
Mind if I take a look at this @benaadams ? It looks like a variation of |
@bbowyersmyth sure; I did some changes to exactly mirror the calling function (validation, early comparisions, full case etc); changed to gotos, statements to if rather than while, added an extra start char* == to go aligned, removed the large unroll (changed second to be while loop)... and... Saw the start-up perf drop off so the original and the changed were more or less equal and it behaved as @jkotas's opening statement, with the newer loop unrolling pulling ahead as it got longer - but otherwise the same. I'm not currently sure why though; the validation bit seems to change the function call from a regular pass through to one that pushes 8+ registers to stack and then pops the same registers back at end. Didn't have time to look deeper. The loop unrolling and gotos seems to produce nice clean and fast asm though - for that portion at least; unless the use of goto triggers some kind of stack canary / protection and that's what I'm seeing? As does the call to native in the original?. |
Calling into native comes with some overhead, which is especially significant for short argument strings. (From aspnet/HttpAbstractions#521 follow up on #1632 for rest of string)
This keeps
String.StartsWith(string, StringComparison.Ordinal)
in managed code for arguments < 512 chars; also loop unrolls uses wider data types etc.As is managed code the improvements should apply equally to all platforms. Have
#ifdef
it forFEATURE_CORECLR
only.Results
This is for worse case comparison where the strings match.
NativeStartsWithOrdinal is current, ManagedStartsWithOrdinal is change x64
Graphed
Yellow axis is assuming each extra char has fixed cost set as the cost for length 16
ManagedStartsWithOrdinal
.X-axis tick per test
X-axis uniform
Details
Function RyuJit x64 asm https://gist.github.com/benaadams/7b17a4171ec7e9b81bbe
Verify and Benchmark: https://gist.github.com/benaadams/792d2734ef569d45be42
Break vs return false https://gist.github.com/adamsitnik/9d4f0107bdc15a802bbf#file-x86jit_break_vs_return_false