String.StartsWith Ordinal optimization pt 2 #2667

benaadams · 2016-01-14T22:51:36Z

Calling into native comes with some overhead, which is especially significant for short argument strings. (From aspnet/HttpAbstractions#521; follow-up to #1632 for the rest of the string.)

This keeps String.StartsWith(string, StringComparison.Ordinal) in managed code for arguments < 512 chars; it also unrolls the loop, uses wider data types, etc.

As it is managed code, the improvements should apply equally to all platforms. I have #ifdef'd it for FEATURE_CORECLR only.
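As a rough illustration of what "wider data types" plus unrolling can look like, here is a minimal sketch. It is not the code from this PR (which uses `fixed` pointers inside mscorlib); the class and method names are hypothetical, and it assumes a modern runtime where `ReadOnlySpan<char>` and `MemoryMarshal.Cast` are available. Both spans start at the first char of their string, so they share the same alignment.

```csharp
using System;
using System.Runtime.InteropServices;

public static class OrdinalPrefixSketch
{
    // Hypothetical helper (not the PR's implementation): returns true when
    // `text` begins with `prefix`, comparing raw UTF-16 code units.
    public static bool StartsWithOrdinal(string text, string prefix)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (prefix == null) throw new ArgumentNullException(nameof(prefix));
        if (prefix.Length > text.Length) return false;

        ReadOnlySpan<char> a = text.AsSpan(0, prefix.Length);
        ReadOnlySpan<char> b = prefix.AsSpan();

        // Wider data type: view the chars as 64-bit words, four chars per word.
        ReadOnlySpan<ulong> wa = MemoryMarshal.Cast<char, ulong>(a);
        ReadOnlySpan<ulong> wb = MemoryMarshal.Cast<char, ulong>(b);

        int i = 0;
        // Manually unrolled by two: eight chars per iteration.
        for (; i + 2 <= wa.Length; i += 2)
        {
            if (wa[i] != wb[i] || wa[i + 1] != wb[i + 1]) return false;
        }
        // At most one whole word is left over.
        for (; i < wa.Length; i++)
        {
            if (wa[i] != wb[i]) return false;
        }
        // Tail: the last 0 to 3 chars that do not fill a word.
        for (int c = wa.Length * 4; c < a.Length; c++)
        {
            if (a[c] != b[c]) return false;
        }
        return true;
    }
}
```

For example, `OrdinalPrefixSketch.StartsWithOrdinal("/api/values/12345", "/api/values")` compares one unrolled pair of words (eight chars) and then three tail chars.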

Results

This is for the worst-case comparison, where the strings match.

NativeStartsWithOrdinal is the current implementation; ManagedStartsWithOrdinal is this change. Measured on x64.

Method	Len	AvrTime	StdDev	op/s	Improve
ManagedStartsWithOrdinal	2	4.0052ns	0.0671ns	249,744,319.70	+153.1%
NativeStartsWithOrdinal	2	10.1366ns	0.1697ns	98,679,099.50	-
ManagedStartsWithOrdinal	3	4.0089ns	0.0694ns	249,517,852.81	+174.6%
NativeStartsWithOrdinal	3	11.0067ns	0.1886ns	90,880,104.57	-
ManagedStartsWithOrdinal	4	4.1074ns	0.0219ns	243,466,747.38	+185.5%
NativeStartsWithOrdinal	4	11.7261ns	0.0369ns	85,280,783.99	-
ManagedStartsWithOrdinal	5	4.3801ns	1.4955ns	238,078,604.60	+185.9%
NativeStartsWithOrdinal	5	12.0108ns	0.2094ns	83,282,838.04	-
ManagedStartsWithOrdinal	6	4.3481ns	0.0785ns	230,057,624.51	+149.5%
NativeStartsWithOrdinal	6	10.8458ns	0.0382ns	92,202,381.48	-
ManagedStartsWithOrdinal	7	4.2902ns	0.0756ns	233,158,343.59	+173.7%
NativeStartsWithOrdinal	7	11.7399ns	0.0506ns	85,181,018.69	-
ManagedStartsWithOrdinal	8	4.7742ns	0.1198ns	209,588,833.02	+152.6%
NativeStartsWithOrdinal	8	12.0580ns	0.2386ns	82,963,662.20	-
ManagedStartsWithOrdinal	9	4.7172ns	0.0372ns	212,003,612.92	+170.1%
NativeStartsWithOrdinal	9	12.7465ns	0.2633ns	78,485,475.17	-
ManagedStartsWithOrdinal	10	5.0009ns	0.1068ns	200,055,190.98	+128.8%
NativeStartsWithOrdinal	10	11.4378ns	0.0517ns	87,430,778.79	-
ManagedStartsWithOrdinal	15	5.5410ns	2.0315ns	189,144,906.11	+143.8%
NativeStartsWithOrdinal	15	12.8905ns	0.0076ns	77,576,243.92	-
ManagedStartsWithOrdinal	16	6.7168ns	1.2771ns	151,580,392.59	+104.5%
NativeStartsWithOrdinal	16	13.4929ns	0.0615ns	74,114,322.99	-
ManagedStartsWithOrdinal	17	6.7303ns	1.6666ns	152,640,298.48	+115.2%
NativeStartsWithOrdinal	17	14.1002ns	0.0610ns	70,922,255.44	-
ManagedStartsWithOrdinal	23	7.3643ns	2.3584ns	142,538,976.63	+98.2%
NativeStartsWithOrdinal	23	13.9077ns	0.2307ns	71,922,061.67	-
ManagedStartsWithOrdinal	24	7.1233ns	0.1394ns	140,437,643.35	+103.2%
NativeStartsWithOrdinal	24	14.4703ns	0.2629ns	69,129,538.57	-
ManagedStartsWithOrdinal	25	7.1311ns	0.1326ns	140,277,720.07	+111.3%
NativeStartsWithOrdinal	25	15.0710ns	0.2860ns	66,375,829.39	-
ManagedStartsWithOrdinal	31	7.5545ns	0.2379ns	132,489,110.68	+99.5%
NativeStartsWithOrdinal	31	15.0624ns	0.2548ns	66,409,197.41	-
ManagedStartsWithOrdinal	32	7.3628ns	0.1546ns	135,874,173.21	+115.0%
NativeStartsWithOrdinal	32	15.8253ns	0.0428ns	63,190,410.16	-
ManagedStartsWithOrdinal	33	7.3799ns	0.2234ns	135,615,415.21	+122.5%
NativeStartsWithOrdinal	33	16.4042ns	0.0081ns	60,960,104.45	-
ManagedStartsWithOrdinal	39	7.7385ns	0.2725ns	129,362,550.74	+107.3%
NativeStartsWithOrdinal	39	16.0278ns	0.2819ns	62,410,223.47	-
ManagedStartsWithOrdinal	40	8.5779ns	2.1443ns	121,025,780.58	+105.6%
NativeStartsWithOrdinal	40	16.9862ns	0.0436ns	58,871,649.71	-
ManagedStartsWithOrdinal	41	8.7882ns	2.4842ns	119,135,835.72	+106.9%
NativeStartsWithOrdinal	41	17.3736ns	0.2933ns	57,574,525.30	-
ManagedStartsWithOrdinal	47	8.4154ns	0.2685ns	118,931,636.56	+109.0%
NativeStartsWithOrdinal	47	17.5703ns	0.0295ns	56,914,274.70	-
ManagedStartsWithOrdinal	48	9.4871ns	2.9241ns	112,060,274.14	+103.7%
NativeStartsWithOrdinal	48	18.1746ns	0.0512ns	55,022,369.85	-
ManagedStartsWithOrdinal	49	8.8741ns	2.2628ns	117,111,320.10	+117.0%
NativeStartsWithOrdinal	49	18.5388ns	0.3370ns	53,958,434.56	-
ManagedStartsWithOrdinal	55	8.5552ns	0.3731ns	117,082,289.20	+117.1%
NativeStartsWithOrdinal	55	18.5490ns	0.3260ns	53,927,654.25	-
ManagedStartsWithOrdinal	56	8.7798ns	0.1277ns	113,921,153.05	+117.9%
NativeStartsWithOrdinal	56	19.1315ns	0.3144ns	52,283,496.83	-
ManagedStartsWithOrdinal	57	8.9098ns	0.2010ns	112,289,539.82	+121.0%
NativeStartsWithOrdinal	57	19.6849ns	0.3403ns	50,815,327.94	-
ManagedStartsWithOrdinal	63	8.9971ns	0.0621ns	111,152,591.31	+118.8%
NativeStartsWithOrdinal	63	19.6945ns	0.3447ns	50,790,785.29	-
ManagedStartsWithOrdinal	64	8.6365ns	0.1763ns	115,834,472.25	+137.6%
NativeStartsWithOrdinal	64	20.5101ns	0.0152ns	48,756,472.34	-
ManagedStartsWithOrdinal	65	8.8134ns	0.1938ns	113,515,687.87	+133.7%
NativeStartsWithOrdinal	65	20.5937ns	0.3608ns	48,572,691.61	-
ManagedStartsWithOrdinal	95	11.9802ns	1.7585ns	84,661,262.37	+103.2%
NativeStartsWithOrdinal	95	24.0109ns	0.4561ns	41,662,022.13	-
ManagedStartsWithOrdinal	96	11.9479ns	1.7785ns	85,031,711.79	+141.5%
NativeStartsWithOrdinal	96	28.5653ns	2.1210ns	35,205,870.74	-
ManagedStartsWithOrdinal	97	12.2576ns	1.8786ns	82,943,412.75	+114.0%
NativeStartsWithOrdinal	97	25.8010ns	0.0674ns	38,758,450.70	-
ManagedStartsWithOrdinal	100	13.0203ns	2.2968ns	78,699,075.76	+98.2%
NativeStartsWithOrdinal	100	25.1867ns	0.4323ns	39,714,642.76	-
ManagedStartsWithOrdinal	127	14.6869ns	1.5608ns	68,684,975.90	+101.3%
NativeStartsWithOrdinal	127	29.3134ns	0.0855ns	34,114,370.89	-
ManagedStartsWithOrdinal	128	14.5981ns	1.4783ns	69,070,179.67	+106.5%
NativeStartsWithOrdinal	128	29.9038ns	0.0327ns	33,440,632.27	-
ManagedStartsWithOrdinal	129	15.1645ns	1.6051ns	66,551,676.86	+100.5%
NativeStartsWithOrdinal	129	30.1423ns	0.5061ns	33,185,159.81	-
ManagedStartsWithOrdinal	255	25.1756ns	1.0345ns	39,783,333.21	+107.1%
NativeStartsWithOrdinal	255	52.0581ns	0.1860ns	19,209,537.11	-
ManagedStartsWithOrdinal	256	26.0471ns	1.7805ns	38,550,579.75	+99.5%
NativeStartsWithOrdinal	256	51.7690ns	0.9111ns	19,322,298.01	-
ManagedStartsWithOrdinal	257	26.0985ns	1.5864ns	38,445,624.87	+113.3%
NativeStartsWithOrdinal	257	55.4758ns	0.1344ns	18,025,969.55	-
ManagedStartsWithOrdinal	511	46.7212ns	2.0730ns	21,444,042.76	+71.9%
NativeStartsWithOrdinal	511	80.1409ns	0.0848ns	12,478,037.67	-
ManagedStartsWithOrdinal	512	47.2675ns	1.7022ns	21,182,853.86	+72.1%
NativeStartsWithOrdinal	512	81.2225ns	0.2300ns	12,311,950.20	-
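The Improve column appears to be derived from the op/s column as the managed throughput divided by the native throughput, minus one. The sketch below is an assumption about how the figure was computed, checked against a few rows of the table; the class and method names are hypothetical.

```csharp
using System;

public static class ImproveColumnSketch
{
    // Relative throughput gain of the managed version over the native one,
    // in percent. Assumed formula; it reproduces the table's Improve values.
    public static double ImprovePercent(double managedOpsPerSec, double nativeOpsPerSec)
    {
        return (managedOpsPerSec / nativeOpsPerSec - 1.0) * 100.0;
    }
}
```

With the Len 2 row, `ImproveColumnSketch.ImprovePercent(249744319.70, 98679099.50)` gives roughly 153.1, and with the Len 512 row, `ImprovePercent(21182853.86, 12311950.20)` gives roughly 72.1, matching the table.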

Graphed

The yellow axis assumes each extra char has a fixed cost, set from the cost measured for ManagedStartsWithOrdinal at length 16.

X-axis tick per test

X-axis uniform

Details

Function RyuJit x64 asm https://gist.github.com/benaadams/7b17a4171ec7e9b81bbe

Verify and Benchmark: https://gist.github.com/benaadams/792d2734ef569d45be42

Break vs return false https://gist.github.com/adamsitnik/9d4f0107bdc15a802bbf#file-x86jit_break_vs_return_false

jkotas · 2016-01-15T04:33:22Z

> Calling into native comes with some overhead

The overhead of calling an FCall is the same as calling another managed method. That is not where your improvements are coming from. Your improvements are coming from:

Avoiding redundant argument validation
More extensive manual loop unrolling than what's in the current native implementation

jkotas · 2016-01-15T04:35:31Z

src/mscorlib/src/System/String.cs

+        {
+            var byteCount = startsWith.Length << 1;
+            // value.Length verified to be less than or equal to this.Length by calling function
+            if (byteCount > 512)


If I am reading your data correctly, the fcall is always slower. Why call it for larger strings?

jkotas · 2016-01-15T04:58:50Z

This change will need to be ported to CoreRT. You may want to take a look at what has been done there, so that the two do not diverge.

jkotas · 2016-01-15T05:00:40Z

cc @bbowyersmyth

justinvp · 2016-01-15T07:46:21Z

src/mscorlib/src/System/String.cs

+            }
+
+            fixed (char* cpString = this)
+            fixed (char* cpStartsWith = startsWith)


The above two lines can be:

fixed (char* cpString = &m_firstChar)
fixed (char* cpStartsWith = &startsWith.m_firstChar)

See #2636

benaadams · 2016-01-15T08:06:37Z

> The overhead of calling FCall is same as calling another managed method.

I was assuming, from the BlockCopy testing, that there was a cost in the prolog and epilog for storing and restoring the registers that a managed call didn't necessarily pay: https://github.com/dotnet/coreclr/issues/2430#issuecomment-166594959

Will rework the benchmark to include the extra validation/preamble that happens in .StartsWith

bbowyersmyth · 2016-01-15T09:02:34Z

Unrolling to 64 chars in a StartsWith comparison feels like more of an edge case than the common one. I can only really think of URL compares that would benefit from that.

I'm curious whether a specialised compare that worked backwards from the end, where the strings are more likely to differ, would be better for that. A match would probably be slower though.
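To make the backwards idea concrete, here is a minimal sketch of a prefix check that starts at the last char of the prefix and walks towards the first. It is only an illustration of the suggestion above, with hypothetical names, and makes no claim about how it would perform against the forward loop.

```csharp
using System;

public static class BackwardPrefixSketch
{
    // Hypothetical helper: ordinal prefix check that compares from the end
    // of the prefix, so strings that share a long common start but differ
    // near the end of the prefix are rejected after very few comparisons.
    public static bool StartsWithOrdinalFromEnd(string text, string prefix)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (prefix == null) throw new ArgumentNullException(nameof(prefix));
        if (prefix.Length > text.Length) return false;

        for (int i = prefix.Length - 1; i >= 0; i--)
        {
            if (text[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A full match still has to touch every char of the prefix, so it does the same amount of comparison work as a forward scan, just in reverse order.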

benaadams · 2016-01-18T13:21:18Z

Need to look into this further.

bbowyersmyth · 2016-01-19T02:36:57Z

Mind if I take a look at this, @benaadams? It looks like a variation of EqualsHelper performs pretty well up to 100 chars and might be usable for Equals itself while still remaining pretty simple code.

benaadams · 2016-01-19T03:08:04Z

@bbowyersmyth sure; I made some changes to exactly mirror the calling function (validation, early comparisons, full case, etc.): changed to gotos, changed statements to if rather than while, added an extra start char* == check to go aligned, removed the large unroll (changed the second to be a while loop)... and...

I saw the start-up perf drop off, so the original and the changed versions were more or less equal and it behaved as in @jkotas's opening statement, with the newer loop unrolling pulling ahead as the strings got longer, but otherwise the same.

I'm not currently sure why though; the validation bit seems to change the function call from a regular pass-through to one that pushes 8+ registers to the stack and then pops the same registers back at the end. I didn't have time to look deeper.

The loop unrolling and gotos seem to produce nice, clean and fast asm though, for that portion at least; unless the use of goto triggers some kind of stack canary / protection and that's what I'm seeing? As does the call to native in the original?

Faster string ordinal check

3ddf550

dnfclas added the cla-already-signed label Jan 14, 2016

benaadams mentioned this pull request Jan 14, 2016

Faster StartsWithSegments Ordinal check aspnet/HttpAbstractions#521

Closed

benaadams changed the title from "String.StartsWith Ordinal performance" to "String.StartsWith Ordinal optimization pt 2" Jan 15, 2016

jkotas reviewed Jan 15, 2016

justinvp reviewed Jan 15, 2016

benaadams closed this Jan 18, 2016

bbowyersmyth mentioned this pull request Jan 23, 2016

String.StartsWith performance - OrdinalCompareSubstring #2825

Merged

benaadams deleted the faster-starts-with branch September 8, 2016 07:07

bbowyersmyth mentioned this pull request Jan 31, 2020

Possible JIT improvement for loops with return statements dotnet/runtime#7474

Closed
