[stdlib] Implement native grapheme breaking for String #37864

Merged
Azoy merged 9 commits into swiftlang:main from the native-grapheme-breaking branch
Nov 1, 2021

Azoy
Contributor

@Azoy Azoy commented Jun 10, 2021

This is a draft and will see many updates, but wanted to put this up to get some benchmark readings.

Resolves: rdar://52194063, https://bugs.swift.org/browse/SR-9423
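
For readers unfamiliar with the term, grapheme breaking is what decides where one `Character` ends and the next begins. The following is a minimal sketch of the user-visible behavior only, using public `String` API; it does not show how this PR implements the breaking, and the sample strings are illustrative choices, not taken from the patch.

```swift
// Each Character is one extended grapheme cluster, regardless of how many
// Unicode scalars it contains.
let accented = "e\u{301}"                                   // e + COMBINING ACUTE ACCENT
let flags = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}"          // two regional-indicator pairs
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"  // man ZWJ woman ZWJ girl
let crlf = "\r\n"                                           // CR followed by LF

for s in [accented, flags, family, crlf] {
  // Character count, then Unicode scalar count.
  print(s.count, s.unicodeScalars.count)
}
// prints:
// 1 2
// 2 4
// 1 5
// 1 2
```

The gap between the two counts is the work a grapheme breaker has to do on every `Character`-level operation.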

@Azoy
Contributor Author

Azoy commented Jun 10, 2021

@swift-ci Please benchmark

@Azoy
Contributor Author

Azoy commented Jun 10, 2021

@swift-ci Please test

@Azoy
Contributor Author

Azoy commented Jun 11, 2021

@swift-ci Please benchmark

@Azoy Azoy force-pushed the native-grapheme-breaking branch 3 times, most recently from 753875e to 17c277f Compare June 22, 2021 16:57
@swiftlang swiftlang deleted a comment from swift-ci Jun 22, 2021
@Azoy Azoy force-pushed the native-grapheme-breaking branch from 17c277f to c05e787 Compare June 22, 2021 18:12
@swiftlang swiftlang deleted a comment from swift-ci Jun 23, 2021
@Azoy
Contributor Author

Azoy commented Jun 23, 2021

@swift-ci please benchmark

@Azoy Azoy force-pushed the native-grapheme-breaking branch from c05e787 to 716d6f6 Compare August 10, 2021 05:41
@Azoy Azoy force-pushed the native-grapheme-breaking branch from 716d6f6 to 9004029 Compare September 24, 2021 23:36
@Azoy Azoy force-pushed the native-grapheme-breaking branch from 9004029 to effbcb2 Compare October 4, 2021 07:48
Member

@milseman milseman left a comment

This is a quick preliminary review, so many of my suggestions are vague and might not actually make sense. But, hopefully it gives you something to chew on or encourages some refactoring or clarifying.

Looking good so far though!

@Azoy Azoy force-pushed the native-grapheme-breaking branch from effbcb2 to aa8da8f Compare October 14, 2021 21:34
@Azoy Azoy marked this pull request as ready for review October 17, 2021 23:42
Member

@milseman milseman left a comment

Quick review. This is looking a lot cleaner, thanks for scrapping the iterator pattern!

Refactoring out state helps with the clarity quite a bit. It also shows that index does seem like a different bit of state than the others, so we might pull that out and put it back on the walker or somewhere. Something to consider.

@Azoy Azoy changed the title [WIP][stdlib] Implement native grapheme breaking for String [stdlib] Implement native grapheme breaking for String Oct 18, 2021
@Azoy Azoy force-pushed the native-grapheme-breaking branch from aa8da8f to c385938 Compare October 18, 2021 20:04
@Azoy
Contributor Author

Azoy commented Oct 18, 2021

@swift-ci please benchmark

@Azoy Azoy force-pushed the native-grapheme-breaking branch from c385938 to b3bb9ac Compare October 18, 2021 22:47
@Azoy
Contributor Author

Azoy commented Oct 18, 2021

@swift-ci please benchmark

@Azoy
Contributor Author

Azoy commented Oct 21, 2021

@swift-ci please benchmark

@Azoy Azoy force-pushed the native-grapheme-breaking branch from 2d04fd3 to 58fb22c Compare October 25, 2021 23:18
Member

@milseman milseman left a comment

Sorry it's hasty, but I wanted to get you some feedback quickly.

Parameterize nextBoundary and previousBoundary
@Azoy Azoy force-pushed the native-grapheme-breaking branch from 58fb22c to 85352c2 Compare October 29, 2021 04:54
@Azoy
Contributor Author

Azoy commented Oct 29, 2021

@swift-ci please test

Member

@milseman milseman left a comment

This is very close!

For landing, I think we can simplify a lot of the state that's passed around and mutated, especially when going backwards.

For future work, I'm interested in splitting this between forwards and backwards a little (since I think that will actually simplify the logic), but I don't want to hold up landing this for that. Those comments are written with "Future note:" at the front.

// | = We found our starting .extendedPictographic letting us
// know that we are in an emoji sequence so our initial
// break question is answered as NO.
internal func checkIfInEmojiSequence(
Member

Future note: One thing that isn't clear to me is how this is different than the walking-backwards loop we're doing anyways inside of previousBoundary.

// GB11
case (.zwj, .extendedPictographic):
  if state.isBackwards {
    checkIfInEmojiSequence(&state, index)
Member

Future note: If we're going backwards, seems like the return value is "yes, unless you're in an emoji sequence".
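
To make "yes, unless you're in an emoji sequence" concrete, here is a minimal sketch of a backwards GB11 check. It is not the code from this PR: `BreakProperty` and `shouldBreakAfterZWJ` are hypothetical names, and the input is assumed to be a pre-classified array rather than a real string.

```swift
// Hypothetical, simplified break properties; the real tables are far larger.
enum BreakProperty { case other, extend, zwj, extendedPictographic }

/// Decides the boundary between a `.zwj` at `zwjIndex` and an
/// `.extendedPictographic` that follows it, looking only at earlier scalars.
func shouldBreakAfterZWJ(_ props: [BreakProperty], zwjIndex: Int) -> Bool {
  precondition(props[zwjIndex] == .zwj)
  var i = zwjIndex
  while i > 0 {
    i -= 1
    switch props[i] {
    case .extend:
      continue            // GB11 allows any number of Extend before the ZWJ
    case .extendedPictographic:
      return false        // ExtPict Extend* ZWJ x ExtPict: do not break
    default:
      return true         // anything else: ordinary break (GB999)
    }
  }
  return true             // reached the start without finding a pictographic
}
```

For example, `[.extendedPictographic, .extend, .zwj, .extendedPictographic]` with `zwjIndex: 2` yields `false` (no break), while `[.other, .zwj, .extendedPictographic]` with `zwjIndex: 1` yields `true`.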

// state variable to false after every decision of 'shouldBreak'. If we
// happen to see a rhs .extend or .zwj, then it's a signal that we should
// continue treating the current grapheme cluster as an emoji sequence.
var enterEmojiSequence = false
Member

@milseman milseman Oct 29, 2021

Future note: Is there a way to simplify this logic? It seems to only be essential to forwards-state.
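
One way to picture a forwards-only flag of this kind is the sketch below. It is an assumption-laden illustration, not the PR's implementation: `GCB`, `ForwardState`, `inEmojiSequence`, and `boundaries` are hypothetical names, only GB9, GB11, and GB999 are modeled, and the input is a pre-classified array.

```swift
// Hypothetical, simplified break properties (not the stdlib's real type).
enum GCB { case other, extend, zwj, extendedPictographic }

// Hypothetical forward-only state: true while the scalars seen so far in the
// current cluster match "ExtPict Extend*" or "ExtPict Extend* ZWJ".
struct ForwardState { var inEmojiSequence = false }

func shouldBreak(_ lhs: GCB, _ rhs: GCB, _ state: inout ForwardState) -> Bool {
  // Read the flag, then reset it so every decision starts from false.
  let wasInEmojiSequence = state.inEmojiSequence
  state.inEmojiSequence = false

  switch (lhs, rhs) {
  case (_, .extend), (_, .zwj):
    // GB9: never break before Extend or ZWJ. Keep the flag alive only if
    // lhs is the pictographic itself or an Extend that followed one.
    state.inEmojiSequence =
      lhs == .extendedPictographic || (lhs == .extend && wasInEmojiSequence)
    return false
  case (.zwj, .extendedPictographic):
    return !wasInEmojiSequence   // GB11
  default:
    return true                  // GB999
  }
}

// Offsets at which a new cluster starts (offset 0 is implicit).
func boundaries(_ props: [GCB]) -> [Int] {
  guard props.count > 1 else { return [] }
  var state = ForwardState()
  var result: [Int] = []
  for i in 1..<props.count {
    if shouldBreak(props[i - 1], props[i], &state) { result.append(i) }
  }
  return result
}
```

Walking forwards, the flag is enough because the pictographic has already been seen by the time the ZWJ arrives; walking backwards there is no such history, which is why a separate look-behind is needed in that direction.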

@Azoy
Contributor Author

Azoy commented Oct 29, 2021

@swift-ci please test

@Azoy
Contributor Author

Azoy commented Oct 29, 2021

@swift-ci please benchmark

Member

@milseman milseman left a comment

LGTM. Just a couple places for comments. I think we should try to merge this soon. There's more future work we could do, but this is great for now!

let cocoa = _object.cocoaObject
// GB11
case (.zwj, .extendedPictographic):
  if isBackwards {
Member

Maybe add a comment describing what we're doing and why

// GB12 & GB13
case (.regionalIndicator, .regionalIndicator):
  if isBackwards {
    return countRIs(index)
Member

Maybe a comment, or a more descriptive name like checkPairedRIs.
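
As background for the naming question, GB12 and GB13 pair regional indicators two at a time, so a backwards walk has to know whether the indicator on the left closes a pair. The sketch below illustrates that parity check under stated assumptions: `RIProperty` and `shouldBreakBetweenRIs` are hypothetical names, and this is not claimed to be what `countRIs` does internally.

```swift
// Hypothetical, simplified break properties.
enum RIProperty { case other, regionalIndicator }

/// Boundary decision between two adjacent regional indicators, where
/// `lhsIndex` is the position of the left one.
func shouldBreakBetweenRIs(_ props: [RIProperty], lhsIndex: Int) -> Bool {
  // Count the run of consecutive regional indicators ending at lhsIndex.
  var count = 0
  var i = lhsIndex
  while i >= 0 && props[i] == .regionalIndicator {
    count += 1
    i -= 1
  }
  // An odd run means lhs is the first half of a pair, so rhs completes it
  // and there is no break. An even run means lhs closed a pair: break.
  return count % 2 == 0
}
```

With four indicators in a row, this reports a break only between the second and third, which matches two flags side by side.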

@swift-ci
Contributor

Performance (x86_64): -O

Regression OLD NEW DELTA RATIO
LessSubstringSubstring 22 29 +31.8% 0.76x
EqualSubstringSubstringGenericEquatable 22 29 +31.8% 0.76x
LessSubstringSubstringGenericComparable 22 29 +31.8% 0.76x
EqualSubstringSubstring 22 28 +27.3% 0.79x
EqualStringSubstring 22 28 +27.3% 0.79x
EqualSubstringString 22 28 +27.3% 0.79x
StringComparison_longSharedPrefix 293 330 +12.6% 0.89x (?)
SortStringsUnicode 2045 2255 +10.3% 0.91x
SubstringEqualString 142 155 +9.2% 0.92x
 
Improvement OLD NEW DELTA RATIO
SubstringTrimmingASCIIWhitespace 441 122 -72.3% 3.61x
StringHasSuffixUnicode 142000 56000 -60.6% 2.54x
StringHasPrefixUnicode 83000 34000 -59.0% 2.44x
CSVParsing.Char 225 140 -37.8% 1.61x
StringUTF16SubstringBuilder 2520 1620 -35.7% 1.56x
CharacterPropertiesPrecomputed 850 610 -28.2% 1.39x
CharacterPropertiesStashed 790 580 -26.6% 1.36x
CharacterPropertiesStashedMemo 930 690 -25.8% 1.35x
LineSink.characters.alpha 81 61 -24.7% 1.33x
Breadcrumbs.MutatedUTF16ToIdx.Mixed 232 193 -16.8% 1.20x
LineSink.characters.complex 428 362 -15.4% 1.18x
Breadcrumbs.MutatedIdxToUTF16.Mixed 235 199 -15.3% 1.18x
RemoveWhereQuadraticString 196 166 -15.3% 1.18x
CharacterPropertiesFetch 2200 1960 -10.9% 1.12x
StringAdder 266 237 -10.9% 1.12x (?)
StringBuilderSmallReservingCapacity 213 190 -10.8% 1.12x
StringBuilder 205 184 -10.2% 1.11x
StringUTF16Builder 220 200 -9.1% 1.10x
StringInterpolationSmall 1150 1070 -7.0% 1.07x (?)
DropWhileAnySequence 1353 1260 -6.9% 1.07x (?)
ArraySetElement 304 284 -6.6% 1.07x (?)

Code size: -O

Regression OLD NEW DELTA RATIO
RC4.o 3425 3665 +7.0% 0.93x
 
Improvement OLD NEW DELTA RATIO
UTF8Decode.o 23381 23109 -1.2% 1.01x

Performance (x86_64): -Osize

Regression OLD NEW DELTA RATIO
SuffixAnySequence 122 1900 +1457.4% 0.06x
SuffixSequence 131 1901 +1351.1% 0.07x
SuffixSequenceLazy 131 1788 +1264.9% 0.07x
SuffixArrayLazy 5 9 +80.0% 0.56x
FlattenListLoop 943 1386 +47.0% 0.68x (?)
LessSubstringSubstring 22 29 +31.8% 0.76x
EqualStringSubstring 22 29 +31.8% 0.76x
EqualSubstringSubstringGenericEquatable 22 29 +31.8% 0.76x
EqualSubstringString 22 29 +31.8% 0.76x
LessSubstringSubstringGenericComparable 22 29 +31.8% 0.76x
EqualSubstringSubstring 23 29 +26.1% 0.79x
StringComparison_longSharedPrefix 295 331 +12.2% 0.89x (?)
UTF8Decode_InitFromCustom_noncontiguous 257 287 +11.7% 0.90x (?)
UTF8Decode_InitFromCustom_noncontiguous_ascii 620 689 +11.1% 0.90x
SuffixAnySequenceLazy 3037 3357 +10.5% 0.90x (?)
UTF8Decode_InitFromCustom_noncontiguous_ascii_as_ascii 717 788 +9.9% 0.91x (?)
RemoveWhereMoveInts 21 23 +9.5% 0.91x (?)
SortStringsUnicode 2055 2245 +9.2% 0.92x
ArraySetElement 296 322 +8.8% 0.92x (?)
StringWalk 3240 3520 +8.6% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
SubstringTrimmingASCIIWhitespace 443 125 -71.8% 3.54x
StringHasSuffixUnicode 142000 56000 -60.6% 2.54x
StringHasPrefixUnicode 83000 34000 -59.0% 2.44x
CSVParsing.Char 226 143 -36.7% 1.58x
StringUTF16SubstringBuilder 2570 1640 -36.2% 1.57x
CharacterPropertiesPrecomputed 860 620 -27.9% 1.39x
CharacterPropertiesStashed 790 580 -26.6% 1.36x
LineSink.characters.alpha 82 61 -25.6% 1.34x
CharacterPropertiesStashedMemo 930 700 -24.7% 1.33x
SuffixAnyCollection 36 30 -16.7% 1.20x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 231 194 -16.0% 1.19x
LineSink.characters.complex 426 359 -15.7% 1.19x
Breadcrumbs.MutatedIdxToUTF16.Mixed 233 199 -14.6% 1.17x (?)
RemoveWhereQuadraticString 199 170 -14.6% 1.17x
CharacterPropertiesFetch 2200 1960 -10.9% 1.12x
StringBuilderSmallReservingCapacity 217 194 -10.6% 1.12x (?)
StringBuilder 210 188 -10.5% 1.12x (?)
StringUTF16Builder 220 200 -9.1% 1.10x
StringAdder 269 245 -8.9% 1.10x (?)
StringInterpolationSmall 1130 1040 -8.0% 1.09x (?)

Code size: -Osize

Regression OLD NEW DELTA RATIO
Suffix.o 18820 24636 +30.9% 0.76x
RC4.o 3000 3202 +6.7% 0.94x
 
Improvement OLD NEW DELTA RATIO
UTF8Decode.o 21726 21407 -1.5% 1.01x

Performance (x86_64): -Onone

Regression OLD NEW DELTA RATIO
LessSubstringSubstringGenericComparable 49 56 +14.3% 0.88x
EqualSubstringSubstringGenericEquatable 50 56 +12.0% 0.89x (?)
LessSubstringSubstring 52 58 +11.5% 0.90x (?)
EqualSubstringSubstring 52 58 +11.5% 0.90x
EqualStringSubstring 52 58 +11.5% 0.90x
EqualSubstringString 52 58 +11.5% 0.90x
 
Improvement OLD NEW DELTA RATIO
StringHasSuffixUnicode 145000 59000 -59.3% 2.46x
StringHasPrefixUnicode 86000 37000 -57.0% 2.32x
SubstringTrimmingASCIIWhitespace 964 579 -39.9% 1.66x
LineSink.characters.alpha 109 88 -19.3% 1.24x
StringUTF16SubstringBuilder 7250 5960 -17.8% 1.22x
CSVParsing.Char 524 434 -17.2% 1.21x
Breadcrumbs.MutatedIdxToUTF16.Mixed 245 209 -14.7% 1.17x (?)
Breadcrumbs.MutatedUTF16ToIdx.Mixed 232 198 -14.7% 1.17x
CharacterPropertiesStashed 1520 1300 -14.5% 1.17x (?)
LineSink.characters.complex 502 434 -13.5% 1.16x
CharacterPropertiesPrecomputed 1960 1730 -11.7% 1.13x (?)
CharacterPropertiesStashedMemo 2230 2000 -10.3% 1.11x (?)
ArrayOfPOD 733 669 -8.7% 1.10x (?)
CharacterPropertiesFetch 2950 2700 -8.5% 1.09x (?)
NSError 476 444 -6.7% 1.07x (?)

Code size: -swiftlibs

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).
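
The DELTA and RATIO columns in the tables above are consistent with the two formulas sketched here. This is an inference from the reported rows, not a description of the benchmark driver's source; `delta` and `ratio` are hypothetical helper names.

```swift
// Percentage change from OLD to NEW; positive means NEW is larger (slower).
func delta(old: Double, new: Double) -> Double { (new - old) / old * 100 }

// OLD divided by NEW; above 1 means an improvement, below 1 a regression.
func ratio(old: Double, new: Double) -> Double { old / new }

let regression = (delta(old: 22, new: 29), ratio(old: 22, new: 29))
let improvement = (delta(old: 441, new: 122), ratio(old: 441, new: 122))
print(regression)   // roughly (31.8, 0.76), as in the LessSubstringSubstring row
print(improvement)  // roughly (-72.3, 3.61), as in the SubstringTrimmingASCIIWhitespace row
```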

Hardware Overview
  Model Name: Mac mini
  Model Identifier: Macmini8,1
  Processor Name: 6-Core Intel Core i7
  Processor Speed: 3.2 GHz
  Number of Processors: 1
  Total Number of Cores: 6
  L2 Cache (per Core): 256 KB
  L3 Cache: 12 MB
  Memory: 64 GB

@swift-ci
Contributor

Build failed
Swift Test Linux Platform
Git Sha - 16abfe5

@milseman
Member

@swift-ci please test linux platform

@xwu
Collaborator

xwu commented Oct 30, 2021

Are the -Osize performance and/or code size regressions in the x86_64 Suffix* benchmarks a problem?

@Azoy
Contributor Author

Azoy commented Oct 30, 2021

I have no idea why those benchmarks are showing as a regression in this patch (and it seems they showed a 3x improvement with the normalization patch). Considering they don't even touch strings at all, I'm inclined to say they are not an issue, but I could be wrong.

@Azoy
Contributor Author

Azoy commented Nov 1, 2021

Here are some benchmarks that I ran locally with the StringWalk benchmarks enabled:

Performance (arm64): -O

Regression OLD NEW DELTA RATIO
ArrayAppendGenericStructs 920 1220 +32.6% 0.75x (?)
CharIteration_utf16_unicodeScalars 1720 2040 +18.6% 0.84x (?)
ParseFloat.Float.Exp 6 7 +16.7% 0.86x (?)
Set.isSubset.Seq.Int.Empty 82 93 +13.4% 0.88x (?)
Set.isStrictSubset.Seq.Int.Empty 82 93 +13.4% 0.88x (?)
ArrayPlusEqualArrayOfInt 180 200 +11.1% 0.90x (?)
ArrayAppendToGeneric 190 210 +10.5% 0.90x (?)
BufferFillFromSlice 10 11 +10.0% 0.91x (?)
StringWalk_punctuated_characters_Backwards 440 480 +9.1% 0.92x (?)
StringWalk_ascii_characters_Backwards 1800 1960 +8.9% 0.92x (?)
DataCreateEmptyArray 650 700 +7.7% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
SubstringTrimmingASCIIWhitespace 314 86 -72.6% 3.65x
StringWalk_utf16_characters 26080 7200 -72.4% 3.62x
StringWalk_utf16_characters_Backwards 27240 10240 -62.4% 2.66x
SuffixCountableRange 13 6 -53.8% 2.17x
StringHasSuffixUnicode 100000 48000 -52.0% 2.08x (?)
Data.append.Sequence.64kB.Count.RE.I 2 1 -50.0% 2.00x (?)
StringHasPrefixUnicode 60000 32000 -46.7% 1.87x (?)
CSVParsing.Char 151 94 -37.7% 1.61x
StringUTF16SubstringBuilder 1620 1090 -32.7% 1.49x (?)
CharacterPropertiesStashed 550 380 -30.9% 1.45x (?)
CharacterPropertiesPrecomputed 690 520 -24.6% 1.33x
LineSink.characters.alpha 52 40 -23.1% 1.30x
ArrayAppendOptionals 920 710 -22.8% 1.30x (?)
StringWalk_tweet_characters_Backwards 4920 4040 -17.9% 1.22x (?)
CharacterPropertiesStashedMemo 800 660 -17.5% 1.21x (?)
LineSink.characters.complex 277 230 -17.0% 1.20x (?)
DataCreateMedium 1200 1000 -16.7% 1.20x (?)
CharacterPropertiesFetch 1510 1350 -10.6% 1.12x (?)
StringWalk_tweet_characters 3320 3000 -9.6% 1.11x (?)
NSArray.bridged.objectAtIndex 229 208 -9.2% 1.10x (?)
ParseInt.UInt64.Hex 243 222 -8.6% 1.09x (?)

Code size: -O

Performance (arm64): -Osize

Regression OLD NEW DELTA RATIO
SuffixSequenceLazy 117 828 +607.7% 0.14x
SuffixAnySequence 117 802 +585.5% 0.15x
SuffixSequence 117 802 +585.5% 0.15x
DataCreateEmptyArray 800 950 +18.7% 0.84x (?)
DataToStringEmpty 400 450 +12.5% 0.89x (?)
SuffixAnySequenceLazy 1633 1814 +11.1% 0.90x (?)
DataCreateSmallArray 1650 1800 +9.1% 0.92x (?)
StringWalk_ascii_characters_Backwards 1880 2040 +8.5% 0.92x (?)
StringWalk_punctuated_characters_Backwards 480 520 +8.3% 0.92x (?)
StringWalk_punctuatedJapanese_characters_Backwards 480 520 +8.3% 0.92x (?)
SortArrayInClass 58036 62725 +8.1% 0.93x (?)
 
Improvement OLD NEW DELTA RATIO
SubstringTrimmingASCIIWhitespace 317 88 -72.2% 3.60x
StringWalk_utf16_characters 26200 7280 -72.2% 3.60x
StringWalk_utf16_characters_Backwards 27160 10280 -62.2% 2.64x
StringHasSuffixUnicode 101000 47000 -53.5% 2.15x (?)
StringHasPrefixUnicode 60000 32000 -46.7% 1.87x (?)
CSVParsing.Char 158 101 -36.1% 1.56x
StringUTF16SubstringBuilder 1670 1130 -32.3% 1.48x (?)
CharacterPropertiesStashedMemo 840 580 -31.0% 1.45x (?)
CharacterPropertiesStashed 630 440 -30.2% 1.43x (?)
SIMDReduce.Int8x16.Cast 43 31 -27.9% 1.39x (?)
CharacterPropertiesPrecomputed 680 500 -26.5% 1.36x
LineSink.characters.alpha 57 45 -21.1% 1.27x
StringWalk_tweet_characters_Backwards 5120 4240 -17.2% 1.21x (?)
LineSink.characters.complex 294 247 -16.0% 1.19x (?)
CharacterPropertiesFetch 1550 1380 -11.0% 1.12x (?)
StringWalk_tweet_characters 3560 3240 -9.0% 1.10x (?)
ConvertFloatingPoint.MockFloat64ToInt64 433 402 -7.2% 1.08x (?)
FindString.Rec3.Substring 75 70 -6.7% 1.07x (?)

Code size: -Osize

Performance (arm64): -Onone

Regression OLD NEW DELTA RATIO
Breadcrumbs.MutatedUTF16ToIdx.ASCII 2 3 +50.0% 0.67x (?)
ArrayOfGenericPOD2 902 1029 +14.1% 0.88x (?)
String.replaceSubrange.String 12 13 +8.3% 0.92x (?)
RawBufferCopyBytes 12 13 +8.3% 0.92x (?)
 
Improvement OLD NEW DELTA RATIO
StringWalk_utf16_characters 29720 10360 -65.1% 2.87x
StringWalk_utf16_characters_Backwards 31560 13960 -55.8% 2.26x
StringHasSuffixUnicode 102000 49000 -52.0% 2.08x (?)
StringHasPrefixUnicode 61000 33000 -45.9% 1.85x (?)
SubstringTrimmingASCIIWhitespace 648 401 -38.1% 1.62x (?)
ArrayPlusEqualArrayOfInt 220 170 -22.7% 1.29x (?)
CharacterPropertiesStashed 1010 830 -17.8% 1.22x (?)
CSVParsing.Char 334 275 -17.7% 1.21x (?)
StringUTF16SubstringBuilder 4390 3710 -15.5% 1.18x (?)
LineSink.characters.alpha 71 61 -14.1% 1.16x (?)
CharacterPropertiesPrecomputed 1430 1230 -14.0% 1.16x (?)
LineSink.characters.complex 330 285 -13.6% 1.16x (?)
CharacterPropertiesStashedMemo 1700 1500 -11.8% 1.13x (?)
CharacterPropertiesFetch 1940 1760 -9.3% 1.10x (?)
StringWalk_tweet_characters_Backwards 13680 12680 -7.3% 1.08x (?)
DataReplaceSmall 1500 1400 -6.7% 1.07x (?)

Code size: -swiftlibs

Hardware Overview
  Model Name: MacBook Pro
  Model Identifier: MacBookPro17,1
  Total Number of Cores: 8 (4 performance and 4 efficiency)
  Memory: 16 GB
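
For context on the `StringWalk_*_characters` and `*_Backwards` rows, the sketch below shows the kind of forward and backward `Character` walk that depends on grapheme breaking in each direction. It is an illustration with public `String` API and a made-up sample string; it is an assumption that the benchmarks do something comparable, and this is not their source.

```swift
let text = "cafe\u{301} \u{1F1FA}\u{1F1F8}"  // "café" with a combining accent, a space, a flag

// Forward walk: each index(after:) has to find the next grapheme boundary.
var forward: [Character] = []
var i = text.startIndex
while i < text.endIndex {
  forward.append(text[i])
  i = text.index(after: i)
}

// Backward walk: each index(before:) has to find the previous boundary.
var backward: [Character] = []
var j = text.endIndex
while j > text.startIndex {
  j = text.index(before: j)
  backward.append(text[j])
}

print(forward.count, backward.count)  // prints: 6 6
```

Both walks visit the same six characters, just in opposite order, so any asymmetry in cost between them comes from the boundary search itself.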

@Azoy Azoy merged commit 5a0bbb9 into swiftlang:main Nov 1, 2021
@Azoy Azoy deleted the native-grapheme-breaking branch November 1, 2021 23:52