[do not merge] Evaluate the hot/cold splitting pass #21016

Closed
wants to merge 1 commit into from

@vedantk vedantk commented Dec 4, 2018

This PR is a sanity check for hot/cold splitting in the Swift compiler. It's not meant to be merged. The goal is to get a rough idea of the effectiveness of the pass by gathering some basic performance numbers.

Caveats:

  • Outlined cold code will not be relocated towards the end of the text segment, as we cannot test with a modified linker. Based on prior experiments, we expect the lack of relocation to lower performance.
  • No Swift-specific heuristics for marking cold basic blocks are being evaluated (that might be an interesting follow-up).

(cherry picked from commit a5e427732d08c35bc2a67d10f8d5140475a02e01)
@vedantk vedantk requested a review from a team as a code owner December 4, 2018 21:53

vedantk commented Dec 4, 2018

apple/swift-llvm#127
@swift-ci Please smoke benchmark

vedantk commented Dec 4, 2018

@swift-ci Please clean smoke test OS X platform

vedantk commented Dec 4, 2018

apple/swift-llvm#127
@swift-ci Please smoke test OS X platform

vedantk commented Dec 4, 2018

There are some decent performance improvements in a few benchmarks (NopDeinit is 1.31x faster at -O) mixed with a few regressions (Walsh is 0.85x as fast). As mentioned in the PR description, using a modified linker which co-locates cold/outlined symbols should give a significant improvement here.

Hot/cold splitting seems to have a negative effect on code size, especially with integer-heavy benchmarks which (presumably) contain many outlinable traps. Tweaking the outlining code size threshold should improve the results. If we ever want this optimization in Swift, we might consider disabling it at -Osize.
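
To make the threshold remark concrete, here is a toy cost model of why outlining many tiny cold blocks (such as traps) can grow code size. It is only a sketch: `CALL_BYTES`, `FUNC_OVERHEAD`, the `split` helper and the region sizes are all made-up assumptions for illustration, not values or heuristics taken from LLVM's splitting pass.

```python
# Toy cost model. Every byte count below is an assumption for illustration,
# not a number taken from LLVM or from the benchmark results.
CALL_BYTES = 5       # assumed size of the call left behind in the hot function
FUNC_OVERHEAD = 12   # assumed fixed cost (prologue, epilogue, padding) of a new function


def split(cold_region_sizes, threshold):
    """Outline every cold region of at least `threshold` bytes.

    Returns (total_growth, hot_bytes_removed):
      total_growth      - bytes added to the object file overall
      hot_bytes_removed - bytes taken out of the hot functions
    """
    total_growth = 0
    hot_bytes_removed = 0
    for size in cold_region_sizes:
        if size < threshold:
            continue  # too small to be worth outlining, leave it inline
        # The region's bytes move to a new function, which adds fixed overhead,
        # and a call is left behind in the hot function.
        total_growth += CALL_BYTES + FUNC_OVERHEAD
        hot_bytes_removed += size - CALL_BYTES
    return total_growth, hot_bytes_removed


# Hypothetical object file: 40 tiny trap blocks plus two large cold paths.
regions = [2] * 40 + [64, 120]

print(split(regions, threshold=0))   # outline everything
print(split(regions, threshold=16))  # outline only the large regions
```

Under these assumed costs, outlining everything adds far more bytes overall and removes fewer bytes from the hot functions than outlining only the two large regions, which is the kind of effect a higher threshold is meant to avoid.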

swift-ci commented Dec 4, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
Walsh 357 421 +17.9% 0.85x
IterateData 1398 1566 +12.0% 0.89x
Improvement
NopDeinit 59577 45620 -23.4% 1.31x
StringBuilderSmallReservingCapacity 387 361 -6.7% 1.07x
StringAdder 477 445 -6.7% 1.07x
StringBuilder 375 350 -6.7% 1.07x

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
SortIntPyramids.o 12661 17528 +38.4% 0.72x
SortLettersInPlace.o 8879 11818 +33.1% 0.75x
NibbleSort.o 12314 16389 +33.1% 0.75x
StaticArray.o 14045 18304 +30.3% 0.77x
RGBHistogram.o 27556 34120 +23.8% 0.81x
RangeAssignment.o 4940 6005 +21.6% 0.82x
Histogram.o 4187 5040 +20.4% 0.83x
WordCount.o 44244 52095 +17.7% 0.85x
StringEdits.o 12935 14809 +14.5% 0.87x
DriverUtils.o 153141 174024 +13.6% 0.88x
Walsh.o 9164 10373 +13.2% 0.88x
DictionaryLiteral.o 1360 1538 +13.1% 0.88x
RemoveWhere.o 26695 30089 +12.7% 0.89x
ArrayOfGenericRef.o 15036 16927 +12.6% 0.89x
SequenceAlgos.o 20731 23326 +12.5% 0.89x
SortStrings.o 27936 31147 +11.5% 0.90x
PopFrontGeneric.o 4734 5255 +11.0% 0.90x
ReversedCollections.o 11179 12391 +10.8% 0.90x
SortLargeExistentials.o 20694 22909 +10.7% 0.90x
StringRemoveDupes.o 7568 8378 +10.7% 0.90x
ArrayOfRef.o 12338 13653 +10.7% 0.90x
CSVParsing.o 31913 35300 +10.6% 0.90x
ClassArrayGetter.o 5639 6237 +10.6% 0.90x
Substring.o 18215 20077 +10.2% 0.91x
Phonebook.o 11660 12844 +10.2% 0.91x
HashQuadratic.o 5508 6061 +10.0% 0.91x
RandomShuffle.o 3691 4060 +10.0% 0.91x
TwoSum.o 5540 6093 +10.0% 0.91x
PopFront.o 5213 5726 +9.8% 0.91x
COWTree.o 13188 14431 +9.4% 0.91x
DictionaryRemove.o 17158 18728 +9.2% 0.92x
UTF8Decode.o 12378 13463 +8.8% 0.92x
StringInterpolation.o 7355 7991 +8.6% 0.92x
DictionaryKeysContains.o 11815 12824 +8.5% 0.92x
DictTest2.o 15589 16891 +8.4% 0.92x
DictionaryCompactMapValues.o 21038 22769 +8.2% 0.92x
StringMatch.o 4430 4792 +8.2% 0.92x
DataBenchmarks.o 55956 60295 +7.8% 0.93x
NopDeinit.o 5552 5967 +7.5% 0.93x
Suffix.o 26345 28260 +7.3% 0.93x
DictionaryCopy.o 8560 9177 +7.2% 0.93x
DictTest.o 19191 20474 +6.7% 0.94x
DictionarySwap.o 27687 29516 +6.6% 0.94x
DropLast.o 26451 28166 +6.5% 0.94x
StringBuilder.o 7338 7774 +5.9% 0.94x
DictionaryOfAnyHashableStrings.o 11101 11734 +5.7% 0.95x
LuhnAlgoLazy.o 10996 11599 +5.5% 0.95x
LuhnAlgoEager.o 10998 11601 +5.5% 0.95x
DictTest3.o 23877 25179 +5.5% 0.95x
Queue.o 14315 15094 +5.4% 0.95x
DictOfArraysToArrayOfDicts.o 30120 31705 +5.3% 0.95x
DictionaryGroup.o 17124 18019 +5.2% 0.95x
FlattenList.o 6312 6635 +5.1% 0.95x
Hash.o 39090 40972 +4.8% 0.95x
ExistentialPerformance.o 69131 72271 +4.5% 0.96x
ObjectiveCBridging.o 42872 44794 +4.5% 0.96x
DictTest4.o 25037 26121 +4.3% 0.96x
Radix2CooleyTukey.o 5070 5287 +4.3% 0.96x
RC4.o 4715 4908 +4.1% 0.96x
StringComparison.o 44294 46089 +4.1% 0.96x
DictTest4Legacy.o 26479 27547 +4.0% 0.96x
DictionarySubscriptDefault.o 29441 30621 +4.0% 0.96x
Ackermann.o 1852 1925 +3.9% 0.96x
RomanNumbers.o 5311 5497 +3.5% 0.97x
DictionaryBridgeToObjC.o 6165 6366 +3.3% 0.97x
SetTests.o 64400 66467 +3.2% 0.97x
ReduceInto.o 17929 18499 +3.2% 0.97x
ObjectiveCBridgingStubs.o 19315 19924 +3.2% 0.97x
RecursiveOwnedParameter.o 1382 1420 +2.7% 0.97x
Combos.o 7409 7604 +2.6% 0.97x
ArraySubscript.o 4028 4133 +2.6% 0.97x
CountAlgo.o 13373 13704 +2.5% 0.98x
Join.o 2288 2344 +2.4% 0.98x
DictionaryBridge.o 3374 3455 +2.4% 0.98x
ArrayAppend.o 39272 40211 +2.4% 0.98x
ObserverForwarderStruct.o 3594 3673 +2.2% 0.98x
CharacterProperties.o 19061 19457 +2.1% 0.98x
StringWalk.o 40666 41476 +2.0% 0.98x
MonteCarloE.o 3324 3389 +2.0% 0.98x
TestsUtils.o 23723 24150 +1.8% 0.98x
Prefix.o 24345 24780 +1.8% 0.98x
DropFirst.o 25044 25479 +1.7% 0.98x
ObserverClosure.o 3279 3335 +1.7% 0.98x
Array2D.o 4232 4304 +1.7% 0.98x
ObserverUnappliedMethod.o 5266 5351 +1.6% 0.98x
Hanoi.o 3601 3657 +1.6% 0.98x
Prims.o 42945 43600 +1.5% 0.98x
PrimsSplit.o 42997 43652 +1.5% 0.98x
ObserverPartiallyAppliedMethod.o 3567 3620 +1.5% 0.99x
BitCount.o 1876 1901 +1.3% 0.99x
RangeReplaceableCollectionPlusDefault.o 6317 6390 +1.2% 0.99x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
IterateData 1357 1566 +15.4% 0.87x
CaptureProp 4286 4857 +13.3% 0.88x
PrefixAnySeqCntRangeLazy 159 176 +10.7% 0.90x
CharIteration_russian_unicodeScalars_Backwards 5152 5629 +9.3% 0.92x
DataCountSmall 34 37 +8.8% 0.92x
DataCountMedium 37 40 +8.1% 0.93x
Improvement
NopDeinit 57656 44840 -22.2% 1.29x
BitCount 190 171 -10.0% 1.11x
FlattenListLoop 4431 4063 -8.3% 1.09x (?)
Array2D 7505 6909 -7.9% 1.09x
SortAdjacentIntPyramids 1314 1219 -7.2% 1.08x (?)
StringBuilder 370 344 -7.0% 1.08x
MapReduce 433 404 -6.7% 1.07x
MapReduceAnyCollection 436 407 -6.7% 1.07x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
StaticArray.o 13025 18104 +39.0% 0.72x
NibbleSort.o 14122 18789 +33.0% 0.75x
SortLettersInPlace.o 8862 11634 +31.3% 0.76x
RGBHistogram.o 27391 35059 +28.0% 0.78x
SortIntPyramids.o 12353 15736 +27.4% 0.79x
Histogram.o 4032 5008 +24.2% 0.81x
RangeAssignment.o 5101 6309 +23.7% 0.81x
Walsh.o 6156 7557 +22.8% 0.81x
SortStrings.o 28887 34643 +19.9% 0.83x
WordCount.o 40500 47679 +17.7% 0.85x
Phonebook.o 12132 14268 +17.6% 0.85x
Queue.o 13091 15270 +16.6% 0.86x
RandomShuffle.o 3767 4316 +14.6% 0.87x
ReversedCollections.o 11596 13237 +14.2% 0.88x
RemoveWhere.o 24438 27729 +13.5% 0.88x
HashQuadratic.o 5160 5853 +13.4% 0.88x
TwoSum.o 5373 6061 +12.8% 0.89x
DriverUtils.o 132709 148736 +12.1% 0.89x
DictionaryLiteral.o 1509 1690 +12.0% 0.89x
DictionaryRemove.o 15683 17536 +11.8% 0.89x
DictionaryGroup.o 16336 18235 +11.6% 0.90x
SequenceAlgos.o 21908 24447 +11.6% 0.90x
COWTree.o 13674 15253 +11.5% 0.90x
FlattenList.o 6696 7459 +11.4% 0.90x
ClassArrayGetter.o 5673 6301 +11.1% 0.90x
StringEdits.o 11982 13305 +11.0% 0.90x
PopFrontGeneric.o 4823 5351 +10.9% 0.90x
NopDeinit.o 6244 6927 +10.9% 0.90x
StringRemoveDupes.o 7577 8386 +10.7% 0.90x
ArrayOfRef.o 13162 14565 +10.7% 0.90x
SortLargeExistentials.o 21302 23565 +10.6% 0.90x
PopFront.o 5014 5542 +10.5% 0.90x
DictionaryKeysContains.o 11503 12712 +10.5% 0.90x
DictionaryCompactMapValues.o 19518 21569 +10.5% 0.90x
CaptureProp.o 1093 1200 +9.8% 0.91x
CSVParsing.o 31745 34828 +9.7% 0.91x
DictTest2.o 14466 15835 +9.5% 0.91x
UTF8Decode.o 11873 12959 +9.1% 0.92x
DictionarySwap.o 26871 29324 +9.1% 0.92x
Suffix.o 24577 26820 +9.1% 0.92x
DictionaryCopy.o 7945 8609 +8.4% 0.92x
StringMatch.o 4393 4760 +8.4% 0.92x
Substring.o 15906 17213 +8.2% 0.92x
StringBuilder.o 7206 7782 +8.0% 0.93x
DictTest.o 18034 19466 +7.9% 0.93x
DropLast.o 24331 26166 +7.5% 0.93x
DictionaryOfAnyHashableStrings.o 10757 11542 +7.3% 0.93x
RomanNumbers.o 5630 6009 +6.7% 0.94x
ReduceInto.o 13314 14203 +6.7% 0.94x
ArrayOfGenericRef.o 13636 14527 +6.5% 0.94x
Hash.o 20623 21964 +6.5% 0.94x
ExistentialPerformance.o 62683 66703 +6.4% 0.94x
ArrayOfPOD.o 2735 2910 +6.4% 0.94x
LazyFilter.o 8841 9404 +6.4% 0.94x
DictTest3.o 21794 23163 +6.3% 0.94x
DictionarySubscriptDefault.o 27513 29125 +5.9% 0.94x
RC4.o 3833 4052 +5.7% 0.95x
ChainedFilterMap.o 3204 3385 +5.6% 0.95x
DataBenchmarks.o 51029 53872 +5.6% 0.95x
CountAlgo.o 12944 13624 +5.3% 0.95x
DictTest4.o 20839 21897 +5.1% 0.95x
LuhnAlgoLazy.o 13940 14647 +5.1% 0.95x
LuhnAlgoEager.o 13942 14649 +5.1% 0.95x
SetTests.o 57712 60587 +5.0% 0.95x
DictTest4Legacy.o 23833 25011 +4.9% 0.95x
ObserverForwarderStruct.o 3838 4025 +4.9% 0.95x
DictOfArraysToArrayOfDicts.o 30608 32097 +4.9% 0.95x
MonteCarloE.o 3690 3869 +4.9% 0.95x
ObjectiveCBridging.o 40407 42290 +4.7% 0.96x
StringInterpolation.o 6690 6991 +4.5% 0.96x
Radix2CooleyTukey.o 4756 4951 +4.1% 0.96x
Prefix.o 22289 23188 +4.0% 0.96x
DropFirst.o 22412 23311 +4.0% 0.96x
RecursiveOwnedParameter.o 1313 1364 +3.9% 0.96x
ArrayAppend.o 32412 33607 +3.7% 0.96x
Ackermann.o 1957 2029 +3.7% 0.96x
ObjectiveCBridgingStubs.o 18411 19052 +3.5% 0.97x
ObserverUnappliedMethod.o 5571 5759 +3.4% 0.97x
ObserverClosure.o 3573 3688 +3.2% 0.97x
StringComparison.o 38926 40161 +3.2% 0.97x
ArraySubscript.o 3914 4037 +3.1% 0.97x
DictionaryBridgeToObjC.o 5997 6174 +3.0% 0.97x
Prims.o 39013 40120 +2.8% 0.97x
PrimsSplit.o 39065 40172 +2.8% 0.97x
RangeReplaceableCollectionPlusDefault.o 5757 5918 +2.8% 0.97x
ObserverPartiallyAppliedMethod.o 3927 4036 +2.8% 0.97x
Hanoi.o 3810 3913 +2.7% 0.97x
CharacterProperties.o 19317 19801 +2.5% 0.98x
StringWalk.o 34866 35732 +2.5% 0.98x
TestsUtils.o 18947 19398 +2.4% 0.98x
Combos.o 7809 7972 +2.1% 0.98x
Join.o 2565 2616 +2.0% 0.98x
PrefixWhile.o 21246 21609 +1.7% 0.98x
DictionaryBridge.o 3500 3559 +1.7% 0.98x
DropWhile.o 21932 22223 +1.3% 0.99x
MapReduce.o 26827 27174 +1.3% 0.99x
Array2D.o 4379 4432 +1.2% 0.99x
Fibonacci.o 1642 1661 +1.2% 0.99x
main.o 56785 57400 +1.1% 0.99x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Regression
ArrayOfPOD 774 860 +11.1% 0.90x (?)
Improvement
RemoveWhereMoveStrings 3261 2881 -11.7% 1.13x
RemoveWhereMoveInts 2733 2477 -9.4% 1.10x
Memset 12597 11446 -9.1% 1.10x
XorLoop 8021 7304 -8.9% 1.10x

Code size: -swiftlibs

TEST OLD NEW DELTA RATIO
Regression
libswiftSIMDOperators.dylib 45056 49152 +9.1% 0.92x
libswiftAppKit.dylib 77824 81920 +5.3% 0.95x
libswiftSwiftOnoneSupport.dylib 163840 172032 +5.0% 0.95x
libswiftFoundation.dylib 1523712 1576960 +3.5% 0.97x
libswiftCore.dylib 3444736 3559424 +3.3% 0.97x
libswiftStdlibUnittest.dylib 380928 393216 +3.2% 0.97x
libswiftNetwork.dylib 159744 163840 +2.6% 0.98x

How to read the data

The tables contain differences in performance which are larger than 8% and differences in code size which are larger than 1%.

If you see any unexpected regressions, you should consider fixing the
regressions before you merge the PR.

Noise: Sometimes the performance results (not code size!) contain false
alarms. Unexpected regressions which are marked with '(?)' are probably noise.
If you see regressions which you cannot explain you can try to run the
benchmarks again. If regressions still show up, please consult with the
performance team (@eeckstein).
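
Checking a few rows by hand suggests that DELTA is the percentage change from OLD to NEW and RATIO is OLD divided by NEW, so a ratio below 1.0 is a regression. The sketch below reproduces that arithmetic; the `delta_and_ratio` helper is a hypothetical name used here for illustration and is not part of the benchmark driver.

```python
def delta_and_ratio(old, new):
    """Format the DELTA and RATIO columns for one table row.

    Assumes DELTA = (NEW - OLD) / OLD and RATIO = OLD / NEW, which matches
    the rows checked below but is inferred from the tables, not documented.
    """
    delta = (new - old) / old * 100.0  # positive means slower or larger
    ratio = old / new                  # above 1.0 means an improvement
    return f"{delta:+.1f}%", f"{ratio:.2f}x"


print(delta_and_ratio(357, 421))      # Walsh, performance at -O
print(delta_and_ratio(59577, 45620))  # NopDeinit, performance at -O
print(delta_and_ratio(12661, 17528))  # SortIntPyramids.o, code size at -O
```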

Hardware Overview
  Model Name: Mac Pro
  Model Identifier: MacPro6,1
  Processor Name: 12-Core Intel Xeon E5
  Processor Speed: 2.7 GHz
  Number of Processors: 1
  Total Number of Cores: 12
  L2 Cache (per Core): 256 KB
  L3 Cache: 30 MB
  Memory: 64 GB
--------------

vedantk commented Dec 4, 2018

apple/swift-llvm#127
@swift-ci Please smoke benchmark

vedantk commented Dec 4, 2018

^ I've kicked off another smoke-benchmark run with the outlining threshold bumped up.

swift-ci commented Dec 5, 2018

Build comment file:

Performance: -O

TEST OLD NEW DELTA RATIO
Regression
Hanoi 3499 3936 +12.5% 0.89x
IterateData 1397 1552 +11.1% 0.90x
Improvement
NSStringConversion 866 592 -31.6% 1.46x
StringEqualPointerComparison 657 600 -8.7% 1.09x

Code size: -O

TEST OLD NEW DELTA RATIO
Regression
StaticArray.o 14045 15156 +7.9% 0.93x
DataBenchmarks.o 55956 56956 +1.8% 0.98x
DictionaryKeysContains.o 11815 11999 +1.6% 0.98x

Performance: -Osize

TEST OLD NEW DELTA RATIO
Regression
IterateData 1397 1668 +19.4% 0.84x
InsertCharacterEndIndex 155 167 +7.7% 0.93x
Improvement
ObjectiveCBridgeStubFromArrayOfNSString2 3815 3346 -12.3% 1.14x (?)
StringEqualPointerComparison 647 588 -9.1% 1.10x

Code size: -Osize

TEST OLD NEW DELTA RATIO
Regression
StaticArray.o 13025 14860 +14.1% 0.88x
ReversedCollections.o 11596 11820 +1.9% 0.98x
DictionaryKeysContains.o 11503 11687 +1.6% 0.98x
RomanNumbers.o 5630 5695 +1.2% 0.99x

Performance: -Onone

TEST OLD NEW DELTA RATIO
Improvement
RemoveWhereMoveInts 2720 2312 -15.0% 1.18x
RemoveWhereMoveStrings 3257 2872 -11.8% 1.13x

Code size: -swiftlibs

TEST OLD NEW DELTA RATIO
Regression
libswiftSwiftOnoneSupport.dylib 163840 172032 +5.0% 0.95x
libswiftFoundation.dylib 1523712 1544192 +1.3% 0.99x

jckarter commented Jan 15, 2019

@vedantk This is awesome! The code size hit might be worth it even at -Osize if it gives a good resident set win when paired with a cooperative linker. Part of the point of -Osize is to reduce memory usage by reducing code size, after all, and this more directly addresses that issue. Maybe there's a better way we could emit overflow traps to make them more splitting-friendly too.
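
The resident-set argument can be illustrated with a toy page-count model: if cold code sits between hot functions, each hot function may land on its own page, whereas grouping cold code at the end packs the hot code into fewer pages. This is only a sketch under stated assumptions; the page size, function sizes, and the `hot_pages` helper are hypothetical and do not describe any real linker's behavior.

```python
PAGE = 4096  # assumed page size in bytes


def hot_pages(layout):
    """Count pages that contain at least one byte of hot code.

    `layout` is a list of (size_in_bytes, is_hot) tuples in link order.
    """
    touched = set()
    offset = 0
    for size, is_hot in layout:
        if is_hot and size > 0:
            first = offset // PAGE
            last = (offset + size - 1) // PAGE
            touched.update(range(first, last + 1))
        offset += size
    return len(touched)


# Hypothetical program: 8 hot functions of 1 KiB, each with a 3 KiB cold part.
interleaved = [(1024, True), (3072, False)] * 8            # cold code next to its hot function
cold_last = [(1024, True)] * 8 + [(3072, False)] * 8       # cold code grouped at the end

print(hot_pages(interleaved))
print(hot_pages(cold_last))
```

In this made-up layout the total code size is identical in both cases; only the number of pages that hot code touches changes.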

vedantk commented Jan 15, 2019

@jckarter thanks for taking a look! I haven't taken a close look yet at how Swift emits overflow traps so I'm not sure whether that would need to change.

I should point out that there are two more issues with the experiment done in this PR: 1) the splitting pass is scheduled after inlining, and 2) it doesn't look like SimplifyCFG has a chance to run afterwards and clean up some of the mess CodeExtractor leaves behind. I think it'd be worth repeating the experiment with the pipeline fixed to get more realistic numbers.

vedantk commented Jan 15, 2019

Closing, as the sanity check I originally wanted is done.

@vedantk vedantk closed this Jan 15, 2019