Speed up folds on Sequences #510

oisdk · 2018-01-23T20:10:29Z

As referenced in #504.

After noticing that foldMapWithIndex was significantly faster than foldMap, I rewrote the Foldable methods to mimic the style of foldMapWithIndex. Writing the first level of recursion out on the finger tree manually, and specialising the folds on nodes and digits manually, yields a significant speedup. From my testing, there's a ~3.8x speedup for foldMap, a ~1.8x speedup for foldl' and foldr', and an ~8x speedup for foldl and foldr.
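To make the technique concrete, here is a minimal sketch of the idea on a deliberately simplified finger tree. It is not the containers code: the types below drop the size annotations and the Elem wrapper, and the names foldMapTree, foldMapDigit and foldMapNode are chosen for illustration. The top level is written out by hand, and the deeper levels are handled by a separate worker that only ever sees trees of nodes.

```haskell
-- Simplified finger tree (assumption: no size caching, no Elem wrapper).
data Node a = Node2 a a | Node3 a a a

data Digit a = One a | Two a a | Three a a a | Four a a a a

data FingerTree a
  = EmptyT
  | Single a
  | Deep (Digit a) (FingerTree (Node a)) (Digit a)

-- The first level of the recursion is written out by hand; only the
-- nested levels go through the polymorphically recursive worker.
foldMapTree :: Monoid m => (a -> m) -> FingerTree a -> m
foldMapTree _ EmptyT = mempty
foldMapTree f (Single x) = f x
foldMapTree f (Deep pr mid sf) =
  foldMapDigit f pr <> go (foldMapNode f) mid <> foldMapDigit f sf
  where
    -- Worker for the nested levels. The signature is required because
    -- the element type changes at every level (polymorphic recursion).
    go :: Monoid m => (Node b -> m) -> FingerTree (Node b) -> m
    go _ EmptyT = mempty
    go g (Single n) = g n
    go g (Deep p t s) =
      foldMapDigit g p <> go (foldMapNode g) t <> foldMapDigit g s

-- Folds on digits and nodes spelled out per constructor rather than
-- going through a generic Foldable instance.
foldMapDigit :: Monoid m => (a -> m) -> Digit a -> m
foldMapDigit f (One a) = f a
foldMapDigit f (Two a b) = f a <> f b
foldMapDigit f (Three a b c) = f a <> f b <> f c
foldMapDigit f (Four a b c d) = f a <> f b <> f c <> f d

foldMapNode :: Monoid m => (a -> m) -> Node a -> m
foldMapNode f (Node2 a b) = f a <> f b
foldMapNode f (Node3 a b c) = f a <> f b <> f c
```

Whether this shape is faster depends on what GHC's optimiser does with it; the measurements in this thread are for the real containers implementation, not for this sketch.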

These are the functions benchmarked:

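import Data.Foldable (foldl', foldr')
import Data.Monoid (Sum (..))
import qualified Data.Sequence as Seq

foldMapSum :: Seq.Seq Int -> Int
foldMapSum = getSum . foldMap Sum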

foldlSum :: Seq.Seq Int -> Int
foldlSum xs = foldl (\k x z -> k $! z+x) id xs 0

foldlSum' :: Seq.Seq Int -> Int
foldlSum' = foldl' (+) 0

foldrSum :: Seq.Seq Int -> Int
foldrSum xs = foldr (\x k z -> k $! z+x) id xs 0

foldrSum' :: Seq.Seq Int -> Int
foldrSum' = foldr' (+) 0

And these are the results, when run on random sequences of length 500000:

benchmarking 500000/foldMapSum/new
time                 3.483 ms   (3.457 ms .. 3.504 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.429 ms   (3.398 ms .. 3.458 ms)
std dev              100.7 μs   (79.44 μs .. 153.5 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 500000/foldMapSum/old
time                 13.30 ms   (12.62 ms .. 14.20 ms)
                     0.981 R²   (0.968 R² .. 0.992 R²)
mean                 12.41 ms   (12.12 ms .. 12.85 ms)
std dev              916.9 μs   (674.8 μs .. 1.243 ms)
variance introduced by outliers: 37% (moderately inflated)

benchmarking 500000/foldrSum/new
time                 4.349 ms   (4.274 ms .. 4.437 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 3.991 ms   (3.916 ms .. 4.068 ms)
std dev              197.3 μs   (166.6 μs .. 255.0 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarking 500000/foldrSum/old
time                 37.39 ms   (36.83 ms .. 38.49 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 35.95 ms   (35.35 ms .. 36.56 ms)
std dev              1.151 ms   (598.5 μs .. 1.561 ms)

benchmarking 500000/foldrSum'/new
time                 2.725 ms   (2.695 ms .. 2.774 ms)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 2.694 ms   (2.665 ms .. 2.728 ms)
std dev              103.0 μs   (78.28 μs .. 143.5 μs)
variance introduced by outliers: 21% (moderately inflated)

benchmarking 500000/foldrSum'/old
time                 4.926 ms   (4.809 ms .. 5.116 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 4.811 ms   (4.762 ms .. 4.883 ms)
std dev              166.3 μs   (124.8 μs .. 211.4 μs)
variance introduced by outliers: 15% (moderately inflated)

benchmarking 500000/foldlSum/new
time                 5.207 ms   (5.071 ms .. 5.310 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 4.672 ms   (4.571 ms .. 4.803 ms)
std dev              315.6 μs   (263.3 μs .. 363.5 μs)
variance introduced by outliers: 40% (moderately inflated)

benchmarking 500000/foldlSum/old
time                 38.99 ms   (38.09 ms .. 39.73 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 37.77 ms   (36.82 ms .. 38.26 ms)
std dev              1.307 ms   (622.3 μs .. 1.860 ms)

benchmarking 500000/foldlSum'/new
time                 2.452 ms   (2.413 ms .. 2.487 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 2.359 ms   (2.337 ms .. 2.379 ms)
std dev              64.66 μs   (54.28 μs .. 77.18 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 500000/foldlSum'/old
time                 4.418 ms   (4.273 ms .. 4.548 ms)
                     0.996 R²   (0.994 R² .. 0.999 R²)
mean                 4.188 ms   (4.147 ms .. 4.244 ms)
std dev              139.5 μs   (102.4 μs .. 182.2 μs)
variance introduced by outliers: 16% (moderately inflated)

For reference, the foldMapWithIndex function could be used to sum as well:

foldMapWithIndexSum :: Seq.Seq Int -> Int
foldMapWithIndexSum = getSum . Seq.foldMapWithIndex (const Sum)

And these are its times (unchanged between old/new):

benchmarking 500000/foldMapWithIndexSum
time                 4.476 ms   (4.407 ms .. 4.532 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 4.351 ms   (4.312 ms .. 4.404 ms)
std dev              129.5 μs   (106.3 μs .. 183.3 μs)
variance introduced by outliers: 12% (moderately inflated)

oisdk · 2018-01-23T20:10:58Z

(oh, and I noticed a typo I had added in the sequence benchmark file)

treeowl · 2018-01-23T20:23:29Z

Just a couple more things to check: 1. How are the foldl and foldr default definitions using your new foldMap? 2. How are the foldl' and foldr' default definitions using your new foldr and foldl? I don't imagine we'll be able to use the defaults, but it's worth a shot to keep the already-enormous source code size down.

oisdk · 2018-01-23T20:36:43Z

foldl and foldr go back to their original speeds (pretty much exactly) on the default definition, and the default foldr' and foldl' are actually slower than what was there before.
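For readers following along, the "default definitions" being discussed are the ones Data.Foldable supplies when an instance does not override a method. The sketch below restates them as standalone functions; the defaultFold* names are made up for this illustration, and base itself uses a coercion operator where this sketch uses plain composition.

```haskell
import Data.Monoid (Dual (..), Endo (..))

-- foldr and foldl expressed through foldMap.
defaultFoldr :: Foldable t => (a -> b -> b) -> b -> t a -> b
defaultFoldr f z t = appEndo (foldMap (Endo . f) t) z

defaultFoldl :: Foldable t => (b -> a -> b) -> b -> t a -> b
defaultFoldl f z t = appEndo (getDual (foldMap (Dual . Endo . flip f) t)) z

-- The strict folds expressed through the opposite lazy fold, building a
-- chain of continuations that forces the accumulator at each step.
defaultFoldl' :: Foldable t => (b -> a -> b) -> b -> t a -> b
defaultFoldl' f z0 xs = foldr f' id xs z0
  where
    f' x k z = k $! f z x

defaultFoldr' :: Foldable t => (a -> b -> b) -> b -> t a -> b
defaultFoldr' f z0 xs = foldl f' id xs z0
  where
    f' k x z = k $! f x z
```

Read this way, question 1 asks how fast defaultFoldr and defaultFoldl are on top of the new foldMap, and question 2 asks the same for defaultFoldl' and defaultFoldr' on top of the new foldr and foldl.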

treeowl

This is much better. I really like the fact that the Seq and Elem business stays out of the FingerTree code. There are a few other places where I ended up mashing those together because I couldn't find a better way. If you think we can unmash them, it would be really awesome to split off an entire Data.Sequence.Internal.FingerTree module. I suspect doing so may reduce the amount of recompilation we have to do to run the test suite.

treeowl · 2018-01-23T20:25:45Z

Data/Sequence/Internal.hs

+(.#) f _ = coerce f
+#else
+(.#) :: (b -> c) -> (a -> b) -> a -> c
+(.#) f g = \x -> f (g x)


This can be written (.#) = (.). Should we add a hidden Coercions module in Utils for this and (#.)?

oisdk

I think that'd be a good idea. Another candidate for it would be liftLeftFold (or something, I'm not sure of the convention), which could be used instead of lift_elem.
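As a rough illustration of what such a module could contain, here is a sketch of the two operators for a GHC with Data.Coerce. The signatures are one common formulation and should be treated as an assumption rather than as the exact text of this PR; fixity declarations are left out on purpose.

```haskell
import Data.Coerce (Coercible, coerce)
import Data.Monoid (Sum (..))

-- Compose with a function that is only a newtype wrapper or unwrapper
-- on the left: the first argument is ignored and replaced by a coercion.
(#.) :: Coercible b c => (b -> c) -> (a -> b) -> a -> c
(#.) _ = coerce

-- The same idea with the wrapper or unwrapper on the right.
(.#) :: Coercible a b => (b -> c) -> (a -> b) -> a -> c
(.#) f _ = coerce f

-- Usage: getSum is replaced by a coercion instead of being composed.
sumList :: [Int] -> Int
sumList = getSum #. foldMap Sum

-- Usage: getSum on the right is likewise replaced by a coercion.
doubleSum :: Sum Int -> Int
doubleSum = (* 2) .# getSum
```

On compilers without Coercible, both operators fall back to ordinary composition, which is what the #else branch in the diff above does.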

treeowl · 2018-01-23T20:43:06Z

Any sense of why the old versions of foldr and foldl were better for implementing foldl' and foldr' than the new and improved ones? How much better were they?

treeowl · 2018-01-23T20:51:14Z

I'm not sure your foldl and foldr benchmarks are really the best ones to use. Sums should surely use strict folds. foldl and foldr are mainly for

1. Converting to lists or list-like structures and/or
2. Folding with one or more strict accumulators.

I want things like this to be efficient, if possible:

f x0 y0 z0 as = foldr go stop as x0 y0 z0 where
  go a r !x !y !z = ...
  stop !x !y !z = ...

g x0 as = foldr go stop as x0 where
  go a r !x = let !(a',x') = p a x in a' : r x'
  stop !x = []
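Two concrete instances of these shapes may help. They are illustrative only: sumAndLength and runningTotals are hypothetical names, and the bodies fill in the elided parts of f and g with the simplest choices I could think of.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Sequence as Seq

-- Shape of f: a foldr that threads two strict accumulators from left
-- to right and produces the result only at the end.
sumAndLength :: Seq.Seq Int -> (Int, Int)
sumAndLength as = foldr go stop as 0 0
  where
    go a r !s !n = r (s + a) (n + 1)
    stop !s !n = (s, n)

-- Shape of g: a foldr that threads one strict accumulator while lazily
-- producing a list of running totals.
runningTotals :: Int -> Seq.Seq Int -> [Int]
runningTotals x0 as = foldr go stop as x0
  where
    go a r !x = let !x' = x + a in x' : r x'
    stop !_ = []
```

For example, sumAndLength (Seq.fromList [1,2,3]) evaluates to (6,3), and runningTotals 0 (Seq.fromList [1,2,3]) evaluates to [1,3,6].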

oisdk · 2018-01-23T20:56:28Z

The old foldl' and foldr' were written manually, whereas the default implementations call into foldr and foldl. I haven't benchmarked what the default implementations would do on top of the old foldl and foldr, but I'd imagine it's similar to the current measurements for foldlSum and foldrSum, as I pretty much lifted those functions from the default definitions of foldr' and foldl'. (The default foldl' and foldr' were ~15ms.)

treeowl · 2018-01-23T20:59:03Z

Oh, sorry, my memory was out of date. We used to use the defaults for foldl' and foldr'.

treeowl · 2018-01-23T21:04:11Z

FYI: two mashed places that come to mind are splitMap (used to implement zipWith and chunksOf) and the Applicative machinery. I don't know if there's a way to modify aptyMiddle or any of the rest of that to make it sensible in any other context, but it might be worth thinking about.

oisdk · 2018-01-23T21:14:18Z

I'll look for a good candidate for multiple strict accumulators for foldr, but in the meantime toList has the following improvement:

benchmarking 500000/toList/new
time                 14.10 ms   (13.87 ms .. 14.34 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 13.67 ms   (13.51 ms .. 13.86 ms)
std dev              456.7 μs   (346.3 μs .. 629.7 μs)
variance introduced by outliers: 11% (moderately inflated)
             
benchmarking 500000/toList/old
time                 18.96 ms   (18.56 ms .. 19.35 ms)
                     0.997 R²   (0.995 R² .. 0.999 R²)
mean                 18.17 ms   (17.73 ms .. 18.51 ms)
std dev              902.7 μs   (499.3 μs .. 1.552 ms)
variance introduced by outliers: 18% (moderately inflated)

treeowl · 2018-01-23T21:15:19Z

GHC 7.8 needs a different implementation of (.#) or maybe a different type signature for it. Check what Data.Profunctor.Unsafe does.

treeowl · 2018-01-23T21:16:32Z

Actually, we don't need to be as fancy as Data.Profunctor.Unsafe, because we know what types we're coercing. Just swap the arguments to Coercible in the type signature.

oisdk · 2018-01-23T21:45:48Z

So I've taken some of the "FB" forms of functions from Data.List:

foldrTake :: Int -> Seq.Seq Int -> [Int]
foldrTake n xs = foldr (\x xs m -> case m of 1 -> [x]; _ -> x : xs (m-1)) (const []) xs n

foldrScanl :: Seq.Seq Int -> [Int]
foldrScanl bs = 0 : foldr (\b g -> oneShot (\x -> let !b' = x + b in b' : g b')) (const []) bs 0

And here are the results:

benchmarking 500000/foldrScanl/new
time                 13.35 ms   (12.04 ms .. 14.71 ms)
                     0.963 R²   (0.933 R² .. 0.987 R²)
mean                 14.09 ms   (13.62 ms .. 14.76 ms)
std dev              1.413 ms   (1.070 ms .. 1.841 ms)
variance introduced by outliers: 48% (moderately inflated)
             
benchmarking 500000/foldrScanl/old
time                 55.00 ms   (52.27 ms .. 58.46 ms)
                     0.993 R²   (0.985 R² .. 0.997 R²)
mean                 47.31 ms   (44.97 ms .. 49.78 ms)
std dev              4.565 ms   (3.825 ms .. 5.301 ms)
variance introduced by outliers: 37% (moderately inflated)
             
benchmarking 500000/foldrTake/new
time                 6.626 ms   (6.199 ms .. 7.081 ms)
                     0.952 R²   (0.911 R² .. 0.980 R²)
mean                 6.061 ms   (5.839 ms .. 6.432 ms)
std dev              871.2 μs   (629.0 μs .. 1.357 ms)
variance introduced by outliers: 73% (severely inflated)
             
benchmarking 500000/foldrTake/old
time                 23.09 ms   (21.65 ms .. 25.00 ms)
                     0.981 R²   (0.960 R² .. 0.997 R²)
mean                 22.55 ms   (21.52 ms .. 23.37 ms)
std dev              1.999 ms   (1.438 ms .. 2.884 ms)
variance introduced by outliers: 35% (moderately inflated)
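For anyone wanting to reproduce these two benchmarks, here is a self-contained version of foldrTake and foldrScanl with the pragma and imports they need, plus two reference functions (takeReference and scanlReference, names invented here) that state what each is expected to compute. The inner binder in foldrTake is renamed to rest to avoid shadowing xs; otherwise the definitions are as posted above.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Foldable (toList)
import qualified Data.Sequence as Seq
import GHC.Exts (oneShot)

-- take written as a foldr that passes the remaining count along.
foldrTake :: Int -> Seq.Seq Int -> [Int]
foldrTake n xs =
  foldr (\x rest m -> case m of 1 -> [x]; _ -> x : rest (m - 1)) (const []) xs n

-- scanl (+) 0 written as a foldr that passes the running sum along.
foldrScanl :: Seq.Seq Int -> [Int]
foldrScanl bs =
  0 : foldr (\b g -> oneShot (\x -> let !b' = x + b in b' : g b')) (const []) bs 0

-- Reference versions via an intermediate list.
takeReference :: Int -> Seq.Seq Int -> [Int]
takeReference n = take n . toList

scanlReference :: Seq.Seq Int -> [Int]
scanlReference = scanl (+) 0 . toList
```

Note that foldrTake agrees with take only for counts of at least 1: with a count of 0 the counter never reaches 1, so every element is returned.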

treeowl · 2018-01-23T21:56:47Z

All right. I'll give this one more look later and merge. Great work.

oisdk · 2018-01-24T18:55:23Z

Just a couple notes on this pull request:

- This includes the changes to the strictness of foldr' and foldl' in the initial accumulator. Removing those changes isn't a problem (it doesn't affect performance, as far as I can tell), and it might be better to put those in their own pull request (and add some tests on them also).
- From some initial benchmarks, it looks like other functions were also suffering from the lack of inlining and specialisation that was affecting the folds. In particular, traverse: a simplistic implementation (using the new foldr):
  ```
  traverse f = foldr (liftA2 (<|) . f) (pure empty)
  ```
  looks like it's slightly faster than the current one.
- I would like to split out the finger tree stuff from the rest of it also, somehow (I think it would be cool to be able to use it like Data.FingerTree), although I'm not at all familiar with the Applicative and splitMap code yet.
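The one-line traverse above can be tried outside the library as a standalone function. This is only a sketch for experimentation: traverseViaFoldr is a hypothetical name, and it runs on whatever foldr the installed containers provides, so it says nothing by itself about the timings mentioned here.

```haskell
import Control.Applicative (liftA2)
import Data.Sequence (Seq, empty, (<|))
import qualified Data.Sequence as Seq

-- Rebuild the sequence from the right, consing each effectful result
-- onto the front of the already traversed tail.
traverseViaFoldr :: Applicative f => (a -> f b) -> Seq a -> f (Seq b)
traverseViaFoldr f = foldr (liftA2 (<|) . f) (pure empty)
```

Effects are still sequenced left to right, because liftA2 runs the effect of each element before the effects of the tail.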

treeowl · 2018-01-24T18:57:54Z

Let's leave the strictness alone for right now. That needs a separate PR, and probably also discussion on the libraries list. And likely a major version bump. Blech.

oisdk · 2018-01-24T19:00:26Z

Sounds good! I'll change it to the old strictness, and add tests for it

treeowl · 2018-01-25T13:53:55Z

Sweet.

oisdk and others added 10 commits January 20, 2018 21:16

Merge pull request #1 from haskell/master

b0d4c63

merge

Merge pull request #2 from haskell/master

bdf22fb

merge

much faster foldMap

2370e0c

much faster foldl'

2310306

pushed faster foldl' down into fingertree rather than seq

d2e275a

push foldMap improvements down into fingertree

c740b07

much quicker foldr

d46a119

strictness matching list

4650953

more folds optimised

81f23da

name shadowing warnings

d4fd071

treeowl reviewed Jan 23, 2018

put coercions in their own module

fc4b415

oisdk added 5 commits January 23, 2018 22:24

added hide pragma to coercions module

6f69209

consistent naming

7e249f1

removed coercion that was causing trouble on GHC 7.8. (it's not being used anyway)

580556f

updated fixity of (.#) to match Data.Profunctor.Unsafe

c295491

added coercion operator that can be used in foldl

f999e9a

oisdk added 2 commits January 24, 2018 19:04

back to the old strictness

0117e13

Added tests for the laziness of foldr' and foldl'

88a2c75

treeowl merged commit a4b7392 into haskell:master Jan 25, 2018

oisdk deleted the sequence-foldMap-perf branch January 25, 2018 17:05

treeowl mentioned this pull request Jan 29, 2018

Speed up foldMap for sequences #504

Closed

meooow25 mentioned this pull request Sep 7, 2024

Faster Eq and Ord #1016

Closed
