Speed up folds on Sequences #510

Merged
merged 18 commits into from
Jan 25, 2018

oisdk commented Jan 23, 2018

As referenced in #504.

After noticing that foldMapWithIndex was significantly faster than foldMap, I rewrote the Foldable methods to mimic the style of foldMapWithIndex. Writing the first level of recursion out on the finger tree manually, and specialising the folds on nodes and digits manually, yields a significant speedup. From my testing, there's a ~3.8x speedup for foldMap, a ~1.8x speedup for foldl' and foldr', and an ~8x speedup for foldl and foldr.
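
For readers unfamiliar with the approach, here is a minimal sketch of what writing out the first level by hand can look like. It assumes a simplified finger tree without size annotations, and the names foldMapTop, foldMapDeeper and sumTree are made up for illustration; this is not the code from the PR.

```haskell
import Data.Monoid (Sum (..))

-- Simplified finger tree without size annotations (illustration only).
data Node a = Node2 a a | Node3 a a a
data Digit a = One a | Two a a | Three a a a | Four a a a a
data FingerTree a
  = EmptyT
  | Single a
  | Deep (Digit a) (FingerTree (Node a)) (Digit a)

-- Hand-written, non-recursive folds over the small fixed-size pieces.
foldMapDigit :: Monoid m => (a -> m) -> Digit a -> m
foldMapDigit f (One a) = f a
foldMapDigit f (Two a b) = f a <> f b
foldMapDigit f (Three a b c) = f a <> f b <> f c
foldMapDigit f (Four a b c d) = f a <> f b <> f c <> f d
{-# INLINE foldMapDigit #-}

foldMapNode :: Monoid m => (a -> m) -> Node a -> m
foldMapNode f (Node2 a b) = f a <> f b
foldMapNode f (Node3 a b c) = f a <> f b <> f c
{-# INLINE foldMapNode #-}

-- First level, written out by hand. It is not recursive, so it can be inlined.
foldMapTop :: Monoid m => (a -> m) -> FingerTree a -> m
foldMapTop _ EmptyT = mempty
foldMapTop f (Single x) = f x
foldMapTop f (Deep pr m sf) =
  foldMapDigit f pr <> foldMapDeeper (foldMapNode f) m <> foldMapDigit f sf
{-# INLINE foldMapTop #-}

-- Recursive worker for the nested levels. The signature is required because
-- the recursive call is at type FingerTree (Node (Node a)).
foldMapDeeper :: Monoid m => (Node a -> m) -> FingerTree (Node a) -> m
foldMapDeeper _ EmptyT = mempty
foldMapDeeper f (Single x) = f x
foldMapDeeper f (Deep pr m sf) =
  foldMapDigit f pr <> foldMapDeeper (foldMapNode f) m <> foldMapDigit f sf

-- Same shape as the foldMapSum benchmark, on the toy tree.
sumTree :: FingerTree Int -> Int
sumTree = getSum . foldMapTop Sum
```

With the top level non-recursive, GHC has a chance to inline it at the call site and specialise the digit and node folds to the concrete function, which is presumably where much of the speedup comes from.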

These are the functions benchmarked:

foldMapSum :: Seq.Seq Int -> Int
foldMapSum = getSum . foldMap Sum

foldlSum :: Seq.Seq Int -> Int
foldlSum xs = foldl (\k x z -> k $! z+x) id xs 0

foldlSum' :: Seq.Seq Int -> Int
foldlSum' = foldl' (+) 0

foldrSum :: Seq.Seq Int -> Int
foldrSum xs = foldr (\x k z -> k $! z+x) id xs 0

foldrSum' :: Seq.Seq Int -> Int
foldrSum' = foldr' (+) 0

And these are the results, when run on random sequences of length 500000:

benchmarking 500000/foldMapSum/new
time                 3.483 ms   (3.457 ms .. 3.504 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.429 ms   (3.398 ms .. 3.458 ms)
std dev              100.7 μs   (79.44 μs .. 153.5 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 500000/foldMapSum/old
time                 13.30 ms   (12.62 ms .. 14.20 ms)
                     0.981 R²   (0.968 R² .. 0.992 R²)
mean                 12.41 ms   (12.12 ms .. 12.85 ms)
std dev              916.9 μs   (674.8 μs .. 1.243 ms)
variance introduced by outliers: 37% (moderately inflated)

benchmarking 500000/foldrSum/new
time                 4.349 ms   (4.274 ms .. 4.437 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 3.991 ms   (3.916 ms .. 4.068 ms)
std dev              197.3 μs   (166.6 μs .. 255.0 μs)
variance introduced by outliers: 27% (moderately inflated)

benchmarking 500000/foldrSum/old
time                 37.39 ms   (36.83 ms .. 38.49 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 35.95 ms   (35.35 ms .. 36.56 ms)
std dev              1.151 ms   (598.5 μs .. 1.561 ms)

benchmarking 500000/foldrSum'/new
time                 2.725 ms   (2.695 ms .. 2.774 ms)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 2.694 ms   (2.665 ms .. 2.728 ms)
std dev              103.0 μs   (78.28 μs .. 143.5 μs)
variance introduced by outliers: 21% (moderately inflated)

benchmarking 500000/foldrSum'/old
time                 4.926 ms   (4.809 ms .. 5.116 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 4.811 ms   (4.762 ms .. 4.883 ms)
std dev              166.3 μs   (124.8 μs .. 211.4 μs)
variance introduced by outliers: 15% (moderately inflated)

benchmarking 500000/foldlSum/new
time                 5.207 ms   (5.071 ms .. 5.310 ms)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 4.672 ms   (4.571 ms .. 4.803 ms)
std dev              315.6 μs   (263.3 μs .. 363.5 μs)
variance introduced by outliers: 40% (moderately inflated)

benchmarking 500000/foldlSum/old
time                 38.99 ms   (38.09 ms .. 39.73 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 37.77 ms   (36.82 ms .. 38.26 ms)
std dev              1.307 ms   (622.3 μs .. 1.860 ms)

benchmarking 500000/foldlSum'/new
time                 2.452 ms   (2.413 ms .. 2.487 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 2.359 ms   (2.337 ms .. 2.379 ms)
std dev              64.66 μs   (54.28 μs .. 77.18 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 500000/foldlSum'/old
time                 4.418 ms   (4.273 ms .. 4.548 ms)
                     0.996 R²   (0.994 R² .. 0.999 R²)
mean                 4.188 ms   (4.147 ms .. 4.244 ms)
std dev              139.5 μs   (102.4 μs .. 182.2 μs)
variance introduced by outliers: 16% (moderately inflated)

For reference, the foldMapWithIndex function could be used to sum as well:

foldMapWithIndexSum :: Seq.Seq Int -> Int
foldMapWithIndexSum = getSum . Seq.foldMapWithIndex (const Sum)

And these are its times (unchanged between old/new):

benchmarking 500000/foldMapWithIndexSum
time                 4.476 ms   (4.407 ms .. 4.532 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 4.351 ms   (4.312 ms .. 4.404 ms)
std dev              129.5 μs   (106.3 μs .. 183.3 μs)
variance introduced by outliers: 12% (moderately inflated)

oisdk commented Jan 23, 2018

(oh, and I noticed a typo I had added in the sequence benchmark file)

treeowl commented Jan 23, 2018 via email

oisdk commented Jan 23, 2018

foldl and foldr go back to their original speeds (pretty much exactly) on the default definition, and the default foldr' and foldl' are actually slower than what was there before.

treeowl left a review comment:

This is much better. I really like the fact that the Seq and Elem business stays out of the FingerTree code. There are a few other places where I ended up mashing those together because I couldn't find a better way. If you think we can unmash them, it would be really awesome to split off an entire Data.Sequence.Internal.FingerTree module. I suspect doing so may reduce the amount of recompilation we have to do to run the test suite.

(.#) f _ = coerce f
#else
(.#) :: (b -> c) -> (a -> b) -> a -> c
(.#) f g = \x -> f (g x)

treeowl commented on the lines above:

This can be written (.#) = (.). Should we add a hidden Coercions module in Utils for this and (#.)?

oisdk replied:

I think that'd be a good idea. Another candidate for it would be liftLeftFold (or something, I'm not sure of the convention), which could be used instead of lift_elem.
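
A rough sketch of what such a Coercions module might contain, assuming a GHC recent enough that a given Coercible constraint can be used in either direction. The operator definitions follow the fragment quoted above; sumList and wrapThenAdd are hypothetical usage examples, not part of the PR.

```haskell
import Data.Coerce (Coercible, coerce)
import Data.Monoid (Sum (..))

infixr 9 #.
infixr 9 .#

-- Compose with a function that is only a newtype (un)wrapper on the left.
-- The wrapper is ignored and the other function is coerced instead.
(#.) :: Coercible b c => (b -> c) -> (a -> b) -> a -> c
(#.) _ = coerce

-- Same idea, with the newtype (un)wrapper on the right.
(.#) :: Coercible a b => (b -> c) -> (a -> b) -> a -> c
(.#) f _ = coerce f

-- Hypothetical usage: getSum is never actually called.
sumList :: [Int] -> Int
sumList = getSum #. foldMap Sum

-- Hypothetical usage: the Sum constructor is never actually applied.
wrapThenAdd :: Int -> Sum Int
wrapThenAdd = (<> Sum 1) .# Sum
```

The point of these operators over plain (.) is that the newtype wrapper never appears as a real function call, so there is no extra closure for the optimiser to get rid of.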

treeowl commented Jan 23, 2018

Any sense of why the old versions of foldr and foldl were better for implementing foldl' and foldr' than the new and improved ones? How much better were they?

treeowl commented Jan 23, 2018

I'm not sure your foldl and foldr benchmarks are really the best ones to use. Sums should surely use strict folds. foldl and foldr are mainly for

  1. Converting to lists or list-like structures and/or
  2. Folding with one or more strict accumulators.

I want things like this to be efficient, if possible:

f x0 y0 z0 as = foldr go stop as x0 y0 z0 where
  go a r !x !y !z = ...
  stop !x !y !z = ...

g x0 as = foldr go stop as x0 where
  go a r !x = let !(a',x') = p a x in a' : r x'
  stop !x = []
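
To make those two shapes concrete, here is a small sketch. stats and runningTotals are hypothetical examples, not functions from containers: the first threads three strict accumulators through foldr, the second threads one accumulator while producing a lazy list.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Sequence as Seq

-- Count, sum and maximum in a single pass, with three strict accumulators.
-- The fold builds a function that is then applied to the initial values.
stats :: Seq.Seq Int -> (Int, Int, Int)
stats as = foldr go stop as 0 0 minBound
  where
    go a r !n !s !m = r (n + 1) (s + a) (max m a)
    stop !n !s !m = (n, s, m)

-- Running totals: one strict accumulator, output produced lazily as a list.
runningTotals :: Seq.Seq Int -> [Int]
runningTotals as = foldr go stop as 0
  where
    go a r !acc = let !acc' = acc + a in acc' : r acc'
    stop _ = []
```

For example, stats (Seq.fromList [3, 1, 4]) gives (3, 8, 4) and runningTotals (Seq.fromList [1, 2, 3]) gives [1, 3, 6].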

oisdk commented Jan 23, 2018

The old foldl' and foldr' were written manually, whereas the default implementations call into foldr and foldl. I haven't benchmarked what the default implementations would do on top of the old foldl and foldr, but I'd imagine it's similar to the current measurements for foldlSum and foldrSum, as I pretty much lifted those functions from the default definitions of foldr' and foldl'. (The default foldl' and foldr' were ~15ms.)
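
For context, the default strict folds have roughly the following shape. This is a paraphrase of the defaults in base's Data.Foldable as best I recall them; the names strictLeftViaFoldr, strictRightViaFoldl and agreesWithLibrary are invented for this sketch.

```haskell
import Data.Foldable (foldl', foldr')
import qualified Data.Sequence as Seq

-- Strict left fold expressed through the lazy right fold: foldr builds a
-- chain of continuations, each forcing the accumulator before passing it on.
strictLeftViaFoldr :: Foldable t => (b -> a -> b) -> b -> t a -> b
strictLeftViaFoldr f z0 xs = foldr step id xs z0
  where
    step x k z = k $! f z x

-- Strict right fold expressed through the lazy left fold, symmetrically.
strictRightViaFoldl :: Foldable t => (a -> b -> b) -> b -> t a -> b
strictRightViaFoldl f z0 xs = foldl step id xs z0
  where
    step k x z = k $! f x z

-- Sanity check against the library versions, using a non-commutative operator.
agreesWithLibrary :: Seq.Seq Int -> Bool
agreesWithLibrary xs =
  strictLeftViaFoldr (-) 0 xs == foldl' (-) 0 xs
    && strictRightViaFoldl (-) 0 xs == foldr' (-) 0 xs
```

Read this way, the foldlSum benchmark has the same shape as strictRightViaFoldl and foldrSum the same shape as strictLeftViaFoldr, so they measure how well the lazy folds cope with continuation-passing callers.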

treeowl commented Jan 23, 2018

Oh, sorry, my memory was out of date. We used to use the defaults for foldl' and foldr'.

treeowl commented Jan 23, 2018

FYI: two mashed places that come to mind are splitMap (used to implement zipWith and chunksOf) and the Applicative machinery. I don't know if there's a way to modify aptyMiddle or any of the rest of that to make it sensible in any other context, but it might be worth thinking about.

oisdk commented Jan 23, 2018

I'll look for a good candidate for multiple strict accumulators for foldr, but in the meantime toList has the following improvement:

benchmarking 500000/toList/new
time                 14.10 ms   (13.87 ms .. 14.34 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 13.67 ms   (13.51 ms .. 13.86 ms)
std dev              456.7 μs   (346.3 μs .. 629.7 μs)
variance introduced by outliers: 11% (moderately inflated)
             
benchmarking 500000/toList/old
time                 18.96 ms   (18.56 ms .. 19.35 ms)
                     0.997 R²   (0.995 R² .. 0.999 R²)
mean                 18.17 ms   (17.73 ms .. 18.51 ms)
std dev              902.7 μs   (499.3 μs .. 1.552 ms)
variance introduced by outliers: 18% (moderately inflated)

treeowl commented Jan 23, 2018

GHC 7.8 needs a different implementation of (.#) or maybe a different type signature for it. Check what Data.Profunctor.Unsafe does.

treeowl commented Jan 23, 2018

Actually, we don't need to be as fancy as Data.Profunctor.Unsafe, because we know what types we're coercing. Just swap the arguments to Coercible in the type signature.

oisdk commented Jan 23, 2018

So I've taken some of the "FB" forms of functions from Data.List:

foldrTake :: Int -> Seq.Seq Int -> [Int]
foldrTake n xs = foldr (\x xs m -> case m of 1 -> [x]; _ -> x : xs (m-1)) (const []) xs n

foldrScanl :: Seq.Seq Int -> [Int]
foldrScanl bs = 0 : foldr (\b g -> oneShot (\x -> let !b' = x + b in b' : g b')) (const []) bs 0
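
A self-contained variant of these two functions, with the imports and extension they need, in case anyone wants to try them outside the benchmark file. guardedTake and runningSums are names invented here. Note that foldrTake as written above returns the whole sequence when n is zero or negative, because the counter never reaches 1; the guard in guardedTake is an addition of this sketch that makes it agree with take.

```haskell
{-# LANGUAGE BangPatterns #-}
import GHC.Exts (oneShot)
import qualified Data.Sequence as Seq

-- take implemented with foldr and a countdown argument.
-- The n <= 0 guard is not in the benchmarked version.
guardedTake :: Int -> Seq.Seq Int -> [Int]
guardedTake n xs
  | n <= 0 = []
  | otherwise = foldr step (const []) xs n
  where
    step x rest m = case m of
      1 -> [x]
      _ -> x : rest (m - 1)

-- scanl (+) 0 implemented with foldr and a strict running total.
-- oneShot hints to GHC that each lambda is applied at most once.
runningSums :: Seq.Seq Int -> [Int]
runningSums bs = 0 : foldr step (const []) bs 0
  where
    step b g = oneShot (\x -> let !b' = x + b in b' : g b')
```

Both functions pass an extra argument through the fold, so they exercise the same continuation-heavy path as the foldrSum benchmark.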

And here are the results:

benchmarking 500000/foldrScanl/new
time                 13.35 ms   (12.04 ms .. 14.71 ms)
                     0.963 R²   (0.933 R² .. 0.987 R²)
mean                 14.09 ms   (13.62 ms .. 14.76 ms)
std dev              1.413 ms   (1.070 ms .. 1.841 ms)
variance introduced by outliers: 48% (moderately inflated)
             
benchmarking 500000/foldrScanl/old
time                 55.00 ms   (52.27 ms .. 58.46 ms)
                     0.993 R²   (0.985 R² .. 0.997 R²)
mean                 47.31 ms   (44.97 ms .. 49.78 ms)
std dev              4.565 ms   (3.825 ms .. 5.301 ms)
variance introduced by outliers: 37% (moderately inflated)
             
benchmarking 500000/foldrTake/new
time                 6.626 ms   (6.199 ms .. 7.081 ms)
                     0.952 R²   (0.911 R² .. 0.980 R²)
mean                 6.061 ms   (5.839 ms .. 6.432 ms)
std dev              871.2 μs   (629.0 μs .. 1.357 ms)
variance introduced by outliers: 73% (severely inflated)
             
benchmarking 500000/foldrTake/old
time                 23.09 ms   (21.65 ms .. 25.00 ms)
                     0.981 R²   (0.960 R² .. 0.997 R²)
mean                 22.55 ms   (21.52 ms .. 23.37 ms)
std dev              1.999 ms   (1.438 ms .. 2.884 ms)
variance introduced by outliers: 35% (moderately inflated)

treeowl commented Jan 23, 2018 via email

oisdk commented Jan 24, 2018

Just a couple notes on this pull request:

  • This includes the changes to the strictness of foldr' and foldl' in the initial accumulator. Removing those changes isn't a problem (it doesn't affect performance, as far as I can tell), and it might be better to put those in their own pull request (and add some tests on them also).

  • From some initial benchmarks, it looks like other functions were also suffering from the lack of inlining and specialisation that was affecting the folds. In particular, traverse: a simplistic implementation (using the new foldr):

    traverse f = foldr (liftA2 (<|) . f) (pure empty)

    looks like it's slightly faster than the current one.

  • I would like to split out the finger tree stuff from the rest of it also, somehow (I think it would be cool to be able to use it like Data.FingerTree), although I'm not at all familiar with the Applicative and splitMap code yet.
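
The foldr-based traverse from the second note can be tried in isolation. In this sketch traverseViaFoldr and doubled are hypothetical names, and it uses the public Data.Sequence API rather than the internal module.

```haskell
import Control.Applicative (liftA2)
import Data.Sequence (Seq, empty, (<|))
import qualified Data.Sequence as Seq

-- Rebuild the sequence from the right, consing each effectful result
-- onto the front. Effects still run left to right.
traverseViaFoldr :: Applicative f => (a -> f b) -> Seq a -> f (Seq b)
traverseViaFoldr f = foldr (liftA2 (<|) . f) (pure empty)

-- Hypothetical usage with Maybe as the applicative.
doubled :: Maybe (Seq Int)
doubled = traverseViaFoldr (Just . (* 2)) (Seq.fromList [1, 2, 3])
```

Whether this beats the existing traverse will depend on the applicative and on the sequence size, so it would need its own benchmarks.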

treeowl commented Jan 24, 2018 via email

oisdk commented Jan 24, 2018

Sounds good! I'll change it back to the old strictness and add tests for it.

treeowl merged commit a4b7392 into haskell:master on Jan 25, 2018

treeowl commented Jan 25, 2018

Sweet.

oisdk deleted the sequence-foldMap-perf branch on January 25, 2018 at 17:05
meooow25 mentioned this pull request on Sep 7, 2024