List Functor: mix unrolled and reverse map #135


Closed · 9 commits

Conversation

@matthewleon (Contributor)

Addresses #131

The chunk sizes (5 for the initial list segment, 3 for the tail-recursive
remainder) were arrived at through benchmarked experimentation, mapping a
simple (_ + 1) over lists of various sizes.

Relevant figures:
list of 1000 elems:   142.61 μs -> 36.97 μs
list of 2000 elems:   275.17 μs -> 55.33 μs
list of 10000 elems:  912.73 μs -> 208.39 μs
list of 100000 elems: 34.56 ms  -> 1.24 ms

The ~30x speed increase for long lists is probably explained by the lack
of GC thrashing with this approach.
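
For orientation, a minimal sketch of the overall shape (simplified; the real diff differs in clause details, but the names startUnrolledMap, chunkedRevMap, and unrollLimit match the code under review):

module ChunkedMapSketch where

import Data.List (List(..), (:), reverse)

listMap :: forall a b. (a -> b) -> List a -> List b
listMap f = startUnrolledMap unrollLimit
  where
  -- cap on the non-tail-recursive phase, keeping stack use bounded
  unrollLimit :: Int
  unrollLimit = 1000

  -- unrolled, non-tail-recursive prefix: five conses per stack frame
  startUnrolledMap :: Int -> List a -> List b
  startUnrolledMap 0 xs = chunkedRevMap Nil xs
  startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
    f x1 : f x2 : f x3 : f x4 : f x5 : startUnrolledMap (n - 1) xs
  startUnrolledMap _ (x : xs) = f x : startUnrolledMap 0 xs
  startUnrolledMap _ Nil = Nil

  -- tail-recursive remainder: map three elements per step onto an
  -- accumulator (building the result reversed), then reverse once
  chunkedRevMap :: List b -> List a -> List b
  chunkedRevMap acc (x1 : x2 : x3 : xs) =
    chunkedRevMap (f x3 : f x2 : f x1 : acc) xs
  chunkedRevMap acc (x1 : x2 : Nil) = reverse (f x2 : f x1 : acc)
  chunkedRevMap acc (x1 : Nil) = reverse (f x1 : acc)
  chunkedRevMap acc Nil = reverse acc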

@matthewleon (Contributor, Author)

Will update this with details on my machine setup.

matthewleon force-pushed the fast-functor branch 2 times, most recently from f6b1ae7 to 550e096 on November 21, 2017 21:43
-- chunked list Functor inspired by OCaml
-- https://discuss.ocaml.org/t/a-new-list-map-that-is-both-stack-safe-and-fast/865
-- chunk sizes determined through experimentation
listMap :: forall a b. (a -> b) -> List a -> List b
Contributor:
This is exported right now, is that deliberate?

Contributor (Author):

No. It isn't. Good catch.

startUnrolledMap :: Int -> List a -> List b
startUnrolledMap 0 (x : xs) = f x : chunkedRevMap xs
startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
f x1 : f x2 : f x3 : f x4 : f x5 : startUnrolledMap (n - 1) xs
Contributor:

I'm not sure if OCaml has tail calls modulo cons, but isn't this going to lead to a stack overflow for us?

Contributor (Author):

This is why I've limited the "unrolled map" to 1000 iterations. It should avoid stack overflow. Are you suggesting that there is a way to make it stack overflow despite that limit?

Contributor:

I guess not, unless f was already on the verge of overflowing the stack. It can only make an existing problem a finite amount worse :)


startUnrolledMap :: Int -> List a -> List b
startUnrolledMap 0 (x : xs) = f x : chunkedRevMap xs
startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
Contributor:

This can probably be made faster if you hand-optimize the cases. The JS generated here isn't going to be particularly optimal. We can reuse the branches.

Contributor (Author):

Yes, I thought the same thing. I'll look at it if the option you suggest with mutation isn't the way to go.

Contributor:

No, I just mean rewriting this so that common cases use the same branch, instead of having multiple branches which recognize lists of length >= 1 for example.

Contributor:

If you look at the generated JS, you'll hopefully see what I mean.

Contributor:

case xs of
  x : xs' -> ...
    case xs' of
      x' : xs'' -> ...
      _ -> ...
  _ -> ...

PureScript doesn't perform this optimization, you have to do it manually.
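
For illustration, a sketch of the manual version (hypothetical mapTwoFlat / mapTwoNested names, not code from this PR): the flat clauses each re-test the outer cons, while the nested form tests it once and branches on the tail.

module NestedCaseSketch where

import Data.List (List(..), (:))

-- flat clauses: the first two both match "length >= 1", so the
-- generated JS re-tests the outer cons for each clause
mapTwoFlat :: forall a b. (a -> b) -> List a -> List b
mapTwoFlat f (x1 : x2 : xs) = f x1 : f x2 : mapTwoFlat f xs
mapTwoFlat f (x : xs) = f x : mapTwoFlat f xs
mapTwoFlat _ Nil = Nil

-- nested matching: the outer cons is tested exactly once, and the
-- tail is inspected inside that branch
mapTwoNested :: forall a b. (a -> b) -> List a -> List b
mapTwoNested _ Nil = Nil
mapTwoNested f (x : xs) = case xs of
  x' : xs' -> f x : f x' : mapTwoNested f xs'
  Nil -> f x : Nil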

@paf31 (Contributor) commented Nov 22, 2017

Thanks!

Another thing to try is a straight traversal of the spine using a mutable cons cell, a la Scheme. I think we could make this safe. What do you think?

@matthewleon (Contributor, Author)

> straight traversal of the spine using a mutable cons cell, a la Scheme

I will research this and see if I can come up with something crafty. Is the idea here that we jump into ST?

@paf31 (Contributor) commented Nov 22, 2017

I think it would be FFI instead of ST unfortunately, unless we can come up with a way to modify cons cells in ST, which would mean some sort of weird STList data type.
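
One hypothetical shape for such a type (a sketch only, using the module path from the current purescript-st; nothing here is from the PR):

module STListSketch where

import Control.Monad.ST.Ref (STRef)

-- a list whose tails are mutable ST references, so the spine could be
-- patched in place while mapping (hypothetical type, not from the PR)
data STList h a
  = STNil
  | STCons a (STRef h (STList h a))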

@matthewleon (Contributor, Author)

> unless we can come up with a way to modify cons cells in ST, which would mean some sort of weird STList data type.

I was thinking along those lines. I'll try it and benchmark it.

@matthewleon (Contributor, Author) commented Nov 22, 2017

Though upon further thought, I don't see how to do that without running into the problem of stack-safe recursion in Eff being very slow. Perhaps the saner thing to do is to try the mutable cons cell in FFI... I'll have to use JS anyway to optimize the unroll.

@paf31 (Contributor) commented Nov 23, 2017

Yes, agreed, it doesn't seem worth it to make the code non-portable.

@matthewleon (Contributor, Author)

> Yes, agreed, it doesn't seem worth it to make the code non-portable.

So do you prefer that the code be largely left as-is, because this version is portable and has much better performance than what we had before, at least in Node?

Or should I go ahead and try to do the mutable cons cell + unrolling in FFI JavaScript?

The answer really isn't obvious to me, as I'm unaware of PS's stance on library portability. As it stands, this lib doesn't use any FFI directly. I believe that Lazy, however, uses FFI, which makes at least the lazy lists non-portable, no?

@paf31 (Contributor) commented Nov 23, 2017

Well the lazy data structure doesn't rely on the compiler internals in any way.

Here is a quick FFI experiment:

exports.mapList = function(f) {
  return function(l) {
    if (l.constructor.name === 'Nil') {
      return l;
    } else {
      var nil = require('Data.List.Types').Nil;
      var cons = require('Data.List.Types').Cons;
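      // 'result' starts as a placeholder cell; its fields are filled in
      // below, and the trailing placeholder is replaced by Nil at the end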
      var result = new cons(null, null);

      var input = l;
      var output = result;

      while (true) {
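        // map the current head in place, then allocate the next output cell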
        output.value0 = f(input.value0);
        output.value1 = new cons(null, null);

        if (input.value1 instanceof nil) {
          break;
        }

        input = input.value1;
        output = output.value1;
      }

      output.value1 = nil.value;

      return result;
    }
  }
}

Performance-wise, it's about twice as fast, at the expense of being less portable:

> import Main
> import Data.List
> import Performance.Minibench 
> xs = range 1 100000

> benchWith 1000 \_ -> map (\x -> x * 2) xs
mean   = 3.26 ms
stddev = 1.78 ms
min    = 2.58 ms
max    = 53.19 ms
unit

> benchWith 1000 \_ -> mapList (\x -> x * 2) xs
mean   = 1.55 ms
stddev = 1.54 ms
min    = 1.30 ms
max    = 47.51 ms
unit

I think it's better to be portable though, so I might put this in a separate library as a curiosity, but I wouldn't merge it into lists as it is now.

@matthewleon (Contributor, Author)

Recapping conversation with @paf31 on Slack: Though considerable speed gains can be made by using FFI with a mutable Cons cell, the List library is currently portable, and we don't want to introduce FFI just to optimize map further than this.

@paf31 (Contributor) commented Nov 23, 2017

I've put my code in a repo here, along with zipWith and filter replacements.

@matthewleon (Contributor, Author) commented Nov 23, 2017 via email

@matthewleon (Contributor, Author)

I've added benchmarking to this branch.
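
For anyone reproducing the numbers, a minimal harness along these lines should work (a sketch assuming purescript-minibench's current Effect-based benchWith; the sessions above used its Eff-era predecessor):

module BenchSketch where

import Prelude

import Data.List (List, range)
import Effect (Effect)
import Performance.Minibench (benchWith)

-- bench map over a 10k-element list, as in the REPL session above
main :: Effect Unit
main = do
  let xs = range 1 10000 :: List Int
  benchWith 1000 \_ -> map (_ + 1) xs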

@safareli (Contributor)

package-lock.json was committed in the last commit.

@matthewleon (Contributor, Author) commented Nov 26, 2017

@safareli removed it. Thanks!

-- chunk sizes determined through experimentation
listMap :: forall a b. (a -> b) -> List a -> List b
listMap f = startUnrolledMap unrollLimit
where
Contributor:

Nitpick: could you please indent things past the where, or move the where up to the end of the previous line?
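
That is, either of these layouts would do (schematic, with trivial bodies standing in for the real ones):

module WhereLayoutSketch where

import Prelude

-- bindings indented past the where ...
double :: Int -> Int
double n = n * factor
  where
    factor = 2

-- ... or the where moved up to the end of the previous line
triple :: Int -> Int
triple n = n * factor where
  factor = 3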

@paf31 (Contributor) commented Nov 30, 2017

Thanks!

What do other people think? @garyb @hdgarrood

Especially about the potential stack overflow issue mentioned above.

@garyb (Member) commented Nov 30, 2017

@paf31 what was the stack overflow issue? Just that if the stack is already near overflow, this might push it over the edge?

@hdgarrood (Contributor)

Yeah, if f makes enough recursive calls on average, I think that could cause stack overflows. This looks really good though! If you could address @paf31's recent comments and do a little testing to check that stack-safety isn't significantly worse than it is now, I'm 👍 to merge.

Come to think of it, do we have any testing for stack overflows in map in this library already?

@paf31 (Contributor) commented Nov 30, 2017

To be clear, the issue is that we go from "map never grows the stack" to "map always grows the stack, but maybe not by very much". The growth should be linear in the size of the list.

@natefaubion (Contributor)

With this, isn't it "map grows the stack by a fixed maximum amount"?

@paf31 (Contributor) commented Nov 30, 2017

Well startUnrolledMap calls itself recursively, but not tail-recursively, so no, I think the stack usage grows with the size of the input.

@paf31 (Contributor) commented Nov 30, 2017

Nested maps are also quite common. I don't think this is enough to cause a problem on its own, but it could worsen it.

The question for me is not whether this code should be included, but whether it should be the default. Do we want a default which works in every case but is slow in most cases, or one which works in most cases but overflows the stack in a tiny number of cases?

I think it's probably okay to pick this as the default, but I just want to make sure the trade off is clear now as we decide.

@natefaubion (Contributor)

There's also the question of what the stack limit is. It does 5 items at a time, so it will eat stack for lists of up to 5000 items. Maybe we don't need to handle that many, and we can reduce it to 200 frames if it's a concern?

@matthewleon (Contributor, Author)

> There's also the question of what the stack limit is. It does 5 items at a time, so it will eat stack for lists of up to 5000 items. Maybe we don't need to handle that many, and we can reduce it to 200 frames if it's a concern?

I wouldn't mind doing that. Maybe it's worth me doing some exploratory tests to see how much breathing room Node gives us?

@matthewleon (Contributor, Author)

Regarding call stack size: http://2ality.com/2014/04/call-stack-size.html

@matthewleon (Contributor, Author)

> we can reduce it to 200 frames if it's a concern?

I've made a commit that does this. I figure that this means we still get a big speed-up for common List use-cases, while if people are serious about having faster iteration through longer lists, they can use the fast list repo.

2017 MacBook Pro 2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3
Node v8.9.1

List
====
map
---
map: empty list
mean   = 1.31 μs
stddev = 11.87 μs
min    = 799.00 ns
max    = 375.82 μs
map: singleton list
mean   = 2.40 μs
stddev = 11.03 μs
min    = 1.03 μs
max    = 342.18 μs
map: list (1000 elems)
mean   = 143.41 μs
stddev = 225.12 μs
min    = 97.16 μs
max    = 2.03 ms
map: list (2000 elems)
mean   = 274.16 μs
stddev = 295.84 μs
min    = 199.66 μs
max    = 2.06 ms
map: list (5000 elems)
mean   = 531.84 μs
stddev = 512.61 μs
min    = 229.45 μs
max    = 2.95 ms
map: list (10000 elems)
mean   = 895.24 μs
stddev = 777.87 μs
min    = 464.59 μs
max    = 2.94 ms
map: list (100000 elems)
mean   = 33.45 ms
stddev = 7.65 ms
min    = 22.07 ms
max    = 63.47 ms
Commit message:

Addresses purescript#131

The relevant chunk sizes (5 for the initial list segment), (3 for the
tail-recursive remainder) were arrived at through benchmarked
experimentation, mapping a simple (_ + 1) through lists of various
sizes.

Relevant figures:
list of 1000 elems:   142.61 μs -> 36.97 μs
list of 2000 elems:   275.17 μs -> 55.33 μs
list of 10000 elems:  912.73 μs -> 208.39 μs
list of 100000 elems: 34.56 ms  -> 1.24 ms

The ~30x speed increase for long lists is probably explained by the lack
of GC thrashing with this approach.

Benchmarked on 2017 Macbook Pro, 2.3 GHz Intel Core i5, 8 GB RAM.
macOS Sierra 10.12.6
node v8.9.1
Commit message: this lowers the probability of stack-size troubles
@matthewleon (Contributor, Author)

Rebased.

@paf31 (Contributor) commented Dec 10, 2017

Given the possible stack-safety issue, what do people think about providing the naive (safe) implementation as a separate named function? Or vice versa?

@matthewleon (Contributor, Author)

One possibility is to just use the stack-safe unrolled operation, which should still give a speed boost, and recommend that people turn to purescript-lists-fast if speed is important.

@paf31 (Contributor) commented Dec 11, 2017

Would it be possible to try that and compare the performance easily?

matthewleon force-pushed the fast-functor branch 2 times, most recently from ca5d02f to cadf481 on December 12, 2017 01:44
@matthewleon (Contributor, Author)

I made a branch that only uses the stack-safe reverse map and... It's actually faster.

map: list (2000 elems)
mean = 53.33 μs -> 31.24 μs

map: list (10000 elems)
mean = 107.73 μs -> 98.17 μs

I've merged it in.

@matthewleon (Contributor, Author)

Just realized I can also inline the unrolledMap, will do that and push.

Commit message: begin with reverse unrolled map
@matthewleon (Contributor, Author)

Actually, inlining unrolledMap slows things down significantly. Mysteries of JS... This should be good for review.

@paf31 (Contributor) commented Dec 12, 2017

Even better, thanks!

Any other comments before I merge this?

@hdgarrood mentioned this pull request on Mar 10, 2019
@hdgarrood (Contributor)

Merged in via #157.
