List Functor: mix unrolled and reverse map #135

Conversation
Will update this with details on my machine setup.
(force-pushed from f6b1ae7 to 550e096)
-- chunked list Functor inspired by OCaml
-- https://discuss.ocaml.org/t/a-new-list-map-that-is-both-stack-safe-and-fast/865
-- chunk sizes determined through experimentation
listMap :: forall a b. (a -> b) -> List a -> List b
This is exported right now, is that deliberate?
No. It isn't. Good catch.
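For readers skimming the diff, here is a minimal sketch of the overall shape being proposed. It is illustrative only: the names `revMap` and `reverse'`, the exact clause structure, and the fallback strategy are assumptions for this sketch; the committed code (with `chunkedRevMap` working in chunks of 3) differs in detail.

```purescript
module MapSketch where

import Prelude
import Data.List.Types (List(..), (:))

-- Sketch: unroll up to `unrollLimit` chunks of 5 directly on the
-- stack, then fall back to a tail-recursive reverse map.
listMap :: forall a b. (a -> b) -> List a -> List b
listMap f = startUnrolledMap unrollLimit
  where
  -- Budget of unrolled stack frames; each frame handles 5 elements.
  unrollLimit :: Int
  unrollLimit = 1000

  startUnrolledMap :: Int -> List a -> List b
  startUnrolledMap _ Nil = Nil
  startUnrolledMap 0 xs = revMap xs
  startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
    f x1 : f x2 : f x3 : f x4 : f x5 : startUnrolledMap (n - 1) xs
  startUnrolledMap _ (x : xs) = f x : startUnrolledMap 0 xs

  -- Stack-safe fallback: map while reversing, then reverse back.
  -- Both passes are self tail calls, so they compile to loops.
  revMap :: List a -> List b
  revMap = go Nil
    where
    go acc (x : xs) = go (f x : acc) xs
    go acc Nil = reverse' acc

  reverse' :: List b -> List b
  reverse' = go Nil
    where
    go acc (x : xs) = go (x : acc) xs
    go acc Nil = acc
```

The idea is the one from the OCaml thread linked above: spend a bounded amount of stack on the fast path, then switch to the slower but stack-safe path for whatever remains.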
src/Data/List/Types.purs (Outdated)
startUnrolledMap :: Int -> List a -> List b
startUnrolledMap 0 (x : xs) = f x : chunkedRevMap xs
startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
  f x1 : f x2 : f x3 : f x4 : f x5 : startUnrolledMap (n - 1) xs
I'm not sure if OCaml has tail calls modulo cons, but isn't this going to lead to a stack overflow for us?
This is why I've limited the "unrolled map" to 1000 iterations. It should avoid stack overflow. Are you suggesting that there is a way to make it stack overflow despite that limit?
I guess not, unless `f` was already on the verge of overflowing the stack. It can only make an existing problem a finite amount worse :)
src/Data/List/Types.purs (Outdated)
startUnrolledMap :: Int -> List a -> List b
startUnrolledMap 0 (x : xs) = f x : chunkedRevMap xs
startUnrolledMap n (x1 : x2 : x3 : x4 : x5 : xs) =
This can probably be made faster if you hand-optimize the cases. The JS generated here isn't going to be particularly optimal. We can reuse the branches.
Yes, I thought the same thing. I'll look at it if the option you suggest with mutation isn't the way to go.
No, I just mean rewriting this so that common cases use the same branch, instead of having multiple branches which recognize lists of length >= 1 for example.
If you look at the generated JS, you'll hopefully see what I mean.
case xs of
  x : xs' ->
    case xs' of
      x' : xs'' -> ...
      _ -> ...
  _ -> ...
PureScript doesn't perform this optimization, you have to do it manually.
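To make that concrete, here is an illustrative two-at-a-time example (not the PR's final code): the first version's clauses each re-test the front of the list, while the second nests the cases so each cons cell is inspected once.

```purescript
module NestedCaseSketch where

import Prelude
import Data.List.Types (List(..), (:))

-- Separate clauses: the generated JS tests "is this a Cons?"
-- on the same cell once per clause.
unrolled :: forall a b. (a -> b) -> List a -> List b
unrolled f (x1 : x2 : xs) = f x1 : f x2 : unrolled f xs
unrolled f (x1 : xs) = f x1 : unrolled f xs
unrolled _ Nil = Nil

-- Nested cases: the "at least one element" branch is shared.
unrolled' :: forall a b. (a -> b) -> List a -> List b
unrolled' f l = case l of
  x1 : rest1 -> case rest1 of
    x2 : rest2 -> f x1 : f x2 : unrolled' f rest2
    Nil -> f x1 : Nil
  Nil -> Nil
```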
Thanks! Another thing to try is a straight traversal of the spine using a mutable cons cell, a la Scheme. I think we could make this safe. What do you think?
I will research this and see if I can come up with something crafty. Is the idea here that we jump into `ST`?
I think it would be FFI instead of ST unfortunately, unless we can come up with a way to modify cons cells in `ST`.
I was thinking along those lines. I'll try it and benchmark it.
Though upon further thought, I don't see how to do that without running into the problem of stack-safe recursion in `ST`.
Yes, agreed, it doesn't seem worth it to make the code non-portable.
So do you prefer that the code be largely left as-is, because this version is portable and has much better performance than what we had before, at least in Node? Or should I go ahead and try the mutable cons cell + unrolling in FFI JavaScript? The answer really isn't obvious to me, as I'm unaware of what PureScript's stance on library portability is. As it stands, this lib doesn't use any FFI at all, at least not directly. I believe that `Data.List.Lazy` relies on compiler internals, though.
Well the lazy data structure doesn't rely on the compiler internals in any way. Here is a quick FFI experiment:

```javascript
exports.mapList = function(f) {
  return function(l) {
    if (l.constructor.name === 'Nil') {
      return l;
    } else {
      var nil = require('Data.List.Types').Nil;
      var cons = require('Data.List.Types').Cons;
      var result = new cons(null, null);
      var input = l;
      var output = result;
      while (true) {
        output.value0 = f(input.value0);
        output.value1 = new cons(null, null);
        if (input.value1 instanceof nil) {
          break;
        }
        input = input.value1;
        output = output.value1;
      }
      output.value1 = nil.value;
      return result;
    }
  }
}
```

Performance-wise, it's about twice as fast, at the expense of being less portable.
I think it's better to be portable though, so I might put this in a separate library as a curiosity, but I wouldn't merge it into `Data.List`.
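If the FFI route were taken, the PureScript side would just be a foreign import whose name matches `exports.mapList` above. A minimal sketch, assuming a hypothetical module name:

```purescript
module Data.List.Fast (mapList) where

import Data.List.Types (List)

-- Implemented by the JavaScript shown in the previous comment.
foreign import mapList :: forall a b. (a -> b) -> List a -> List b
```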
(force-pushed from 550e096 to c93cd17)
Recapping conversation with @paf31 on Slack: though considerable speed gains can be made by using FFI with a mutable Cons cell, the List library is currently portable, and we don't want to introduce FFI just to optimize `map`.
I've put my code in a repo here (https://github.com/paf31/purescript-lists-fast), along with `zipWith` and `filter` replacements.
Cool. To be sure, those expected speedups might decrease a fair bit once we merge in non-FFI optimizations to Data.List.
I've added benchmarking to this branch.
(force-pushed from 29e604e to 7451586)
@safareli removed it. Thanks!
src/Data/List/Types.purs (Outdated)
-- chunk sizes determined through experimentation
listMap :: forall a b. (a -> b) -> List a -> List b
listMap f = startUnrolledMap unrollLimit
  where
Nitpick: could you please indent things past the `where`, or move the `where` up to the end of the previous line?
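For anyone unfamiliar with the convention, the two layouts being suggested look like this (a trivial illustrative example, not code from the PR):

```purescript
module WhereStyle where

import Prelude

-- Option 1: indent the definitions past the `where`.
double :: Int -> Int
double n = go n
  where
    go :: Int -> Int
    go m = m * 2

-- Option 2: move `where` up to the end of the previous line.
double' :: Int -> Int
double' n = go n where
  go :: Int -> Int
  go m = m * 2
```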
Thanks! What do other people think? @garyb @hdgarrood, especially about the potential stack overflow issue mentioned above.
@paf31 what was the stack overflow issue? Just that if the stack is nearly overflowing already, this might push it over the edge?
Yeah, if `f` has already eaten most of the stack. Come to think of it, do we have any testing for stack overflows in this library?
To be clear, the issue is that we go from "`map` does not grow the stack" to "`map` grows the stack".
With this, isn't it "`map` grows the stack by a fixed maximum amount"?
Well, a fixed maximum amount can still be enough to push an already-deep stack over the edge.
Nested `map`s compound the problem, too. The question for me is not whether this code should be included, but whether it should be the default. Do we want a default which works in every case but is slow in most cases, or one which works in most cases but overflows the stack in a tiny number of cases? I think it's probably okay to pick this as the default, but I just want to make sure the trade-off is clear now as we decide.
There's also the question of what the stack limit is. It does 5 items at a time, so it will eat stack for lists of up to 5000 items. Maybe we don't need to handle that many, and we can reduce it to 200 frames if it's a concern?
I wouldn't mind doing that. Maybe it's worth me doing some exploratory tests to see how much breathing room Node gives us?
Regarding the call stack size: http://2ality.com/2014/04/call-stack-size.html
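To make the arithmetic explicit, a back-of-envelope sketch (the helper `framesUsed` is hypothetical; chunk size 5 and the limits are the figures from this thread):

```purescript
module StackBudget where

import Prelude

-- One stack frame per unrolled chunk of 5 elements, capped by the
-- unroll limit; past the cap we switch to the tail-recursive path.
framesUsed :: Int -> Int -> Int
framesUsed unrollLimit listLength =
  min unrollLimit ((listLength + 4) `div` 5)

-- framesUsed 1000 5000 == 1000  -- original limit: up to 1000 frames
-- framesUsed 200 5000 == 200    -- reduced limit: at most 200 frames
```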
I've made a commit that does this. I figure this means we still get a big speed-up for common List use-cases, and people who are serious about faster iteration through longer lists can use the fast list repo.
2017 MacBook Pro, 2.3 GHz Intel Core i5, 8 GB 2133 MHz LPDDR3
Node v8.9.1

List
====

map
---

map: empty list           mean = 1.31 μs    stddev = 11.87 μs   min = 799.00 ns  max = 375.82 μs
map: singleton list       mean = 2.40 μs    stddev = 11.03 μs   min = 1.03 μs    max = 342.18 μs
map: list (1000 elems)    mean = 143.41 μs  stddev = 225.12 μs  min = 97.16 μs   max = 2.03 ms
map: list (2000 elems)    mean = 274.16 μs  stddev = 295.84 μs  min = 199.66 μs  max = 2.06 ms
map: list (5000 elems)    mean = 531.84 μs  stddev = 512.61 μs  min = 229.45 μs  max = 2.95 ms
map: list (10000 elems)   mean = 895.24 μs  stddev = 777.87 μs  min = 464.59 μs  max = 2.94 ms
map: list (100000 elems)  mean = 33.45 ms   stddev = 7.65 ms    min = 22.07 ms   max = 63.47 ms
this lowers the probability of stack-size troubles
(force-pushed from 659f14e to 8dff4a2)
Rebased.
Given the possible stack-safety issue, what do people think about providing the naive (safe) implementation as a separate named function? Or vice versa?
One possibility is to just use the stack-safe unrolled operation, which should still give a speed boost, and recommend that people turn to `purescript-lists-fast` if they need more.
Would it be possible to try that and compare the performance easily?
(force-pushed from ca5d02f to cadf481)
I made a branch that only uses the stack-safe reverse map and... it's actually faster on the `map: list (2000 elems)` and `map: list (10000 elems)` benchmarks. I've merged it in.
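For reference, here is a sketch of what a stack-safe reverse map with unrolled chunks can look like, using the chunk size of 3 mentioned in the commit message. This is an illustration under those assumptions (names like `unrolledRevMap` are invented here), not the exact merged code:

```purescript
module RevMapSketch where

import Prelude
import Data.List.Types (List(..), (:))

-- Both passes are self tail calls, which the PureScript compiler
-- turns into loops, so stack use is constant for any list length.
unrolledRevMap :: forall a b. (a -> b) -> List a -> List b
unrolledRevMap f = goReversed Nil
  where
  -- First pass: accumulate mapped elements in reverse order,
  -- three at a time where possible.
  goReversed :: List b -> List a -> List b
  goReversed acc (x1 : x2 : x3 : xs) =
    goReversed (f x3 : f x2 : f x1 : acc) xs
  goReversed acc (x : xs) = goReversed (f x : acc) xs
  goReversed acc Nil = reverseTo Nil acc

  -- Second pass: reverse the accumulator back into order.
  reverseTo :: List b -> List b -> List b
  reverseTo acc (x : xs) = reverseTo (x : acc) xs
  reverseTo acc Nil = acc
```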
Just realized I can also inline the reversal step.
begin with reverse unrolled map
(force-pushed from 6333855 to 5e653a9)
Actually, inlining it made things even faster.
Even better, thanks! Any other comments before I merge this?
Merged in via #157.
Addresses #131
The relevant chunk sizes (5 for the initial list segment, 3 for the tail-recursive remainder) were arrived at through benchmarked experimentation, mapping a simple `(_ + 1)` over lists of various sizes.
Relevant figures:
list of 1000 elems: 142.61 μs -> 36.97 μs
list of 2000 elems: 275.17 μs -> 55.33 μs
list of 10000 elems: 912.73 μs -> 208.39 μs
list of 100000 elems: 34.56 ms -> 1.24 ms
The ~30x speed increase for long lists is probably explained by the lack
of GC thrashing with this approach.