
Conversation

cuviper commented Oct 5, 2018

The former default behavior was that spawns would be processed in FIFO order from the global injector queue, but would effectively use LIFO order when a thread popped from the back of its local deque. The breadth_first() flag made it so threads always popped from the front of their own deque, just like a stealer would, and this made spawns look breadth-first/FIFO in all cases. However, #590 showed that this had a bad effect on stack-oriented jobs like parallel iterators.

Now unscoped spawns always use the global injector queue, so they're always FIFO.

For scoped spawns it's trickier, because we still want the thread that owns the scope to prioritize its own items so it can return. So we want to use the local deque, but still achieve FIFO order for its spawned jobs. Now each scope gets its own private queue: each spawned job is pushed onto that queue, and a reference to the queue (ScopeQueue) is pushed onto the local deque as an indirect job.

When a ScopeQueue job is executed, it pops the real job from the front of its queue and executes that. No matter whether the ScopeQueue was popped from the back of a thread's deque or stolen from the front, the actual spawn will be processed in a consistent FIFO order.

Fixes #590.
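For illustration, here is a simplified standalone model of that indirection (not rayon's actual internals: the lock-free deques and the worker loop are reduced to a mutex-protected queue and a plain Vec):

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Job = Box<dyn FnOnce() + Send>;

// The scope's private FIFO of real spawn jobs.
#[derive(Default)]
struct ScopeQueue {
    jobs: Mutex<VecDeque<Job>>,
}

impl ScopeQueue {
    // Called on `Scope::spawn`: enqueue the real job at the back...
    fn push(&self, job: Job) {
        self.jobs.lock().unwrap().push_back(job);
    }

    // ...and this is what the indirect job does when a worker executes it,
    // whether it was popped locally (back/LIFO) or stolen (front/FIFO):
    // take the *oldest* queued spawn, so spawns run in FIFO order.
    fn pop_front_and_run(&self) {
        let job = self.jobs.lock().unwrap().pop_front();
        if let Some(job) = job {
            job();
        }
    }
}

fn main() {
    let queue = Arc::new(ScopeQueue::default());

    // "Spawning" three jobs: each real job goes into the scope queue, and a
    // reference to the queue (the indirect job) goes onto the local deque.
    let mut local_deque: Vec<Arc<ScopeQueue>> = Vec::new();
    for i in 0..3 {
        queue.push(Box::new(move || println!("spawn {}", i)));
        local_deque.push(Arc::clone(&queue));
    }

    // Even though the worker pops its own deque from the back (LIFO), the
    // real jobs still come out of the scope queue's front: 0, 1, 2.
    while let Some(indirect) = local_deque.pop() {
        indirect.pop_front_and_run();
    }
}
```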

cuviper requested a review from nikomatsakis October 5, 2018 19:32

cuviper commented Oct 5, 2018

cc @bholley, AIUI breadth_first() in #360 was your request to help Servo performance. It would be great if you could benchmark how this new scheme compares!

bholley commented Oct 8, 2018

I spent some time testing this today. Full results are here: https://gist.github.com/bholley/88b0ff875061ecf7c74cef6b371d89f7

Summary of elements_matched (format: vanilla, vanilla+no-breadth-first, cuviper/fifo-spawn):

  • html5spec, single-threaded: 42236, 80730, 42236
  • html5spec, multi-threaded: 64613, 81392, 76199
  • wikipedia, single-threaded: 2719, 5558, 2719
  • wikipedia, multi-threaded: 4197, 5569, 5174

From the 'Analysis' section:

  • Single-threaded performance is deterministic; multi-threaded performance is not, but is distributed within a relatively tight range.
  • Disabling breadth_first on existing rayon gives significantly worse sharing in both single-threaded and multi-threaded modes.
  • Under cuviper/fifo-spawn, single-threaded execution works as well as stock Firefox. However, multithreaded execution is significantly worse (though not as bad as disabling breadth_first).

bholley commented Oct 8, 2018

I'll provide a bit more context here about what Stylo is doing.

When we style the DOM, we traverse it top-down, here:

https://searchfox.org/mozilla-central/source/servo/components/style/driver.rs#62

We start doing this sequentially (breadth-first). For every node in the DOM tree, we process it and then append its children to the work queue. When the queue gets big enough, we kick over to the rayon pool, where work stealing happens on units of 16 DOM nodes.

Each thread keeps a thread-local cache of previous styles it has seen that it might be able to reuse. Sharing only works between elements at the same level in the tree, which is why the breadth-first traversal is important. Moreover, since the cache is thread-local, and since sharing is more likely to be viable for nodes that are closer together, we generally want the threads to focus on distinct areas of the overall tree, rather than jumping around, to improve locality.
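Roughly, the shape of the traversal looks like this (heavily simplified: the node type, the chunk handling, and the cache below are stand-ins rather than Servo's actual code):

```rust
use std::cell::RefCell;

const CHUNK: usize = 16; // work-stealing granularity described above

struct Node {
    style_key: u64,
    children: Vec<Node>,
}

thread_local! {
    // Stand-in for the per-thread style sharing cache.
    static CACHE: RefCell<Vec<u64>> = RefCell::new(Vec::new());
}

fn style(node: &Node) {
    CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if !cache.contains(&node.style_key) {
            cache.push(node.style_key); // "compute" the style and remember it
        }
    });
}

fn traverse(root: &Node) {
    // (The real driver starts sequentially and only kicks over to the pool
    // once the work queue is big enough; that part is omitted here.)
    rayon::scope(|s| spawn_chunk(s, vec![root]));
}

fn spawn_chunk<'a>(s: &rayon::Scope<'a>, chunk: Vec<&'a Node>) {
    s.spawn(move |s| {
        let mut next = Vec::new();
        for &node in &chunk {
            style(node);
            next.extend(node.children.iter());
        }
        // Hand the children off in units of CHUNK nodes so other threads
        // can steal whole work units.
        for c in next.chunks(CHUNK) {
            spawn_chunk(s, c.to_vec());
        }
    });
}

fn main() {
    let root = Node {
        style_key: 0,
        children: (0..64u64).map(|k| Node { style_key: k % 4, children: vec![] }).collect(),
    };
    traverse(&root);
}
```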

So my interpretation of the results here is that the new algorithm is disrupting that locality, causing us to share less when there are more threads. The loss of sharing here is significant enough that it would have a measurable impact on memory usage and performance in Firefox, which is something we want to avoid. Ideally this would not involve us forking rayon, so I'm hoping we can tweak the new setup a bit to preserve the old characteristics.

CC @emilio @bzbarsky @heycam @nikomatsakis

cuviper commented Oct 8, 2018

Thank you for testing, and for the additional context!

I think your thread-local cache is the effect that I didn't understand before. My change gives spawns a global FIFO priority, but that doesn't actually help your localized cache much. I'll have to think more about your scenario and how we might better support it.

Ideally this would not involve us forking rayon

No worries, we're not going to change breadth_first() until we have a satisfactory solution! AFAIK you're the only users of this mode, and the issues in #590 won't bite you unless you start mixing in things like parallel iterators in the same pool. I still hope we can do better, because the current semantics are pretty shaky, but we'll leave this behavior alone until we figure it out.

bholley commented Oct 8, 2018

Great, thank you! Feel free to ping me if you have any questions about the peculiarities in our setup, or need any additional testing done.

ghost commented Oct 8, 2018

@cuviper

I'm afraid the queue-per-scope approach is not going to fly. We're losing performance in three ways:

  1. All tasks spawned inside a scope are passing through the same queue. Threads will contend on push/pop operations.

  2. The chance that a scoped task spawned inside a worker thread will be executed by that same thread is too low. This is because we don't push scoped tasks into job queues - we only push a "permit" to pop a task from the scope-local queue.

  3. Too many queue operations. Every scoped task now causes 2 push and 2 pop operations.

I suggest we do the following instead:

  • Get rid of the queue inside the scope. Instead, let's have a job queue inside every worker. So now every worker has a FIFO queue and a LIFO deque. There is also one global FIFO queue.

  • To spawn a task, push it into the local FIFO queue (if inside a worker) or into the global FIFO queue.

  • When calling rayon::join, push task B onto the local LIFO deque. After executing task A, keep executing tasks from the LIFO deque until we execute task B or the deque becomes empty. Then fall back to worker_thread.wait_until(&job_b.latch) as usual.

  • In wait_until, keep executing tasks from the local FIFO queue. If it becomes empty, steal a task from the global FIFO queue. If that one is empty too, try stealing a task from another worker's queue or deque.

Note that only rayon::join pops tasks from the local LIFO deque, and only wait_until pops tasks from the local FIFO queue.

This way the three problems are fixed:

  1. No contention - we have a queue per worker thread rather than per scope.

  2. A task spawned inside a worker thread gets pushed into the local FIFO queue. Therefore, there's a very high chance that the same thread will execute it, thus taking advantage of its hot data being in cache.

  3. Every task is now pushed once and popped once. Even task popping is still efficient because rayon::join only looks into the LIFO deque, while wait_until only looks into the FIFO queue. Neither function has to check for both of them.

The local FIFO queues should probably be crossbeam-deques, while the global FIFO queue should probably be a SegQueue because it doesn't require a lock.
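A rough single-threaded model of that scheme (one worker, no stealing, and plain std collections instead of crossbeam deques; this sketches the proposal rather than working scheduler code):

```rust
use std::collections::VecDeque;

type Job = Box<dyn FnOnce(&mut Worker)>;

struct Worker {
    local_fifo: VecDeque<Job>, // spawned tasks: push_back / pop_front
    local_lifo: Vec<Job>,      // join's task B: push / pop (stack order)
}

impl Worker {
    // spawn inside a worker: push onto the local FIFO queue.
    fn spawn(&mut self, job: Job) {
        self.local_fifo.push_back(job);
    }

    // join(a, b): push B onto the LIFO deque, run A, then keep popping the
    // LIFO deque (B is on top here, since nothing was stolen in between).
    fn join(&mut self, a: Job, b: Job) {
        self.local_lifo.push(b);
        a(&mut *self);
        while let Some(job) = self.local_lifo.pop() {
            job(&mut *self);
        }
    }

    // wait_until: keep executing spawned tasks from the local FIFO queue; a
    // real worker would then fall back to the global queue and to stealing.
    fn wait_until_idle(&mut self) {
        while let Some(job) = self.local_fifo.pop_front() {
            job(&mut *self);
        }
    }
}

fn main() {
    let mut w = Worker { local_fifo: VecDeque::new(), local_lifo: Vec::new() };
    w.spawn(Box::new(|_| println!("spawn 1 (FIFO)")));
    w.spawn(Box::new(|_| println!("spawn 2 (FIFO)")));
    w.join(
        Box::new(|_| println!("join: task A")),
        Box::new(|_| println!("join: task B (LIFO)")),
    );
    w.wait_until_idle();
}
```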

cuviper commented Oct 8, 2018

@stjepang I think you're over-simplifying it, but of course I could be over-complicating it too... :)

  1. All tasks spawned inside a scope are passing through the same queue. Threads will contend on push/pop operations.

  2. The chance that a scoped task spawned inside a worker thread will be executed by that same thread is too low. This is because we don't push scoped tasks into job queues - we only push a "permit" to pop a task from the scope-local queue.

If we want spawns to have some global order (which is not necessarily agreed), then I think both of these points are necessary trade-offs.

  3. Too many queue operations. Every scoped task now causes 2 push and 2 pop operations.

While true, I think this is a micro-optimization in the grand scheme of things. Note that @bholley's analysis was primarily looking at a high-level metric, indicating the effectiveness of their cache. We should worry about those kinds of effects first, because I expect the real workload to far outweigh these individual queue operations.

The rest of your idea of per-thread LIFO and FIFO queues is essentially what we already discussed earlier in #590. My main concern about that was nested/mixed mode, noted here: #590 (comment).

To restate that, suppose you have a scope in which you spawn some tasks. Those tasks each run a parallel iterator (recursive joins), and within that you start a new inner scope with some more spawns. So the call stack is now nested with scope -> join... -> scope. I assert that we really want the thread to finish that inner scope first, as those are the finest-grained work items, then we can unwind a bit before starting a new outer work item.

If that inner scope only has the one shared FIFO to pull from, then it will keep pulling big outer items before it ever gets to the smaller inner items. I fear you'll end up with something like scope -> (join... -> scope) -> (join... -> scope) -> (join... -> scope) -> (join... -> scope) ... and we can't unwind anything until we've completed every spawn this thread has ever seen!

(AFAIK, Servo/Firefox folks are not nesting scopes or using join or parallel iterators at all, so they don't have this particular problem.)
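To make that scenario concrete, here is a sketch of the nesting pattern (an arbitrary toy workload, not taken from any real benchmark):

```rust
use rayon::prelude::*;

fn main() {
    rayon::scope(|outer| {
        for task in 0..4 {
            outer.spawn(move |_| {
                // A parallel iterator: recursive joins under the hood.
                (0..100).into_par_iter().for_each(|chunk| {
                    rayon::scope(|inner| {
                        inner.spawn(move |_| {
                            // Finest-grained item: ideally finished before the
                            // worker picks up another big outer task.
                            let _ = task * 1000 + chunk;
                        });
                    });
                });
            });
        }
    });
}
```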

ghost commented Oct 9, 2018

Note that @bholley's analysis was primarily looking at a high-level metric, indicating the effectiveness of their cache.

The queue-per-scope strategy is ineffective in Firefox because, as @bholley explains, threads tend to jump around rather than focus on distinct areas of the overall tree. A thread that spawns a task must later pick it up with high probability.

Another suggestion would be to have num_workers queues in each scope, which can be accessed by thread indices. But this is more complicated and scope creation becomes more costly...

If that inner scope only has the one shared FIFO to pull from, then it will keep pulling big outer items before it ever gets to the smaller inner items.

Honestly, I wouldn't worry about it. Yes, the model where nested scopes are in LIFO order but tasks within scopes are in FIFO order is really wonderful. But... it's difficult to implement efficiently, nested scopes are not common in practice (don't know if they exist at all!), and the worst thing that can happen if we share queues among scopes is that some stack-allocated variables don't get dropped as soon as we'd ideally like. It's just not worth it, I think. :)

cuviper commented Oct 9, 2018

The queue-per-scope strategy is ineffective in Firefox because, as @bholley explains, threads tend to jump around rather than focus on distinct areas of the overall tree. A thread that spawns a task must later pick it up with high probability.

Yeah, I think I get it now -- what I'm searching for is how to generalize this. Firefox wants it this way because they have this thread-local cache. I suppose in general you could argue that CPU caches probably benefit in a similar way, although that would favor LIFO even more. (Keep working on the thing you just touched.)

The whole breadth-first thing puts us in a weird place where we care about FIFO order, but only relative to the thread, and I'm not sure this is something we can present well as a general solution.

Another suggestion would be to have num_workers queues in each scope, which can be accessed by thread indices.

Yeah, I was also thinking along those lines. That should be better at maintaining the current semantics, but I still feel like these are weird semantics in the first place. I get why it works for what Firefox is doing, but it's a lot of implicit behavior to depend on.

Honestly, I wouldn't worry about it. [...] the worst thing that can happen if we share queues among scopes is that some stack-allocated variables don't get dropped as soon as we'd ideally like.

If you get a parallel-iterator/join in there too, then we're back to #590. It's also possible that these things would not be explicitly nested, but end up like that anyway because of work stealing. I'm trying to figure out something that will be robust even in such weird cases.

ghost commented Oct 9, 2018

Ok, here's another idea that takes the best of both worlds. :)

Start with the design proposed in my previous comment. Each worker thread has a FIFO queue and a LIFO queue. Scopes don't have their own queues. Now we just need a tweak that prioritizes inner scopes over outer scopes.

The tweak is: when a worker thread creates a new scope, it temporarily swaps out its two queues for new ones. Once the scope completes, the old queues are put back. So how do we implement this?

Recall that we currently have a global Mutex<RegistryState>, and RegistryState has a Deque<JobRef> field. Think of it as the FIFO queue for the implicit global scope used by rayon::spawn. It might help to imagine that fn main is implicitly enclosed by a rayon::scope, and rayon::spawn spawns tasks in that scope.

We extend RegistryState by adding another field, a vector of FIFO queues and LIFO queues used by currently active scopes: Vec<(Stealer<JobRef>, Stealer<JobRef>)>. When a worker thread creates a new scope, it pushes a new FIFO queue and a new LIFO queue into that vector, which become the thread's main queues for the duration of the scope. Once the scope completes, we remove those queues.

That's pretty much it. Now, the thread that creates a scope will clearly prioritize its tasks over tasks from other scopes. But what about other threads - won't they mix up tasks belonging to different scopes in their own queues? Fortunately, they won't! Note that a thread will steal tasks only if both of its own queues are empty. Therefore, tasks from different scopes never end up in the same queue.

We could be a bit smarter about that Mutex around RegistryState and things like that, but... Overall, this design should prioritize inner scopes, have good performance (no duplicate queue operations, good cache locality!), and keep the old task scheduling strategy in Firefox. Everything works out.
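A bare-bones standalone model of just that swap, to make the bookkeeping concrete (this is not rayon's internals; stealing and the RegistryState vector of Stealers are reduced to comments):

```rust
use std::collections::VecDeque;

type Job = Box<dyn FnOnce() + Send>;

#[derive(Default)]
struct WorkerQueues {
    fifo: VecDeque<Job>, // spawned tasks for the currently active scope
    lifo: VecDeque<Job>, // join tasks for the currently active scope
}

struct WorkerThread {
    queues: WorkerQueues,
}

impl WorkerThread {
    fn scope<R>(&mut self, body: impl FnOnce(&mut WorkerThread) -> R) -> R {
        // Entering a scope: swap in fresh queues, so this thread only sees
        // the new scope's tasks for its duration. (The real design would
        // also publish the new queues' Stealers in RegistryState's vector
        // so other threads can steal from the active scope.)
        let parent = std::mem::take(&mut self.queues);
        let result = body(self);
        // Leaving the scope: restore the parent's queues, so outer work
        // resumes where it left off.
        self.queues = parent;
        result
    }
}

fn main() {
    let mut w = WorkerThread { queues: WorkerQueues::default() };
    w.queues.fifo.push_back(Box::new(|| println!("outer spawn")));

    w.scope(|w| {
        // Inside the scope, the outer spawn is invisible to this worker.
        assert!(w.queues.fifo.is_empty());
        w.queues.fifo.push_back(Box::new(|| println!("inner spawn")));
        // A real scope would drain its queues (and wait for stolen work)
        // before completing; do that draining here.
        while let Some(job) = w.queues.fifo.pop_front() {
            job();
        }
    });

    // Once the scope completes, the parent's queues are back.
    assert_eq!(w.queues.fifo.len(), 1);
}
```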

cuviper commented Oct 9, 2018

We extend RegistryState by adding another field, a vector of FIFO queues and LIFO queues used by currently active scopes

You want all stealing operations to be synchronized through RegistryState?

cuviper commented Oct 9, 2018

Anyway @stjepang, I'd be happy to see you implement your ideas in a competing PR for comparison.

cuviper commented Oct 9, 2018

I get why it works for what Firefox is doing, but it's a lot of implicit behavior to depend on.

Along these lines, I'm trying to think of ways that Firefox could express this. One idea is that threads could explicitly create a new scope when they steal work. That could be done by detecting a changed current_thread_index(), or we could provide a context variable like join_context does. Or maybe present this as a different kind of thread-local scope that automatically "re-scopes" when stolen for the same effect.

The point is to make all of this easier to explain and to reason about how jobs will execute. We can then say that spawn is always globally FIFO within its scope, but that we have a means to isolate a particular subset of your jobs by using a new scope, which is friendly to that thread-local cache.
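As a rough sketch of how that could be expressed from the user side even today (a usage pattern, not a proposed rayon API; `spawn_subtree` and `process_subtree` are made-up names):

```rust
fn spawn_subtree<'a>(s: &rayon::Scope<'a>, nodes: Vec<u64>) {
    // Remember which worker spawned this unit of work.
    let owner = rayon::current_thread_index();
    s.spawn(move |s| {
        if rayon::current_thread_index() == owner {
            // Still on the spawning thread: keep using the outer scope.
            process_subtree(s, nodes);
        } else {
            // Stolen: isolate this subtree in a fresh scope on the new
            // thread, so its own spawns stay local to the thief.
            rayon::scope(|inner| process_subtree(inner, nodes));
        }
    });
}

fn process_subtree<'a>(s: &rayon::Scope<'a>, nodes: Vec<u64>) {
    // ... style `nodes` here and spawn further chunks on `s` ...
    let _ = (s, nodes);
}

fn main() {
    rayon::scope(|s| spawn_subtree(s, (0..64).collect()));
}
```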

bholley commented Oct 10, 2018

To be clear, we don't want to prevent threads from jumping to other areas of the tree if they run out of work to do. We just want to minimize far jumps when there's still nearby work.

cuviper commented Oct 10, 2018

Certainly, further work-stealing is still possible.

cuviper commented Oct 10, 2018

We discussed this issue in our meeting today:
https://gitter.im/rayon-rs/Lobby?at=5bbe3d21600c5f642386e123

I'm going to prototype the idea of using N queues (one per thread) in the scope, which should preserve the behavior Firefox wants right now. I think we should be able to get the same cache-hit performance, and then we can see how this affects raw performance numbers too.

Then I want to try to present these choices as explicit variants of scope, so we can let the user decide what fits them best. This assumes we can think of reasonable scenarios for each, and have a good default (probably the direct case).

  • freeform, direct queuing -- the status quo without BFS
  • indirect, scopes with a global FIFO queue -- this PR
  • indirect, scopes with per-thread FIFO queues -- the prototype above

Experiments from others (@stjepang) are still welcome too!

nikomatsakis commented Oct 10, 2018

So @cuviper and I discussed some on gitter. I think we decided that — first — it makes sense to tinker with a few more variants and try to measure the performance, just to validate our hypotheses. In particular, I am interested in the variant where scopes have N FIFOs, one per worker thread. This preserves the intermixed FIFO/LIFO layout, at the cost of 2 pushes/pops, but also preserves thread locality (that is, threads prefer to run work they spawned).

What I like about this FIFO-per-thread variant is that if you do the "very simple" thing of making one scope and spawning into it, you basically get the "classic work stealing" behavior, which time has shown to be a pretty decent default. I don't love the "indirection overhead" of requiring two queue operations, of course. I also worry that if one were using nested scopes, the overhead might be high, as you might have quite a lot of deques being created.

More generally, it occurred to us that we could present this as variations on spawn/scope, giving some of this control to the user:

  • Global FIFO (what @cuviper implemented here)
  • FIFO per thread (what we proposed above)
  • LIFO per thread (what we have now, without the BFS switch)

The other interesting question is whether @stjepang is correct that we should not worry so much about intermingled LIFO/FIFO behavior. I could certainly imagine, though, someone using spawn to spawn tasks that internally use par_iter operations -- I guess the question is whether they might reasonably then use scopes themselves (since par_iter, at least today, relies exclusively on join). I've also considered parallel iterator adapters that might use spawn internally — though we've never actually gone that way — which would certainly be a case where things might intermingle.

azriel91 added a commit to azriel91/rayon that referenced this pull request Oct 11, 2018
cuviper commented Oct 23, 2018

Argh, I did prototype those ideas shortly after we discussed them, but neglected to share it.
https://github.com/cuviper/rayon/tree/thread-fifo-spawn

That branch has implementations of three modes: Direct (current non-BFS), GlobalFifo (this PR), and ThreadFifo (hopefully like the current BFS). I didn't change the public API yet, so at the moment it's just forced to ThreadFifo for testing. I would appreciate if you could see how this performs with Stylo!

bholley commented Oct 25, 2018

I would appreciate if you could see how this performs with Stylo!

I applied the diff of the WIP commit to my local tree (which was the one I tested with fifo-spawn). IIUC, that's the only change.

Assuming I didn't mess something up, sharing behavior matches stock Firefox, so this would work great for us. Thank you!

Let me know when it's released, since I'll want to remeasure to be sure everything is working properly.

@nikomatsakis

Given @bholley's measurements, @cuviper, care to update this PR to that branch?

cuviper commented Nov 1, 2018

FYI, we're trying out a new RFC process for this in rayon-rs/rfcs#1.

@nikomatsakis

I think we can close this in favor of #615

bors bot added a commit that referenced this pull request Feb 2, 2019
615: Implement RFC #1: FIFO spawns r=nikomatsakis a=cuviper

This implements rayon-rs/rfcs#1, adding `spawn_fifo`, `scope_fifo`, and `ScopeFifo`, and deprecating the `breadth_first` flag.

Fixes #590.
Closes #601.

Co-authored-by: Josh Stone <[email protected]>
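For reference, the FIFO variants that #615 eventually added look like this in use (a minimal usage sketch against the released API; see RFC #1 for the precise semantics):

```rust
fn main() {
    // Free-standing FIFO spawns go through a FIFO queue rather than the
    // worker's LIFO deque.
    rayon::spawn_fifo(|| println!("first"));
    rayon::spawn_fifo(|| println!("second"));

    // A FIFO scope: spawns within it are prioritized in FIFO order relative
    // to the scope, replacing the deprecated breadth_first() builder flag.
    rayon::scope_fifo(|s| {
        for i in 0..4 {
            s.spawn_fifo(move |_| println!("scoped job {}", i));
        }
    });
}
```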