Commit e3534a7

Merge pull request #1 from nikomatsakis/scope-scheduling
scope scheduling RFC
2 parents 9a99966 + 2b3045b commit e3534a7

1 file changed: 334 additions, 0 deletions
@@ -0,0 +1,334 @@
# Summary
[summary]: #summary

- Introduce a new function, `scope_fifo`, which creates a Rayon scope
  that executes tasks in **per-thread FIFO** order; in this mode, at
  least with one worker thread, the tasks that are spawned first execute
  first. This is in contrast to the traditional Rayon `scope`, which
  executes in **per-thread LIFO** order, such that tasks that are
  spawned first execute last. Per-thread FIFO requires a small amount of
  indirection to implement but is important for some use-cases.
- Introduce a new function, `spawn_fifo`, that pushes tasks onto the
  implicit global scope. These tasks will also execute in FIFO order,
  in contrast to the [existing `spawn` function][spawn].
- Deprecate the [existing `breadth_first` flag on `ThreadPoolBuilder`][bf].
  Users should migrate to creating a `scope_fifo` instead, as it is better
  behaved.
  - In the future, the `breadth_first` flag may be converted to a no-op.

[spawn]: https://docs.rs/rayon/1.0.3/rayon/fn.spawn.html
[bf]: https://docs.rs/rayon/1.0.2/rayon/struct.ThreadPoolBuilder.html#method.breadth_first

# Motivation
[motivation]: #motivation

The prioritization order for tasks can sometimes make a big difference
to the overall system efficiency. Currently, Rayon offers only a
single knob for tuning this behavior, in the form of [the
`breadth_first` option][bf] on thread-pool builders. This knob is not
only rather coarse, it can also lead to quite surprising behavior when
one intermingles `scope` and `join` (including stack overflows, see
[#590]). The goal of this RFC is to make more options available to
users while ensuring that these options "compose well" with the
overall system.

[#590]: https://github.com/rayon-rs/rayon/issues/590

## Current behavior: Per-thread LIFO

By default, and presuming no stealing occurs, the current behavior of
a Rayon scope is to execute tasks in the reverse of the order that
they are created. Therefore, in the following code, task 2 would
execute first, and then task 1:

```rust
rayon::scope(|scope| {
    scope.spawn(|_| { /* task 1 */ });
    scope.spawn(|_| { /* task 2 */ });
});
```

Thieves, in contrast, steal tasks in the order that they are created,
so a thief would steal task 1 before task 2. Once a task is stolen,
any new tasks that it spawns are processed first by the thief, again
in reverse order (but may in turn be stolen by others, again in the
order of creation).

**Implementation notes.** This behavior corresponds very nicely with
the general "work stealing" implementation, where each thread has a
deque of tasks. Each new task is pushed onto the back of the
deque. The owning thread pops tasks from the back of the deque when
executing locally, while thieves steal from the front of the deque.
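
As a rough illustration of that deque discipline, here is a minimal, single-threaded sketch using only the standard library. `WorkerDeque` and `demo` are hypothetical names chosen for this example, and tasks are reduced to labels; Rayon's real deque is a concurrent structure, so this only models the ordering, not the implementation.

```rust
use std::collections::VecDeque;

/// Toy model of one worker's deque; tasks are just labels here.
struct WorkerDeque {
    tasks: VecDeque<&'static str>,
}

impl WorkerDeque {
    fn new() -> Self {
        WorkerDeque { tasks: VecDeque::new() }
    }

    /// Spawning a task pushes it onto the back of the deque.
    fn push(&mut self, task: &'static str) {
        self.tasks.push_back(task);
    }

    /// The owning thread pops from the back: newest task first (LIFO).
    fn pop_local(&mut self) -> Option<&'static str> {
        self.tasks.pop_back()
    }

    /// A thief takes from the front: oldest task first.
    fn steal(&mut self) -> Option<&'static str> {
        self.tasks.pop_front()
    }
}

/// Spawn three tasks, let a thief take one, then drain the rest locally.
/// Returns the stolen task and the local execution order.
fn demo() -> (Option<&'static str>, Vec<&'static str>) {
    let mut deque = WorkerDeque::new();
    deque.push("task 1");
    deque.push("task 2");
    deque.push("task 3");

    let stolen = deque.steal();
    let mut local = Vec::new();
    while let Some(task) = deque.pop_local() {
        local.push(task);
    }
    (stolen, local)
}
```

In this model, `demo()` reports that the thief got `task 1` (the oldest), while the owner ran `task 3` and then `task 2`.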

## Per-thread FIFO

Unfortunately, for some applications, executing tasks in the reverse
order turns out to be undesirable. One such application is Stylo, the
parallel CSS style engine used in Firefox. The "basic loop" in Stylo is
a simple tree walk that descends the DOM tree, spawning off a task for each
element:

```rust
fn style_walk<'scope>(
    element: &'scope Element,
    scope: &rayon::Scope<'scope>,
) {
    style(element);
    // Assumes `&Element` iterates over its child elements.
    for child in element {
        // `move` so that the spawned task owns its `child` reference.
        scope.spawn(move |scope| style_walk(child, scope));
    }
}
```

For efficiency, Stylo employs a per-worker-thread cache, which enables
it to share information between tasks. For this cache to be maximally
effective, it is best to process all of the children of a given
element first, before processing its "grandchildren" (this is because
sibling tasks require the same cached information, but grandchildren
may not). However, if we use the default scheduling strategy of
per-thread LIFO, this is not what we get: instead, for a given element
`E`, we would process first its last child `Cn`. Processing `Cn` would
push more tasks onto the thread-local deque (the grandchildren of `E`)
and those would be the next to be processed. In effect, we are getting
a depth-first search when what we wanted was a breadth-first search.
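
To make the difference concrete, the sketch below replays the tree walk on a single thread with no stealing, once popping from the back of the queue (LIFO) and once from the front (FIFO). The tree shape, the node labels, and the names `children` and `walk_order` are assumptions made for this illustration; they are not part of Stylo or Rayon.

```rust
use std::collections::VecDeque;

/// Children of each node in a small hypothetical tree:
/// E has children C1 and C2, each of which has two children of its own.
fn children(node: &str) -> &'static [&'static str] {
    match node {
        "E" => &["C1", "C2"],
        "C1" => &["G1", "G2"],
        "C2" => &["G3", "G4"],
        _ => &[],
    }
}

/// Returns the order in which nodes are processed by one thread.
/// `fifo == false` pops the newest pending task (per-thread LIFO);
/// `fifo == true` pops the oldest pending task (per-thread FIFO).
fn walk_order(fifo: bool) -> Vec<&'static str> {
    let mut pending: VecDeque<&'static str> = VecDeque::new();
    pending.push_back("E");
    let mut order = Vec::new();
    loop {
        let next = if fifo {
            pending.pop_front()
        } else {
            pending.pop_back()
        };
        let Some(node) = next else { break };
        order.push(node);
        // "Styling" a node spawns one task per child, in document order.
        for &child in children(node) {
            pending.push_back(child);
        }
    }
    order
}
```

With this tree, `walk_order(false)` yields E, C2, G4, G3, C1, G2, G1 (depth-first, last child first), whereas `walk_order(true)` yields E, C1, C2, G1, G2, G3, G4, so both children are processed before any grandchild.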

To address this, we currently offer a per-threadpool flag called
[`breadth_first`][bf]. This causes us to (globally) process tasks
from the front of the queue first. This works well for Stylo, but
[interacts poorly with parallel iterators][#590], as previously
mentioned.

Instead of a flag on the threadpool, this RFC proposes to allow users
to select the scheduling when constructing a scope. So Stylo would
create its scope using `rayon::scope_fifo` instead of `rayon::scope`:

```rust
rayon::scope_fifo(|scope| {
    ...
});
```

Creating a FIFO scope means that, when a thread goes to process a task
that is part of the scope, it prefers the tasks that were created
**within the current thread**, oldest first. If we assume no
stealing, then all tasks are created by one thread, and hence this is
simply a FIFO ordering.

However, when stealing occurs, the ordering can get more complex and
does not abide by a strict FIFO ordering. To illustrate, imagine that
a thread T1 creates a scope S and creates three tasks, A, B and C,
within the scope. This thread T1 will begin by executing A, as it is
the task created first. Let us imagine that processing A takes a very
long time, and all the rest of the work proceeds before it completes.

Next, imagine that a thief thread T2 steals the task B. In executing
B, it creates two more tasks, D and E. Once T2 finishes with B, it
will proceed to execute D and E, in that order (presuming they are not
stolen first). Only when it completes E will it go back to T1 and
steal the task C. So the final ordering of *task creation* is A, B, C,
D, E, but the tasks *begin* execution in a different order: A, B, D,
E, C. This order is influenced by what gets stolen and when.

As it happens, this "per-thread FIFO" behavior is a very good fit for
Stylo. It enables each worker thread to keep a cache in its
thread-local data. When T2 steals the task B, its local cache is
primed to process B's children, i.e., D and E. If T2 were to go and
process C now, it would not benefit at all from the cache built up
when processing B.

In fact, similar logic likely applies to many other applications: if
nothing else, the caches on the CPU itself contain the state accessed
from B, and it is likely that the tasks spawned by B are going to
re-use more of those cache lines than task C. (In general, this is why
we prefer a LIFO behavior, as it offers similar benefits.)

# Guide-level explanation

## New functions and types

We extend the rayon-core API to include two new functions, as well as two
corresponding methods on the `ThreadPool` struct:

- `scope_fifo(..)` and `ThreadPool::scope_fifo`
- `spawn_fifo(..)` and `ThreadPool::spawn_fifo`

These two functions (and methods) are analogous to the existing
[`scope`] and [`spawn`] functions respectively, except that they
ensure **per-thread FIFO** ordering.

The `scope_fifo` function (and method) takes a closure implementing
`FnOnce(&ScopeFifo<'scope>)` as its argument. The `ScopeFifo` struct (also
introduced by this RFC) is analogous to the existing [`Scope`] struct --
it permits one to spawn new tasks that will execute before the
`scope_fifo` function returns. It will offer one method,
`ScopeFifo::spawn_fifo`, that permits one to spawn a (FIFO) task into
the scope, analogous to [`Scope::spawn`].

[`scope`]: https://docs.rs/rayon/1.0.3/rayon/fn.scope.html
[scope_method]: https://docs.rs/rayon/1.0.3/rayon/struct.ThreadPool.html#method.scope
[`spawn`]: https://docs.rs/rayon/1.0.3/rayon/fn.spawn.html
[`Scope`]: https://docs.rs/rayon/1.0.3/rayon/struct.Scope.html
[`Scope::spawn`]: https://docs.rs/rayon/1.0.3/rayon/struct.Scope.html#method.spawn

## Deprecations

The `breadth_first` flag on thread-pools is **deprecated** but (for
now) retains its current behavior. In some future rayon-core release,
it may become a no-op, so users are encouraged to migrate to use
`scope_fifo` or `spawn_fifo` instead.

# Implementation notes

## Implementing `scope_fifo`

Rayon's core thread pool operates on the traditional work-stealing
mechanism, where each worker thread has its own deque. New tasks are
pushed onto the back of the deque, and to obtain a local task, the
thread pops from the back of the deque. When stealing, jobs are taken
from the front of the deque. So how can we extend this to accommodate
the new scheduling modes?

In addition, we have the goal that nested scopes using these modes
should "compose" nicely with one another (and with the `join`
operation)[^global]. So, for example, if we have a thread nesting scopes like:

[^global]: The existing "global FIFO mode" fails miserably on this criterion, which is a partial motivator for this RFC.

- a per-thread LIFO scope S1 that contains a
  - per-thread FIFO scope S2 that contains
    - a join(A, B) of tasks A and B

then we should execute:

- first, the tasks from the join (in reverse order, so A and then B)
- then, the tasks from S2, in the order that they were created
- then, the tasks from S1, in the reverse order from which they were created.

Implementing **Per-thread LIFO** scheduling is very
easy. Each new job is simply pushed directly
onto the back of the deque. Worker threads pop local tasks from the
back of the deque but steal from the front (if the `breadth_first`
flag is true, then worker threads pop local tasks from the front as
well).

Implementing **Per-thread FIFO** requires a certain amount of
indirection. The scope creates N FIFOs, one per worker thread (as of
this writing, the number of worker threads is fixed; if we later add
the ability to grow or shrink the thread-pool dynamically, that will
make this implementation somewhat more complex). When a new task is
pushed onto a FIFO scope by the worker with index W, we actually push two items:

- First, we push the task itself onto the FIFO with index W.
  This task contains the closure that needs to execute.
- Second, we push an "indirect" task onto the worker's thread-local
  deque. This task contains a reference to the FIFO for the worker
  index W that created it, but does not record the actual closure that
  needs to execute.

Like any other task, this "indirect task" may be popped locally or
stolen. In either case, when it executes, it must first find the
closure to execute. To do that, it will find the FIFO for the worker W
that created it and pop the next closure from the front, which it can
then execute. Any new tasks spawned by that closure will be pushed onto
the *current* thread, which is not necessarily the thread W that
created it.
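
The two-item push can be sketched with a small, single-threaded model built on the standard library. Everything named here (`FifoScopeModel`, `resolve`, `replay`) is hypothetical and exists only for this illustration: workers are plain indices, closures are labels, and there is no real concurrency, so this shows the ordering logic rather than the actual rayon-core implementation. The `replay` function scripts the A, B, C, D, E scenario from the Motivation section.

```rust
use std::collections::VecDeque;

/// Toy model of one FIFO scope shared by a fixed number of workers.
struct FifoScopeModel {
    /// One FIFO per worker, holding the real tasks in creation order.
    fifos: Vec<VecDeque<&'static str>>,
    /// One work-stealing deque per worker, holding indirect tasks.
    /// Each entry is the index W of the worker that spawned it.
    deques: Vec<VecDeque<usize>>,
}

impl FifoScopeModel {
    fn new(workers: usize) -> Self {
        FifoScopeModel {
            fifos: vec![VecDeque::new(); workers],
            deques: vec![VecDeque::new(); workers],
        }
    }

    /// Worker `w` spawns `task`: push the task onto FIFO `w`, and push
    /// an indirect task (recording only `w`) onto `w`'s deque.
    fn spawn_fifo(&mut self, w: usize, task: &'static str) {
        self.fifos[w].push_back(task);
        self.deques[w].push_back(w);
    }

    /// Executing an indirect task: take the oldest task of its creator.
    fn resolve(&mut self, creator: usize) -> Option<&'static str> {
        self.fifos[creator].pop_front()
    }

    /// Worker `w` pops an indirect task from the back of its own deque.
    fn pop_local(&mut self, w: usize) -> Option<&'static str> {
        let creator = self.deques[w].pop_back()?;
        self.resolve(creator)
    }

    /// A thief takes an indirect task from the front of `victim`'s deque.
    fn steal(&mut self, victim: usize) -> Option<&'static str> {
        let creator = self.deques[victim].pop_front()?;
        self.resolve(creator)
    }
}

/// Replays the scenario in which T1 spawns A, B, C and T2 steals B.
/// Returns the order in which tasks begin executing.
fn replay() -> Vec<&'static str> {
    let (t1, t2) = (0, 1);
    let mut model = FifoScopeModel::new(2);
    let mut started = Vec::new();

    for task in ["A", "B", "C"] {
        model.spawn_fifo(t1, task);
    }
    started.extend(model.pop_local(t1)); // T1 starts A (and stays busy)
    started.extend(model.steal(t1)); // T2 steals and starts B
    model.spawn_fifo(t2, "D"); // B spawns D and E on T2
    model.spawn_fifo(t2, "E");
    started.extend(model.pop_local(t2)); // T2 runs D
    started.extend(model.pop_local(t2)); // T2 runs E
    started.extend(model.steal(t1)); // T2 goes back and steals C
    started
}
```

Note that `pop_local` still pops the deque from the back, exactly as in LIFO mode; the FIFO order comes entirely from `resolve` taking the front of the creator's FIFO. In this model `replay()` produces A, B, D, E, C, matching the order described earlier.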

Note that the FIFO mode does impose some "indirection overhead"
relative to the LIFO execution mode. This is partly what motivates
keeping LIFO as the default, as well as backwards-compatibility
concerns. In any case, the overhead does not seem to be too large: the
[prototype implementation][] of these ideas was [evaluated
experimentally by @bholley][experiment], who found that it performs
equivalently to today's code.

[prototype implementation]: https://github.com/rayon-rs/rayon/pull/601#issuecomment-433242023
[experiment]: https://github.com/rayon-rs/rayon/pull/601#issuecomment-433242023

## Implementing `spawn_fifo`

The traditional `spawn` function used by Rayon behaves (roughly) "as
if" there were a global scope surrounding the worker thread: presuming
that spawn is executed from inside a worker thread, it simply pushes
the task onto the current thread-local deque. (When executed from
**outside** a worker thread, the task is added to the "global
injector"; it will eventually be picked up by some worker thread and
executed.)

`spawn_fifo` can be implemented in an analogous way to `scope_fifo` by
having each worker thread have a global FIFO, analogous to the FIFOs
created in each `scope_fifo`. `spawn_fifo` then pushes the true task
onto this FIFO as well as an "indirect task" onto the thread-local
deque, exactly as described above.

# Rationale and alternatives

**Why change at all?** Two things are clear:

- There is a need for FIFO-oriented scheduling, to accommodate Stylo.
- The existing, global implementation -- while simple -- has serious
  drawbacks. It works well for Stylo, but it doesn't "compose" well
  with other code that may wish to use parallel iterators or
  join-based invocations.

These two facts together motivate moving to something like the
per-thread FIFO behavior described in this RFC.

**Why offer both FIFO and LIFO modes?** A serious alternative, however,
would be to offer **only** this behavior, and not support per-thread
LIFO -- this is the design which e.g. [@stjepang is advocating
for][stjepang-1]. In general, Rayon prefers offering fewer knobs, so
that would generally be a good fit. However, there are some advantages to
offering more scheduling modes:

[stjepang-1]: https://github.com/rayon-rs/rfcs/pull/1#issuecomment-437074748

- Per-thread LIFO is the **current default behavior** of scopes. This
  is "semi-documented". While not strictly part of our semver
  guarantees, altering this behavior could negatively affect existing
  applications.
- Per-thread LIFO offers the **most efficient** implementation in a
  "micro" sense, as it can build directly on the work-stealing deques
  and does not require any indirection. It also has desirable cache
  behavior, as threads will tend to continue the "most recent thing"
  they were doing, which is also the thing where the caches are
  warmest, etc. **These two points together argue that, for cases where
  the application doesn't otherwise care about execution order,
  per-thread LIFO is a good choice.** This also applies to
  applications that wish to choose another scheduling constraint (see
  the next section).

**Should we offer additional scheduling modes?** Another question
worth considering is "why stop with these two modes"? For example,
the [Rayon demo application for solving the Travelling Salesman
Problem][tsp] also [uses the "indirect task"
trick][tsp-indirect]. However, in that case, it uses a (locked)
priority queue to control the ordering. Similarly, earlier versions of
this RFC considered a "global FIFO" ordering where tasks always
executed in the order they were pushed, regardless of whether they
were stolen or not. Rayon could conceivably offer some more flexible,
priority-queue-like mechanism as well. However, it's not clear that
this is worth doing, since one can always achieve the same effect in
"user space", as the TSP application does. (For this purpose,
per-thread LIFO tasks are a great fit, as they add minimal overhead.)
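
The "user space" approach can be sketched as follows. This is not the TSP demo's code: it is a std-only illustration in which `std::thread::scope` threads stand in for Rayon tasks, and `run_by_priority` is a hypothetical name. Each spawned task carries no work item of its own; when it runs, it pops the highest-priority item from a shared, locked priority queue.

```rust
use std::collections::BinaryHeap;
use std::sync::Mutex;

/// Processes `items` via "indirect tasks" that pull from a shared
/// priority queue. Returns the order in which items were taken.
fn run_by_priority(items: Vec<u32>) -> Vec<u32> {
    let task_count = items.len();
    let queue = Mutex::new(BinaryHeap::from(items));
    let taken = Mutex::new(Vec::new());

    std::thread::scope(|s| {
        for _ in 0..task_count {
            // The spawned task does not know which item it will process;
            // it asks the queue for the best remaining one when it runs.
            s.spawn(|| {
                let mut guard = queue.lock().unwrap();
                if let Some(item) = guard.pop() {
                    // Recorded while the queue lock is held, so the log
                    // reflects the exact order in which items were popped.
                    taken.lock().unwrap().push(item);
                }
            });
        }
    });

    taken.into_inner().unwrap()
}
```

Whatever order the tasks happen to run in, the items come out by priority: for example, `run_by_priority(vec![3, 9, 1, 5])` returns `[9, 5, 3, 1]`. The scheduler only decides *when* a task runs; the queue decides *what* it works on.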

[tsp]: https://github.com/rayon-rs/rayon/blob/a68b05ce524f79d7e7a5065714a8d3ca40ce8d4b/rayon-demo/src/tsp/
[tsp-indirect]: https://github.com/rayon-rs/rayon/blob/a68b05ce524f79d7e7a5065714a8d3ca40ce8d4b/rayon-demo/src/tsp/step.rs#L50-L51

**Should `scope_fifo` use [`Scope`] and not a `ScopeFifo`?** The
[prototype implementation] shares the same `Scope` type for both
`scope_fifo` and `scope`, and stores a boolean value to remember which
"mode" is in use. This RFC proposes a distinct type, which
gives us the freedom to avoid using a dynamic boolean at runtime
(though the implementation is not required to take advantage of that
freedom).

**What to do with the `breadth_first` flag?** Earlier drafts of this
RFC proposed making the `breadth_first` flag a no-op immediately. It
was decided, however, to simply deprecate the flag but keep its current
behavior for the time being. Users of `breadth_first` are nonetheless
encouraged to migrate to `scope_fifo`, since the `breadth_first` flag
may become a no-op in the future (this would simplify the overall
rayon-core implementation somewhat).

# Unresolved questions

None.
