|
| 1 | +# Summary |
| 2 | +[summary]: #summary |
| 3 | + |
| 4 | +- Introduce a new function, `scope_fifo`, which introduces a Rayon scope |
| 5 | + that executes tasks in **per-thread FIFO** order; in this mode, at |
| 6 | + least with one worker thread, the tasks that are spawned first execute |
| 7 | + first. This is in contrast to the traditional Rayon `scope`, which
| 8 | + executes in **per-thread LIFO** order, such that tasks that are |
| 9 | + spawned first execute last. Per-thread FIFO requires a small amount of |
| 10 | + indirection to implement but is important for some use-cases. |
| 11 | +- Introduce a new function, `spawn_fifo`, that pushes tasks onto the |
| 12 | + implicit global scope. These tasks will also execute in FIFO order, |
| 13 | + in contrast to the [existing `spawn` function][spawn]. |
| 14 | +- Deprecate the [existing `breadth_first` flag on `ThreadPool`][bf]. |
| 15 | + Users should migrate to creating a `scope_fifo` instead, as it is better |
| 16 | + behaved. |
| 17 | + - In the future, the `breadth_first` flag may be converted to a no-op. |
| 18 | + |
| 19 | +[spawn]: https://docs.rs/rayon/1.0.3/rayon/fn.spawn.html |
| 20 | +[bf]: https://docs.rs/rayon/1.0.2/rayon/struct.ThreadPoolBuilder.html#method.breadth_first |
| 21 | + |
| 22 | +# Motivation |
| 23 | +[motivation]: #motivation |
| 24 | + |
| 25 | +The prioritization order for tasks can sometimes make a big difference |
| 26 | +to the overall system efficiency. Currently, Rayon offers only a |
| 27 | +single knob for tuning this behavior, in the form of [the |
| 28 | +`breadth_first` option][bf] on `ThreadPoolBuilder`. This knob is not only rather
| 29 | +coarse, it can lead to quite surprising behavior when one intermingles |
| 30 | +`scope` and `join` (including stack overflows, see [#590]). The goal |
| 31 | +of this RFC is to make more options available to users while ensuring |
| 32 | +that these options "compose well" with the overall system. |
| 33 | + |
| 34 | +[#590]: https://github.com/rayon-rs/rayon/issues/590 |
| 35 | + |
| 36 | +## Current behavior: Per-thread LIFO |
| 37 | + |
| 38 | +By default, and presuming no stealing occurs, the current behavior of |
| 39 | +a Rayon scope is to execute tasks in the reverse of the order that |
| 40 | +they are created. Therefore, in the following code, task 2 would |
| 41 | +execute first, and then task 1: |
| 42 | + |
| 43 | +```rust |
| 44 | +rayon::scope(|scope| { |
| 45 | + scope.spawn(|scope| { /* task 1 */ });
| 46 | + scope.spawn(|scope| { /* task 2 */ });
| 47 | +}); |
| 48 | +``` |
| 49 | + |
| 50 | +Thieves, in contrast, steal tasks in the order that they are created, |
| 51 | +so a thief would steal Task 1 before Task 2. Once a task is stolen, |
| 52 | +any new tasks that it spawns are processed first by the thief, again |
| 53 | +in reverse order (but may in turn be stolen by others, again in the |
| 54 | +order of creation). |
| 55 | + |
| 56 | +**Implementation notes.** This behavior corresponds very nicely with |
| 57 | +the general "work stealing" implementation, where each thread has a |
| 58 | +deque of tasks. Each new task is pushed on to the back of the |
| 59 | +deque. The thread pops tasks from the back of the deque when executing |
| 60 | +locally, but steals from the front of the deque. |
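
The push/pop/steal rules above can be modeled with a plain `VecDeque`. This is only a single-threaded sketch: `WorkerDeque` and `local_and_stolen_order` are hypothetical names invented for illustration, and Rayon's real deque is a concurrent structure that this toy does not attempt to reproduce.

```rust
use std::collections::VecDeque;

/// Toy, single-threaded model of one worker's deque.
/// Hypothetical type for illustration; not Rayon's real data structure.
struct WorkerDeque<T> {
    jobs: VecDeque<T>,
}

impl<T> WorkerDeque<T> {
    fn new() -> Self {
        WorkerDeque { jobs: VecDeque::new() }
    }

    /// New tasks always go onto the back.
    fn push(&mut self, job: T) {
        self.jobs.push_back(job);
    }

    /// The owning thread takes the most recently pushed task.
    fn pop_local(&mut self) -> Option<T> {
        self.jobs.pop_back()
    }

    /// A thief takes the oldest task.
    fn steal(&mut self) -> Option<T> {
        self.jobs.pop_front()
    }
}

/// Returns (order seen by the owner, order seen by a thief) for tasks 1 and 2.
fn local_and_stolen_order() -> (Vec<u32>, Vec<u32>) {
    let mut owner = WorkerDeque::new();
    owner.push(1);
    owner.push(2);
    let mut local = Vec::new();
    while let Some(job) = owner.pop_local() {
        local.push(job);
    }

    let mut victim = WorkerDeque::new();
    victim.push(1);
    victim.push(2);
    let mut stolen = Vec::new();
    while let Some(job) = victim.steal() {
        stolen.push(job);
    }
    (local, stolen)
}
```

Calling `local_and_stolen_order()` yields `([2, 1], [1, 2])`: the owner sees the tasks in reverse creation order, while a thief sees them in creation order.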
| 61 | + |
| 62 | +## Per-thread FIFO |
| 63 | + |
| 64 | +Unfortunately, for some applications, executing tasks in the reverse |
| 65 | +order turns out to be undesirable. One such application is Stylo, the
| 66 | +parallel CSS styling engine used in Firefox. The "basic loop" in Stylo is
| 67 | +a simple tree walk that descends the DOM tree, spawning off a task for each |
| 68 | +element: |
| 69 | + |
| 70 | +```rust |
| 71 | +fn style_walk<'scope>( |
| 72 | + element: &'scope Element, |
| 73 | + scope: &rayon::Scope<'scope>, |
| 74 | +) { |
| 75 | + style(element); |
| 76 | + for child in element { |
| 77 | + scope.spawn(move |scope| style_walk(child, scope));
| 78 | + } |
| 79 | +} |
| 80 | +``` |
| 81 | + |
| 82 | +For efficiency, Stylo employs a per-worker-thread cache, which enables |
| 83 | +it to share information between tasks. For this cache to be maximally
| 84 | +effective, it is best to process all of the children of a given |
| 85 | +element first, before processing its "grandchildren" (this is because |
| 86 | +sibling tasks require the same cached information, but grandchildren |
| 87 | +may not). However, if we use the default scheduling strategy of |
| 88 | +per-thread LIFO, this is not what we get: instead, for a given element |
| 89 | +`E`, we would process first its last child `Cn`. Processing `Cn` would |
| 90 | +push more tasks onto the thread-local deque (the grandchildren of `E`) |
| 91 | +and those would be the next to be processed. In effect, we are getting |
| 92 | +a depth-first search when what we wanted was a breadth-first search. |
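
To make the difference concrete, here is a small single-threaded sketch (no stealing) of the walk over a hypothetical tree: element `E` with children `C1` and `C2`, each of which has two children of its own. The `children` and `walk` functions and the node names are invented for illustration. The only thing that changes between the two modes is which end of the worklist the thread pops from.

```rust
use std::collections::VecDeque;

/// Hypothetical DOM: E has children C1 and C2, which have two children each.
fn children(node: &str) -> &'static [&'static str] {
    match node {
        "E" => &["C1", "C2"],
        "C1" => &["G1", "G2"],
        "C2" => &["G3", "G4"],
        _ => &[],
    }
}

/// Walk the tree on one thread, recording the order in which nodes are styled.
/// `fifo == false` pops the newest task (per-thread LIFO);
/// `fifo == true` pops the oldest task (per-thread FIFO).
fn walk(fifo: bool) -> Vec<&'static str> {
    let mut worklist: VecDeque<&'static str> = VecDeque::new();
    worklist.push_back("E");
    let mut order = Vec::new();
    loop {
        let next = if fifo { worklist.pop_front() } else { worklist.pop_back() };
        let node = match next {
            Some(node) => node,
            None => break,
        };
        order.push(node);
        // "Spawn" one task per child, in document order.
        for &child in children(node) {
            worklist.push_back(child);
        }
    }
    order
}
```

Under these assumptions, `walk(false)` visits `E, C2, G4, G3, C1, G2, G1` (depth-first, with siblings separated by grandchildren), whereas `walk(true)` visits `E, C1, C2, G1, G2, G3, G4` (all children before any grandchildren).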
| 93 | + |
| 94 | +To address this, we currently offer a per-threadpool flag called
| 95 | +[`breadth_first`][bf]. This causes us to (globally) process tasks |
| 96 | +from the front of the queue first. This works well for Stylo, but |
| 97 | +[interacts poorly with parallel iterators][#590], as previously |
| 98 | +mentioned. |
| 99 | + |
| 100 | +Instead of a flag on the threadpool, this RFC proposes to allow users |
| 101 | +to select the scheduling when constructing a scope. So Stylo would
| 102 | +create its scope using `rayon::scope_fifo` instead of `rayon::scope`: |
| 103 | + |
| 104 | +```rust |
| 105 | +rayon::scope_fifo(|scope| { |
| 106 | + ... |
| 107 | +}); |
| 108 | +``` |
| 109 | + |
| 110 | +Creating a FIFO scope means that, when a thread goes to process a task |
| 111 | +that is part of the scope, it prefers first the tasks that were |
| 112 | +created earliest **within the current thread**. If we assume no
| 113 | +stealing, then all tasks are created by one thread, and hence this is |
| 114 | +simply a FIFO ordering. |
| 115 | + |
| 116 | +However, when stealing occurs, the ordering can get more complex and |
| 117 | +does not abide by a strict FIFO ordering. To illustrate, imagine that |
| 118 | +a thread T1 creates a scope S and creates three tasks, A, B and C |
| 119 | +within the scope. This thread T1 will begin by executing A, as it is |
| 120 | +the task created first. Let us imagine that processing A takes a very |
| 121 | +long time, and all the rest of the work proceeds before it completes. |
| 122 | + |
| 123 | +Next, imagine that a thief thread T2 steals the task B. In executing |
| 124 | +B, it creates two more tasks, D and E. Once T2 finishes with B, it |
| 125 | +will proceed to execute D and E, in that order (presuming they are not |
| 126 | +stolen first). Only when it completes E will it go back to T1 and |
| 127 | +steal the task C. So the final ordering of *task creation* is A, B, C, |
| 128 | +D, E, but the tasks *begin* execution in a different order: A, B, D, |
| 129 | +E, C. This order is influenced by what gets stolen and when. |
| 130 | + |
| 131 | +As it happens, this "per-thread FIFO" behavior is a very good fit for |
| 132 | +Stylo. It enables each worker thread to keep a cache in its |
| 133 | +thread-local data. When T2 steals the task B, its local cache is |
| 134 | +primed to process B's children, i.e., D and E. If T2 were to go and |
| 135 | +process C now, it would not benefit at all from the cache built up |
| 136 | +when processing B. |
| 137 | + |
| 138 | +In fact, similar logic likely applies to many other applications: if |
| 139 | +nothing else, the caches on the CPU itself contain the state accessed |
| 140 | +from B, and it is likely that the tasks spawned by B are going to |
| 141 | +re-use more of those cache lines than task C. (In general, this is why |
| 142 | +we prefer a LIFO behavior, as it offers similar benefits.) |
| 143 | + |
| 144 | +# Guide-level explanation |
| 145 | + |
| 146 | +## New functions and types |
| 147 | + |
| 148 | +We extend the rayon-core API to include two new functions, as well as two |
| 149 | +corresponding methods on the `ThreadPool` struct: |
| 150 | + |
| 151 | +- `scope_fifo(..)` and `ThreadPool::scope_fifo` |
| 152 | +- `spawn_fifo(..)` and `ThreadPool::spawn_fifo` |
| 153 | + |
| 154 | +These two functions (and methods) are analogous to the existing |
| 155 | +[`scope`] and [`spawn`] functions respectively, except that they |
| 156 | +ensure **per-thread FIFO** ordering. |
| 157 | + |
| 158 | +The `scope_fifo` function (and method) takes a closure implementing |
| 159 | +`FnOnce(&ScopeFifo<'scope>)` as argument. The `ScopeFifo` struct (also |
| 160 | +introduced by this RFC) is analogous to the existing [`Scope`] struct --
| 161 | +it permits one to spawn new tasks that will execute before the |
| 162 | +`scope_fifo` function returns. It will offer one method, |
| 163 | +`ScopeFifo::spawn_fifo`, that permits one to spawn a (FIFO) task into |
| 164 | +the scope, analogous to [`Scope::spawn`]. |
| 165 | + |
| 166 | +[`scope`]: https://docs.rs/rayon/1.0.3/rayon/fn.scope.html |
| 167 | +[scope_method]: https://docs.rs/rayon/1.0.3/rayon/struct.ThreadPool.html#method.scope |
| 168 | +[`spawn`]: https://docs.rs/rayon/1.0.3/rayon/fn.spawn.html |
| 169 | +[`Scope`]: https://docs.rs/rayon/1.0.3/rayon/struct.Scope.html |
| 170 | +[`Scope::spawn`]: https://docs.rs/rayon/1.0.3/rayon/struct.Scope.html#method.spawn |
| 171 | + |
| 172 | +## Deprecations |
| 173 | + |
| 174 | +The `breadth_first` flag on thread-pools is **deprecated** but (for |
| 175 | +now) retains its current behavior. In some future rayon-core release, |
| 176 | +it may become a no-op, so users are encouraged to migrate to use |
| 177 | +`scope_fifo` or `spawn_fifo` instead. |
| 178 | + |
| 179 | +# Implementation notes |
| 180 | + |
| 181 | +## Implementing `scope_fifo` |
| 182 | + |
| 183 | +Rayon's core thread pool operates on the traditional work-stealing |
| 184 | +mechanism, where each worker thread has its own deque. New tasks are |
| 185 | +pushed onto the back of the deque, and to obtain a local task, the |
| 186 | +thread pops from the back of the deque. When stealing, jobs are taken |
| 187 | +from the front of the deque. So how can we extend this to accommodate |
| 188 | +the new scheduling modes? |
| 189 | + |
| 190 | +In addition, we have the goal that nested scopes using these modes |
| 191 | +should "compose" nicely with one another (and with the `join` |
| 192 | +operation)[^global]. So, for example, if we have a thread nesting scopes like: |
| 193 | + |
| 194 | +[^global]: The existing "global FIFO mode" fails miserably on this criterion, which is a partial motivator for this RFC.
| 195 | + |
| 196 | +- a per-thread LIFO scope S1 that contains a |
| 197 | + - per-thread FIFO scope S2 that contains |
| 198 | + - a join(A, B) of tasks A and B |
| 199 | + |
| 200 | +then we should execute: |
| 201 | + |
| 202 | +- first, the tasks from the join (in reverse order, so A and then B) |
| 203 | +- then, the tasks from S2, in the order that they were created |
| 204 | +- then, the tasks from S1, in the reverse order from which they were created. |
| 205 | + |
| 206 | +Implementing **Per-thread LIFO** scheduling is therefore very |
| 207 | +easy. Each new job pushed onto the stack is simply pushed directly |
| 208 | +onto the back of the deque. Worker threads pop local tasks from the |
| 209 | +back of the deque but steal from the front (if the `breadth_first` |
| 210 | +flag is true, then worker threads pop local tasks from the front as |
| 211 | +well). |
| 212 | + |
| 213 | +Implementing **Per-thread FIFO** requires a certain amount of |
| 214 | +indirection. The scope creates N FIFOs, one per worker thread (as of |
| 215 | +this writing, the number of worker threads is fixed; if we later add |
| 216 | +the ability to grow or shrink the thread-pool dynamically, that will |
| 217 | +make this implementation somewhat more complex). When a new task is |
| 218 | +pushed onto a FIFO scope by the worker with index W, we actually push two items: |
| 219 | + |
| 220 | +- First, we push the task itself onto the FIFO with index W. |
| 221 | + This task contains the closure that needs to execute. |
| 222 | +- Second, we push an "indirect" task onto the worker's thread-local |
| 223 | + deque. This task contains a reference to the FIFO for the worker |
| 224 | + index W that created it, but does not record the actual closure that |
| 225 | + needs to execute. |
| 226 | + |
| 227 | +Like any other task, this "indirect task" may be popped locally or |
| 228 | +stolen. In either case, when it executes, it must first find the |
| 229 | +closure to execute. To do that, it will find the FIFO for the worker W |
| 230 | +that created it and pop the next closure from the front, which it can |
| 231 | +then execute. Any new tasks pushed by this task T will be pushed onto |
| 232 | +the *current* thread, which is not necessarily the thread W that |
| 233 | +created it. |
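
The two-item push can be traced with a deterministic, single-threaded simulation. This is a sketch under stated assumptions, not the prototype's code: `Sim`, `Indirect`, and `fifo_scope_trace` are hypothetical names, tasks are represented only by their labels, and "stealing" is scripted by hand to replay the A/B/C/D/E scenario from the motivation section (T1 is worker 0, T2 is worker 1).

```rust
use std::collections::VecDeque;

/// An "indirect" job: it records only which worker's FIFO to pop from.
#[derive(Clone, Copy)]
struct Indirect {
    fifo_index: usize,
}

/// Hypothetical model of one FIFO scope running on a fixed set of workers.
struct Sim {
    /// One work-stealing deque per worker (back = most recently pushed).
    deques: Vec<VecDeque<Indirect>>,
    /// One FIFO per worker, owned by the scope; holds the real tasks (labels here).
    fifos: Vec<VecDeque<&'static str>>,
    /// Order in which tasks began executing.
    log: Vec<&'static str>,
}

impl Sim {
    fn new(workers: usize) -> Self {
        Sim {
            deques: (0..workers).map(|_| VecDeque::new()).collect(),
            fifos: (0..workers).map(|_| VecDeque::new()).collect(),
            log: Vec::new(),
        }
    }

    /// Worker `w` spawns `task` into the scope: push the task onto FIFO `w`
    /// and an indirect job onto worker `w`'s deque.
    fn spawn_fifo(&mut self, w: usize, task: &'static str) {
        self.fifos[w].push_back(task);
        self.deques[w].push_back(Indirect { fifo_index: w });
    }

    /// Local pop: take from the back of the worker's own deque.
    fn pop_local(&mut self, w: usize) -> Option<Indirect> {
        self.deques[w].pop_back()
    }

    /// Steal: take from the front of the victim's deque.
    fn steal(&mut self, victim: usize) -> Option<Indirect> {
        self.deques[victim].pop_front()
    }

    /// Executing an indirect job pops the oldest task from the FIFO of the
    /// worker that created it, regardless of which worker runs it.
    fn run(&mut self, job: Indirect) {
        let task = self.fifos[job.fifo_index]
            .pop_front()
            .expect("each indirect job has exactly one matching task");
        self.log.push(task);
    }
}

/// Replays the scripted scenario and returns the order in which tasks began.
fn fifo_scope_trace() -> Vec<&'static str> {
    let (t1, t2) = (0, 1);
    let mut sim = Sim::new(2);
    for task in ["A", "B", "C"] {
        sim.spawn_fifo(t1, task);
    }
    // T1 pops locally and begins A, which is assumed to run for a long time.
    let job = sim.pop_local(t1).unwrap();
    sim.run(job);
    // T2 steals from T1 and gets B; while running, B spawns D and E on T2.
    let job = sim.steal(t1).unwrap();
    sim.run(job);
    sim.spawn_fifo(t2, "D");
    sim.spawn_fifo(t2, "E");
    // T2 drains its own deque before stealing again.
    while let Some(job) = sim.pop_local(t2) {
        sim.run(job);
    }
    // Only now does T2 go back to T1, where it finds C.
    if let Some(job) = sim.steal(t1) {
        sim.run(job);
    }
    sim.log
}
```

Note that the indirect job popped from the *back* of T1's deque still yields task A, because the closure is always taken from the *front* of the FIFO. In this scripted run, `fifo_scope_trace()` returns `A, B, D, E, C`, matching the ordering described in the motivation.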
| 234 | + |
| 235 | +Note that the FIFO mode does impose some "indirection overhead" |
| 236 | +relative to the LIFO execution mode. This is partly what motivates the |
| 237 | +default, as well as backwards compatibility concerns. In any case, the
| 238 | +overhead does not seem to be too large: the [prototype |
| 239 | +implementation][] of these ideas was [evaluated experimentally by |
| 240 | +@bholley][experiment], who found that it performs equivalently to |
| 241 | +today's code. |
| 242 | + |
| 243 | +[prototype implementation]: https://github.com/rayon-rs/rayon/pull/601#issuecomment-433242023 |
| 244 | +[experiment]: https://github.com/rayon-rs/rayon/pull/601#issuecomment-433242023 |
| 245 | + |
| 246 | +## Implementing `spawn_fifo` |
| 247 | + |
| 248 | +The traditional `spawn` function used by Rayon behaves (roughly) "as |
| 249 | +if" there were a global scope surrounding the worker thread: presuming |
| 250 | +that spawn is executed from inside a worker thread, it simply pushes |
| 251 | +the task onto the current thread-local deque. (When executed from |
| 252 | +**outside** a worker thread, the task is added to the "global |
| 253 | +injector"; it will eventually be picked up by some worker thread and |
| 254 | +executed.) |
| 255 | + |
| 256 | +`spawn_fifo` can be implemented in an analogous way to `scope_fifo` by |
| 257 | +having each worker thread have a global FIFO, analogous to the FIFOs |
| 258 | +created in each `scope_fifo`. `spawn_fifo` then pushes the true task |
| 259 | +onto this FIFO as well as an "indirect task" onto the thread-local |
| 260 | +deque, exactly as described above. |
| 261 | + |
| 262 | +# Rationale and alternatives |
| 263 | + |
| 264 | +**Why change at all?** Two things are clear: |
| 265 | + |
| 266 | +- There is a need for FIFO-oriented scheduling, to accommodate Stylo. |
| 267 | +- The existing, global implementation -- while simple -- has serious |
| 268 | + drawbacks. It works well for Stylo, but it doesn't "compose" well |
| 269 | + with other code that may wish to use parallel iterators or
| 270 | + join-based invocations. |
| 271 | + |
| 272 | +These two facts together motivate moving to something like the |
| 273 | +per-thread FIFO behavior described in this RFC. |
| 274 | + |
| 275 | +**Why offer both FIFO and LIFO modes?** A serious alternative however |
| 276 | +would be to offer **only** this behavior, and not support per-thread |
| 277 | +LIFO -- this is the design which e.g. [@stjepang is advocating |
| 278 | +for][stjepang-1]. In general, Rayon prefers offering fewer knobs, so |
| 279 | +that would be a general fit. However, there are some advantages to |
| 280 | +offering more scheduling modes: |
| 281 | + |
| 282 | +[stjepang-1]: https://github.com/rayon-rs/rfcs/pull/1#issuecomment-437074748 |
| 283 | + |
| 284 | +- Per-thread LIFO is the **current default behavior** of scopes. This |
| 285 | + is "semi-documented". While not strictly part of our semver |
| 286 | + guarantees, altering this behavior could negatively affect existing |
| 287 | + applications. |
| 288 | +- Per-thread LIFO offers the **most efficient** implementation in a |
| 289 | + "micro" sense, as it can build directly on the work-stealing deques |
| 290 | + and does not require any indirection. It also has desirable cache |
| 291 | + behavior, as threads will tend to continue the "most recent thing"
| 292 | + they were doing, which is also the thing where the caches are
| 293 | + warmest, etc. **These two cases together argue that, for cases where |
| 294 | + the application doesn't otherwise care about execution order, |
| 295 | + per-thread LIFO is a good choice.** This also applies to |
| 296 | + applications that wish to choose another scheduling constraint (see |
| 297 | + the next section). |
| 298 | + |
| 299 | +**Should we offer additional scheduling modes?** Another question |
| 300 | +worth considering is "why stop with these two modes"? For example, |
| 301 | +the [Rayon demo application for solving the Travelling Salesman |
| 302 | +Problem][tsp] also [uses the "indirect task" |
| 303 | +trick][tsp-indirect]. However, in that case, it uses a (locked) |
| 304 | +priority queue to control the ordering. Similarly, earlier versions of |
| 305 | +this RFC considered a "global FIFO" ordering where tasks always |
| 306 | +executed in the order they were pushed, regardless of whether they |
| 307 | +were stolen or not. Rayon could conceivably offer some more flexible, |
| 308 | +priority-queue like mechanism as well. However, it's not clear that |
| 309 | +this is worth doing, since one can always achieve the same effect in |
| 310 | +"user space", as the TSP application does. (For this purpose, |
| 311 | +per-thread LIFO tasks are a great fit, as they add minimal overhead.) |
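
As a rough illustration of the "user space" approach, the sketch below keeps the real tasks in a locked priority queue and has each indirect job simply run whatever is currently at the top. It is not taken from the TSP demo: `PriorityTasks`, `run_next`, and `priority_order` are hypothetical names, tasks are represented by labels, and the indirect jobs are executed sequentially here rather than being spawned onto a Rayon scope.

```rust
use std::collections::BinaryHeap;
use std::sync::Mutex;

/// Hypothetical shared, locked priority queue of (priority, task label).
/// `BinaryHeap` is a max-heap, so larger priorities come out first.
struct PriorityTasks {
    heap: Mutex<BinaryHeap<(u32, &'static str)>>,
}

impl PriorityTasks {
    fn new() -> Self {
        PriorityTasks { heap: Mutex::new(BinaryHeap::new()) }
    }

    /// Enqueue the real task. In a Rayon program, the caller would also
    /// spawn one ordinary (LIFO) task that later calls `run_next`.
    fn push(&self, priority: u32, task: &'static str) {
        self.heap.lock().unwrap().push((priority, task));
    }

    /// Body of an indirect job: take the highest-priority task, whichever
    /// job originally enqueued it.
    fn run_next(&self) -> Option<&'static str> {
        self.heap.lock().unwrap().pop().map(|(_, task)| task)
    }
}

/// Pushes three tasks and then runs one indirect job per task, in sequence.
fn priority_order() -> Vec<&'static str> {
    let tasks = PriorityTasks::new();
    tasks.push(1, "low");
    tasks.push(9, "high");
    tasks.push(5, "mid");
    let mut order = Vec::new();
    while let Some(task) = tasks.run_next() {
        order.push(task);
    }
    order
}
```

Here `priority_order()` returns `high, mid, low`: the order in which the indirect jobs run no longer matters, only the state of the queue when each one executes.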
| 312 | + |
| 313 | +[tsp]: https://github.com/rayon-rs/rayon/blob/a68b05ce524f79d7e7a5065714a8d3ca40ce8d4b/rayon-demo/src/tsp/ |
| 314 | +[tsp-indirect]: https://github.com/rayon-rs/rayon/blob/a68b05ce524f79d7e7a5065714a8d3ca40ce8d4b/rayon-demo/src/tsp/step.rs#L50-L51 |
| 315 | + |
| 316 | +**Should `scope_fifo` return a [`Scope`] and not a `ScopeFifo`?** The |
| 317 | +[prototype implementation] shares the same `Scope` type for both |
| 318 | +`scope_fifo` and `scope`, and stores a boolean value to remember which |
| 319 | +"mode" is in use. This RFC proposes a distinct return type, which |
| 320 | +gives us the freedom to avoid using a dynamic boolean at runtime |
| 321 | +(though the implementation is not required to take advantage of that |
| 322 | +freedom). |
| 323 | + |
| 324 | +**What to do with the `breadth_first` flag?** Earlier drafts of this |
| 325 | +RFC proposed making the `breadth_first` flag a no-op immediately. It |
| 326 | +was decided, however, to simply deprecate the flag but keep its current
| 327 | +behavior for the time being. Users of `breadth_first` are nonetheless
| 328 | +encouraged to migrate to `scope_fifo`, since the `breadth_first` flag
| 329 | +may become a no-op in the future (this would simplify the overall |
| 330 | +rayon-core implementation somewhat). |
| 331 | + |
| 332 | +# Unresolved questions |
| 333 | + |
| 334 | +None. |