Conversation

@nikomatsakis
Member

nikomatsakis commented Jun 10, 2017

These introduce two new core APIs that were suggested by @bholley for use in Servo (Stylo, in particular). They are insta-stable because I want to make a public release so that Servo can make use of them.

The APIs are:

  • Introduce a new "breadth-first" scheduling mode that can be enabled per thread-pool. In breadth-first mode, both the current thread and stealer threads draw from the bottom of the deque. Depending on your workload, this may be desirable (in the case of stylo, it leads to better cache reuse). Users can opt into breadth-first mode by adjusting the configuration.
  • Introduce a new spawn_tail() API, intended for the last task that will be spawned. It is a slight optimization over calling spawn and then immediately returning from the scope, particularly in breadth-first mode: it invokes the closure directly if that closure would have been the next task that the local thread executed anyway.
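To make the scheduling difference concrete, here is a minimal, self-contained sketch (plain Rust, no Rayon dependency; the Worker type and task names are invented for illustration) of how the two modes draw from a worker's deque:

```rust
use std::collections::VecDeque;

/// Toy model of a worker thread's task deque. Spawned tasks are pushed
/// on one end ("the top"); the mode decides which end the local worker
/// pops from.
struct Worker {
    deque: VecDeque<&'static str>,
    breadth_first: bool,
}

impl Worker {
    fn push(&mut self, task: &'static str) {
        self.deque.push_back(task); // spawn: push onto the top
    }

    fn next_task(&mut self) -> Option<&'static str> {
        if self.breadth_first {
            // Breadth-first: draw from the bottom, like a stealer would.
            self.deque.pop_front()
        } else {
            // Depth-first (the default): pop the most recently pushed task.
            self.deque.pop_back()
        }
    }
}

fn run(breadth_first: bool) -> Vec<&'static str> {
    let mut w = Worker { deque: VecDeque::new(), breadth_first };
    w.push("a");
    w.push("b");
    w.push("c");
    let mut order = Vec::new();
    while let Some(t) = w.next_task() {
        order.push(t);
    }
    order
}

fn main() {
    assert_eq!(run(false), vec!["c", "b", "a"]); // depth-first: LIFO
    assert_eq!(run(true), vec!["a", "b", "c"]);  // breadth-first: FIFO
    println!("ok");
}
```

In depth-first mode the local thread pops what it just pushed, staying within a subtree; in breadth-first mode it consumes tasks in spawn order, like a stealer, which is the behavior that helped Stylo's cache reuse.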

These changes introduce an if into one of Rayon's main hot paths, so I was a bit nervous about performance. However, keep in mind that the if is 100% predictable in normal usage (i.e., if you don't mix breadth- and depth-first modes). Both @bholley and I took measurements that suggest basically zero impact on depth-first code. Moreover, using the new modes leads to a 20% improvement in performance for Stylo. The Stylo measurements are documented here. My own measurements, gathered on my 8x2-core Linux desktop, are visible here:

In the median view, you can see some effect, but it is only a few percent, and I am not that confident in my measurement setup. =) Because of my inability to convince gnuplot to use colors etc. in median mode (I have to tinker with that more), it's a bit hard to tell which benchmarks were affected; but if you hover the cursor over the chart, you can see that join_recursively, increment_all_max, and increment_all took more time, whereas fibonacci and increment_all_min took less. All of these are micro-benchmarks.

I was considering putting these behind a cargo feature (perhaps one enabled by default), so that users could "opt-out" from having the if. This seems like a plausible path, but the measurements suggest to me that it's not necessary, and it seems better to avoid having "options" if we can (so as to reduce the need to measure an exponentially increasing number of combinations).

@nikomatsakis
Member Author

I am now re-considering the spawn_tail() API in particular. Perhaps it would be better to expose a method on Scope that simply tells you whether the local deque is empty or not. In that case, code using BFS mode that used to call scope.spawn_tail(closure) can instead do:

if scope.local_thread_has_pending_tasks() {
    scope.spawn(closure);
} else {
    closure();
}

There are two reasons for this:

  • spawn_tail doesn't make sense in depth-first mode at all, since it is equivalent to just calling the closure.
  • even in BF-mode, you probably would prefer to use a loop or something to avoid making your stack grow deeply.

For example:

let mut parent = some_node();
loop {
    let work_items = parent.children();

    // avoid spawning a task if (a) no other work is pending and (b) there is
    // only one child to process
    if work_items.len() == 1 && !scope.local_thread_has_pending_tasks() {
        parent = work_items[0];
        continue;
    }

    for child in work_items { scope.spawn(move || process(child)); }
    break;
}

@bholley -- thoughts?

if !self.breadth_first {
    self.worker.pop()
} else {
    self.stealer.steal()
}
@stjepang

I've just published a new version of Coco that introduces Worker::steal method. Now you can call self.worker.steal() here. This change will give some performance wins in breadth-first mode. I'd be curious to know how much :)
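The suggested change, sketched against the snippet above (field names follow the excerpt; the coco Worker::steal method is the one mentioned in this comment):

```rust
// Before: breadth-first mode went through the stealer handle.
// After: coco's new Worker::steal lets the worker take from the bottom
// of its own deque directly, avoiding the stealer indirection:
if !self.breadth_first {
    self.worker.pop()    // depth-first: pop the top (LIFO)
} else {
    self.worker.steal()  // breadth-first: take from the bottom (FIFO)
}
```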

@nikomatsakis
Member Author

Ah, nice!

@nikomatsakis
Member Author

@stjepang I updated to use the new APIs. @bholley will have to re-run the measurements, though, since we don't actually have any benchmarks employing BF mode. (I've been meaning to make a pared-down version of stylo for that purpose, but haven't gotten around to it.)

@nikomatsakis
Member Author

nikomatsakis commented Jun 10, 2017

Regarding the "local thread has pending tasks" idea, I guess the best thing would be to expose it as a free function, rather like current_thread_index (well, that's presently only a method on ThreadPool, but I think we should rectify that). It would probably be called pending_local_tasks() or something, with a comment explaining that it is an approximation indicating whether the current thread has published unstolen tasks. Because of the inherent raciness, a return of false is meaningful here, but a return of true may not be.

Interestingly, I've seen this proposed elsewhere as a good heuristic for whether to spawn a new task in general. In that work, it was called "being hungry", I think. The idea was: if you have no pending tasks, then this implies that other threads are "hungry" because they've stolen all of your tasks. But if you do have pending tasks, you don't necessarily want to go generating more, since they are unlikely to be consumed.
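As a toy illustration of that "hungry" heuristic (all names here are invented; this is not Rayon's API), spawning only when the local deque is empty looks like:

```rust
use std::collections::VecDeque;

/// Toy model of the "hungry" heuristic: publish a new task only when the
/// local deque is empty, i.e., stealers have drained everything we had.
struct LocalQueue {
    deque: VecDeque<u32>,
}

impl LocalQueue {
    fn has_pending_tasks(&self) -> bool {
        !self.deque.is_empty()
    }

    /// Returns true if we decided to publish a new task.
    fn maybe_spawn(&mut self, task: u32) -> bool {
        if self.has_pending_tasks() {
            // Existing tasks are unlikely to be consumed soon; don't add more.
            false
        } else {
            // No pending tasks: other threads are "hungry", so publish.
            self.deque.push_back(task);
            true
        }
    }
}

fn main() {
    let mut q = LocalQueue { deque: VecDeque::new() };
    assert!(q.maybe_spawn(1));  // empty, so spawn
    assert!(!q.maybe_spawn(2)); // task 1 still pending, so don't spawn
    q.deque.pop_front();        // pretend a stealer took task 1
    assert!(q.maybe_spawn(3));  // empty again, so spawn
    println!("ok");
}
```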

(I'm a bit wary of 'locking in' on work-stealing here, but this seems like an API that we could continue to support even if we changed to something else, even if it meant always returning false or true. And anyway I can't really imagine changing to something else.)

@bholley
Contributor

bholley commented Jun 10, 2017

Yes, I would prefer the accessor over spawn_tail; it would be more ergonomic for our use-case.

@nikomatsakis
Member Author

@bholley I pushed a change to current_thread_has_pending_tasks(), which is offered as a method on the ThreadPool (much like current_thread_index()).

I tend to think we should mirror all of these methods at the top-level as well. The main reason that I've been hesitating (particularly with respect to current_thread_index()) is that I was wary that if we ever changed our thread-pool strategy to dynamically change the number of threads, this would invalidate a lot of existing code. As long as the methods are restricted to the thread-pool, this seems harmless enough, since we could make it an "opt-in" bit of configuration (but make it the default for the global thread-pool).

But I am beginning to wonder if we will ever do this change -- or at least if we would ever do it without some form of opt-in. There doesn't seem to be much reason to do it, except to try to scale gracefully with blocking I/O tasks. And based on conversations I've had with others, that is best done via some opt-in APIs that wrap the blocking I/O, which would serve the purpose.

Therefore, I think I'd like to ensure that -- to the extent it makes sense -- the API surface of ThreadPool is mirrored with top-level functions that act on whatever the currently executing thread-pool is. That is, if you do thread_pool.foo(), then it is local to the given thread-pool, but foo() just works in your current context, whatever that is.
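The mirroring pattern might look roughly like this sketch (the thread-local plumbing here is invented for illustration; Rayon's real implementation tracks per-thread registry state differently):

```rust
use std::cell::Cell;

thread_local! {
    // Index of the current worker thread within its pool, if any.
    // A pool would set this when it starts each worker thread.
    static CURRENT_INDEX: Cell<Option<usize>> = Cell::new(None);
}

/// Free-function form: answers for whatever pool the calling thread
/// happens to be executing in (None if not on a worker thread).
pub fn current_thread_index() -> Option<usize> {
    CURRENT_INDEX.with(|c| c.get())
}

pub struct ThreadPool;

impl ThreadPool {
    /// Method form: same answer, scoped to this pool's worker threads,
    /// so it simply delegates to the free function.
    pub fn current_thread_index(&self) -> Option<usize> {
        current_thread_index()
    }
}

fn main() {
    assert_eq!(current_thread_index(), None); // not on a worker thread
    CURRENT_INDEX.with(|c| c.set(Some(3)));   // pretend we're worker #3
    assert_eq!(current_thread_index(), Some(3));
    println!("ok");
}
```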

(@cuviper, what do you think of this reasoning?)

@bholley
Contributor

bholley commented Jun 12, 2017

SGTM. FWIW I think that making the thread pools dynamic would be a pretty surprising change, and we'd probably want it to be opt-in.

@nikomatsakis
Member Author

@bholley

FWIW I think that making the thread pools dynamic would be a pretty surprising change, and we'd probably want it to be opt-in.

Yes, and also I'm not keen on the "default" configuration (i.e., Configuration::new()) and the configuration that the default scheduler uses diverging from one another.

@nikomatsakis
Member Author

nikomatsakis commented Jun 12, 2017

I've mirrored the current_foo() APIs at the global level now. I carried over the existing "future compatibility note" warnings, though I suspect that we would not make such a change. Maybe we could remove them, but...

@cuviper
Member

Overall I'm good with this. I think it's a little hard to reason about when it would be beneficial, but the docs are sufficiently squishy about promising anything that I think we can just say "measure it".

There are a few simple errors noted in CI here, though oddly only on OSX... how did Linux pass!?! The more recent CI run appears to be stuck in the queue...

///
/// [m]: struct.ThreadPool.html#method.current_thread_has_pending_tasks
#[inline]
pub fn current_thread_has_pending_tasks(&self) -> Option<bool> {

There's no &self in free functions...

mod util;

pub use thread_pool::ThreadPool;
pub use thread_pool::current_num_threads;

There's already a current_num_threads below, and it's a little different because it reports on the global pool if you're not in a pool already.

if curr.is_null() {
    None
} else {
    Some((*curr).registry.num_threads())
}

The field registry is private.

}

#[inline]
pub fn breadth_first(&self) -> bool {

Is this method used anywhere?

@nikomatsakis force-pushed the servo-bfs-and-spawn_tail branch from 19d71d7 to 7b25a39 on June 12, 2017 21:32
@nikomatsakis
Member Author

@cuviper um yeah obviously I didn't really build that. I pushed a new revision that addresses your various points. I removed the new current_num_threads in favor of the existing one. I had forgotten we made that available! I guess it's fine that it returns usize instead of Option<usize> -- that's probably what you want anyway, I imagine (i.e., you can measure the number of threads before you start your parallel operation, and this tells you how many threads you will have, unless you switch thread-pools in the meantime).
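As a small illustration of why a plain usize is handy here, a caller can size work up front from the thread count (the chunk_size helper is invented for illustration):

```rust
/// Ceiling division: split `len` items into at most `n` nearly even
/// chunks, the kind of sizing a caller might do after querying the
/// pool's thread count before starting a parallel operation.
fn chunk_size(len: usize, n: usize) -> usize {
    (len + n - 1) / n
}

fn main() {
    // e.g. with 16 threads (as on the 8x2-core desktop mentioned above):
    assert_eq!(chunk_size(100, 16), 7);
    assert_eq!(chunk_size(16, 16), 1);
    println!("ok");
}
```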

@cuviper
Member

Looks good!

@cuviper cuviper merged commit 3eaba54 into master Jun 13, 2017
@cuviper cuviper deleted the servo-bfs-and-spawn_tail branch October 6, 2017 21:39