Priority inversion when executing jobs of heterogeneous length

We're using rayon to power our mathematical software for HPC and we're running into this issue: SpectralSequences/sseq#105. Here's a high-level overview.

We have two kinds of jobs: long ones that take several hours, and short ones that take milliseconds. Short jobs are spawned by `par_iter_mut`; they are independent and do not spawn other jobs themselves. Long jobs, on the other hand, have nontrivial dependencies on each other. We schedule them using channels and spawn them using `spawn` on an `in_place_scope`. Each long job spawns potentially millions of short jobs that need to finish before the long job returns. The inversion happens in the following situation:

- Thread 1 starts computing long job A.
- Thread 1 enters the `par_iter_mut`, which calls `join_context`, which itself spawns (say) two short jobs a and b.
- Thread 1 starts executing a, while thread 2 starts b.
- Some other thread 3 finishes a long job, and now a new long job B is available.
- Thread 1 finishes a, sees that thread 2 is still busy with b, and starts looking for other things to do in the meantime.
- Thread 1 picks up job B.
- Thread 2 finishes, but by then thread 1 is stuck in B, and job A stalls.
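To make the cost of that last step concrete, here is a toy model of the timeline in plain Rust. It is not rayon code and not a measurement: the function `a_resumes_at` and the durations are made up for illustration. The only assumption it encodes is that a job stolen while waiting runs on top of A's stack frame, so A cannot resume until the stolen job returns.

```rust
/// Toy model of the timeline above, in abstract milliseconds (hypothetical helper).
/// Thread 1 finishes short job `a` at `a_done`; thread 2 finishes `b` at `b_done`.
/// `stolen_long_job` is the duration of a long job that thread 1 picks up while waiting.
fn a_resumes_at(a_done: u64, b_done: u64, stolen_long_job: Option<u64>) -> u64 {
    match stolen_long_job {
        // Thread 1 only waits for b: A resumes once both halves are done.
        None => a_done.max(b_done),
        // Thread 1 started the stolen job at `a_done` and must run it to completion.
        Some(duration) if b_done > a_done => (a_done + duration).max(b_done),
        // b was already done, so thread 1 never went looking for other work.
        Some(_) => a_done,
    }
}

fn main() {
    // Short jobs take milliseconds; assume the long job B takes three hours.
    let three_hours_ms = 3 * 60 * 60 * 1000;
    let without_steal = a_resumes_at(2, 5, None);
    let with_steal = a_resumes_at(2, 5, Some(three_hours_ms));
    println!("A resumes at {} ms if thread 1 just waits for b", without_steal);
    println!("A resumes at {} ms if thread 1 picks up B", with_steal);
}
```

In this model a wait of a few milliseconds turns into a stall as long as B itself, which matches what we see in the logs.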

Note that this is different from #630 because in our case the short jobs are called by the long jobs, and I also don't see how affinity would solve this problem.

I tried building different threadpools in https://github.com/JoeyBF/sseq/tree/bandaid, and while the performance seems to improve, the logs still show that long jobs sometimes stall.

My proposal is to implement priorities in a very naive way. It would be sufficient to have the ability to tag jobs with priority values (say an `i32` defaulting to 0) and to have access to a new `join_priority` function that only allows a waiting thread to steal jobs of equal or higher priority relative to the job it just finished.
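As a rough sketch of the rule I have in mind, here it is modelled with a plain `Vec` instead of rayon's real deques. Everything in it is hypothetical: `Job`, `steal_for_waiter`, and the choice to give short jobs priority 1 while long jobs keep the default 0 are assumptions for illustration, and `join_priority` does not exist in rayon today.

```rust
/// Hypothetical job record; `priority` mirrors the proposed `i32` tag (default 0).
#[derive(Debug, Clone, PartialEq)]
struct Job {
    name: &'static str,
    priority: i32,
}

/// Toy stand-in for the proposed stealing rule: a thread that just finished a job
/// of priority `finished_priority` and is now waiting may only take a queued job
/// of equal or higher priority. Removes and returns the first eligible job.
fn steal_for_waiter(queue: &mut Vec<Job>, finished_priority: i32) -> Option<Job> {
    let idx = queue
        .iter()
        .position(|job| job.priority >= finished_priority)?;
    Some(queue.remove(idx))
}

fn main() {
    // Assumption: long jobs keep the default priority 0, short jobs are tagged 1.
    let mut queue = vec![
        Job { name: "long job B", priority: 0 },
        Job { name: "short job c", priority: 1 },
    ];
    // Thread 1 just finished a short job (priority 1) and is waiting on its sibling.
    match steal_for_waiter(&mut queue, 1) {
        Some(job) => println!("thread 1 steals: {}", job.name),
        None => println!("thread 1 finds nothing eligible and keeps waiting"),
    }
    println!("still queued: {:?}", queue);
}
```

With this filter, thread 1 can still help with other short jobs while it waits, but it skips B, so A resumes as soon as thread 2 finishes.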

Priority inversion when executing jobs of heterogeneous length #957
