Commit 2b49049

Spin until docker is running
This spins indefinitely if docker is down, preventing total experiment failure in that case. For details on why this strategy is chosen, see the comment added.
1 parent 3edd485 commit 2b49049

1 file changed: +23 -2 lines changed


src/runner/mod.rs

Lines changed: 23 additions & 2 deletions
@@ -56,8 +56,29 @@ pub fn run_ex<DB: WriteResults + Sync>(
     threads_count: usize,
     config: &Config,
 ) -> Fallible<()> {
-    if !rustwide::cmd::docker_running(workspace) {
-        return Err(err_msg("docker is not running"));
+    // Attempt to spin indefinitely until docker is up. Ideally, we would
+    // decommission this agent until docker is up, instead of leaving the
+    // assigned crates to 'hang' until we get our act together. In practice, we
+    // expect workers to be around most of the time (just sometimes being
+    // restarted etc.) and so the assigned crates shouldn't hang for long.
+    //
+    // If we return an Err(...) from this function, then currently that is
+    // treated as a hard failure of the underlying experiment, but this error
+    // has nothing to do with the experiment, so shouldn't be reported as such.
+    //
+    // In the future we'll want to *alert* on this error so that a human can
+    // investigate, but the hope is that in practice docker is just being slow
+    // or similar and this will fix itself, which currently makes the most sense
+    // given low human resources. Additionally, it'll be indirectly alerted
+    // through the worker being "down" according to our progress metrics, since
+    // jobs won't be completed.
+    let mut i = 0;
+    while !rustwide::cmd::docker_running(workspace) {
+        log::error!(
+            "docker is not currently up, waiting for it to start (tried {} times)",
+            i
+        );
+        i += 1;
     }
 
     info!("computing the tasks graph...");
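
The loop added in this commit calls docker_running again immediately after each failed check, with no explicit pause, so the retry and logging rate is bounded only by how long each check takes. A minimal sketch of the same wait with a capped exponential backoff between polls follows. It is not part of the commit: wait_until_ready, its parameters, and the closure standing in for rustwide::cmd::docker_running(workspace) are hypothetical names used for illustration, and eprintln! replaces log::error! so the example has no dependencies.

```rust
use std::thread;
use std::time::Duration;

/// Polls `is_ready` until it returns true, sleeping between attempts.
/// The delay doubles after each failed attempt, capped at `max_delay`.
/// Returns the number of failed attempts before the check succeeded.
/// (Hypothetical helper, not an API of crater or rustwide.)
fn wait_until_ready<F>(mut is_ready: F, initial_delay: Duration, max_delay: Duration) -> u32
where
    F: FnMut() -> bool,
{
    let mut attempts: u32 = 0;
    let mut delay = initial_delay;
    while !is_ready() {
        attempts = attempts.saturating_add(1);
        eprintln!(
            "docker is not currently up, waiting {:?} before retrying (tried {} times)",
            delay, attempts
        );
        thread::sleep(delay);
        // Back off so a long outage does not flood the logs.
        delay = (delay * 2).min(max_delay);
    }
    attempts
}

fn main() {
    // Stand-in for the real check: reports "down" for the first three polls.
    let mut polls = 0;
    let failed = wait_until_ready(
        || {
            polls += 1;
            polls > 3
        },
        Duration::from_millis(10),
        Duration::from_millis(40),
    );
    println!("docker came up after {} failed attempts", failed);
}
```

Like the committed loop, this never returns an error, so a Docker outage still cannot be reported as an experiment failure. The cap on the delay keeps the worker responsive once Docker comes back.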
