Improve CI stability #5022
Conversation
Initial findings
Statistical analysis
Details: https://drive.google.com/file/d/1ZqAKzU-KM4u7YInytcrd_Xn5ns1etdRa/view?usp=sharing
4 out of 180 runs randomly failed on mamba today, all on Windows.
New run results
remote_backoff_factor: 20
remote_connect_timeout_secs: 20.0
remote_max_retries: 10
remote_read_timeout_secs: 60.0
In case of random failure, change from 3 retries over 3 seconds to 10 retries over 200 seconds.
Also double the connection timeout.
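As a hedged aside (the PR itself applies these values in the CI configuration), the same conda settings can be reproduced on a local machine from the command line:

```
conda config --set remote_max_retries 10
conda config --set remote_backoff_factor 20
conda config --set remote_connect_timeout_secs 20.0
conda config --set remote_read_timeout_secs 60.0
```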
distributed/tests/test_actor.py
Outdated
@@ -488,7 +488,7 @@ def check(counter, blanks):
     start = time()
     while a.data or b.data:
         await asyncio.sleep(0.01)
-        assert time() < start + 30
+        assert time() < start + 10, (dict(a.data), dict(b.data))
This test randomly failed twice over 360 runs (the Actor was still in Worker.data after 10 seconds). Why this happened is out of scope here.
What is relevant to this PR, however, is that both times the flaky decorator reran the test 10 times, and it failed 10 times in a row. Why this happens completely baffles me. I've already tested that c, s, a, b are regenerated from scratch at every rerun.
Might be related to #4937
That's a lot of green check marks.
Stress tests completed with 6 failures out of 360 runs:
Detailed test analysis:
This PR is now ready for review and merge. Please read the updated description at the top.
I'm a bit worried about the retry-everything strategy for a few reasons. There is the overall test runtime, in particular for genuinely failing logic, and its impact on the repo as a whole. As already commented, this may be mitigated by global GH Actions timeouts, but the issue will still be there.
What worries me a bit more is that this effectively masks all our issues. I would really prefer us sticking with individually marked flaky reruns, where we can make a judgement call about whether a test is safe to retry or whether a failure might be related to an underlying issue.
Visibility is a big issue here. If there were a bot or something reporting on the number of retries, that might be a compromise.
Do you have numbers about how bad it is with and without this global retry, to get a feeling for what we're talking about here?
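For reference, the individually marked approach being advocated here typically uses the marker from pytest-rerunfailures, the same plugin that provides the `--reruns` flag; a minimal sketch with an illustrative test name and illustrative rerun counts:

```python
import pytest


# Rerun only this specific, known-flaky test instead of retrying the whole suite.
@pytest.mark.flaky(reruns=3, reruns_delay=5)
def test_known_flaky_behaviour():
    ...
```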
.github/workflows/tests.yaml
Outdated
-          pytest distributed -m "not avoid_ci" --runslow
+          pytest distributed -m "not avoid_ci" --runslow \
+            --reruns 10 --reruns-delay 10 \
This concerns me a little bit. Consider an actually broken CI run which fails every test: that would let a single job run for 10 hours (assuming one full test suite run takes about an hour). At the very least we should then shorten the GH Actions timeout so that something like this doesn't block CI for the entire org; see https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
I would much rather stick to the selective rerunning.
Are these reruns marked in the reports generated here, or will we be entirely blind if this happens? (other than parsing logs, of course)
The default GH Actions timeout is 5h. I've now shortened it to 3h, hoping that's enough for when Windows decides to be slow...
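For reference, the cap is set per job with timeout-minutes; a hedged sketch, as the job name and matrix below are illustrative rather than the repo's actual workflow:

```yaml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    # Kill the job after 3 hours instead of relying on the platform default.
    timeout-minutes: 180
```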
> Are these reruns marked in the reports generated here or will we be entirely blind if this happens?
Alas, you have to parse the logs. It's the same as before; it just applies to all tests instead of an ever-increasing selection.
@pytest.mark.asyncio
async def test_restart(cleanup):
    # Regression test for https://github.com/dask/distributed/issues/3062
@gen_test(timeout=60)
The slow marker should probably still be here, shouldn't it? An increased timeout sounds like slow.
The increased timeout is just for exceptional cases on Windows, where the CI frequently finds itself gasping for air. On a typical laptop, this test completes in 1.5s.
distributed/nanny.py
Outdated
@@ -384,7 +384,7 @@ async def instantiate(self, comm=None) -> Status:
             raise
         return result

-    async def restart(self, comm=None, timeout=2, executor_wait=True):
+    async def restart(self, comm=None, timeout=20, executor_wait=True):
While I doubt this will be critical, this changes the behaviour for users. I would expect restart to not be called very frequently in CI. Any reason to increase this in the code itself and not the test?
This timeout is actually ignored by users, as in a typical non-test situation you get:
distributed.comm.timeouts.connect (was 10s; now raised to 20s) [EDIT: 30s]
-> Client.restart without arguments
-> Scheduler.restart
-> multiplied by 0.8 in Scheduler.restart before being passed down to the Nanny
-> Nanny.restart
(A sketch of the resulting arithmetic follows.)
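A minimal sketch of that chain, using only the 0.8 factor and config key described above; the helper name is made up and this is not the actual distributed source:

```python
import dask.config
from dask.utils import parse_timedelta


def effective_nanny_restart_timeout() -> float:
    # Client.restart() with no arguments falls back to the configured
    # comm connect timeout (10s before this PR; 30s after).
    connect = parse_timedelta(
        dask.config.get("distributed.comm.timeouts.connect"), default="s"
    )
    # Scheduler.restart scales the value by 0.8 before passing it to each Nanny,
    # so this is what Nanny.restart actually receives.
    return connect * 0.8
```

With the connect timeout at 30s, roughly 24s reaches Nanny.restart, so the default in its signature mainly matters when the method is called directly, as tests do.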
distributed/distributed.yaml
Outdated
@@ -176,7 +176,7 @@ distributed:
   threads: 0  # Threads to use. 0 for single-threaded, -1 to infer from cpu count.

   timeouts:
-    connect: 10s  # time before connecting fails
+    connect: 20s  # time before connecting fails
There has been a PR open for ages to increase this: #4228
Bumped to 30s
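As a side note (not part of this PR), the same value can also be overridden at runtime without editing distributed.yaml:

```python
import dask

# Equivalent runtime override of the connect timeout raised in the diff above.
dask.config.set({"distributed.comm.timeouts.connect": "30s"})
```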
Aggregated output from the last stress test (360 runs):

3 distributed/cli/tests/test_dask_scheduler.py::test_interface
2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
1 distributed/cli/tests/test_dask_spec.py::test_file
1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
12 distributed/deploy/tests/test_adaptive.py::test_target_duration
1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
79 distributed/tests/test_actor.py::test_compute
32 distributed/tests/test_client_executor.py::test_map
1 distributed/tests/test_client_executor.py::test_shutdown
3 distributed/tests/test_client.py::test_client_timeout
1 distributed/tests/test_client.py::test_performance_report
12 distributed/tests/test_client.py::test_profile
1 distributed/tests/test_client.py::test_secede_balances
1 distributed/tests/test_core.py::test_ticks
5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
8 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
2 distributed/tests/test_nanny.py::test_num_fds
13 distributed/tests/test_scheduler.py::test_memory
1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
77 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
2 distributed/tests/test_variable.py::test_race
10 distributed/tests/test_worker.py::test_statistical_profiling
1 distributed/tests/test_worker.py::test_worker_fds

Some of the above are known offenders. Others I've never seen before; this is after several days of non-stop stress testing without the blanket rerun flag!
Breakdown of retries by OS:

# Linux
$ grep -h RERUN logs_*/*ubuntu*/12_Test.txt | awk '{print $2}' | sort | uniq -c
3 distributed/deploy/tests/test_adaptive.py::test_target_duration
3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
17 distributed/tests/test_actor.py::test_compute
5 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
13 distributed/tests/test_scheduler.py::test_memory
8 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
1 distributed/tests/test_variable.py::test_race
1 distributed/tests/test_worker.py::test_worker_fds
# Windows
$ grep -h RERUN logs_*/*windows*/12_Test.txt | awk '{print $2}' | sort | uniq -c
13 distributed/tests/test_actor.py::test_compute
1 distributed/tests/test_client.py::test_performance_report
12 distributed/tests/test_client.py::test_profile
1 distributed/tests/test_core.py::test_ticks
1 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
35 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
1 distributed/tests/test_variable.py::test_race
3 distributed/tests/test_worker.py::test_statistical_profiling
# MacOS
$ grep -h RERUN logs_*/*macos*/12_Test.txt | awk '{print $2}' | sort | uniq -c
3 distributed/cli/tests/test_dask_scheduler.py::test_interface
2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
1 distributed/cli/tests/test_dask_spec.py::test_file
1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
9 distributed/deploy/tests/test_adaptive.py::test_target_duration
1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
49 distributed/tests/test_actor.py::test_compute
32 distributed/tests/test_client_executor.py::test_map
1 distributed/tests/test_client_executor.py::test_shutdown
3 distributed/tests/test_client.py::test_client_timeout
1 distributed/tests/test_client.py::test_secede_balances
5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
2 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
2 distributed/tests/test_nanny.py::test_num_fds
1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
34 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
7 distributed/tests/test_worker.py::test_statistical_profiling
Great work!
Marked as flaky:
Tweaked; hopefully these will no longer fail:
NOT marked as flaky. The random failures observed in the past WILL show up again, although there is no way to predict whether they will hit the same tests or different ones:
Closes #5045
Supersedes #4228
What changes
gen_test and gen_cluster no longer accept timeout=None (which meant letting pytest-timeout kill everything off)

Known issues
When a test decorated with gen_cluster fails and is rerun, the reruns have been observed to keep failing, even though gen_cluster does regenerate all objects from scratch between retries. For other tests, the flaky decorator works as intended and you'll have 1-2 failures before a success (see the sketch below).
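A minimal sketch of the pattern described in this known issue, assuming pytest-rerunfailures provides the flaky marker; the test body and rerun counts are illustrative, not an actual test in the suite:

```python
import pytest

from distributed.utils_test import gen_cluster


@pytest.mark.flaky(reruns=10, reruns_delay=5)
@gen_cluster(client=True)
async def test_example(c, s, a, b):
    # c (client), s (scheduler), a and b (workers) are rebuilt from scratch on
    # every rerun; yet, as noted above, gen_cluster-based reruns have been
    # observed to keep failing once the first attempt fails.
    future = c.submit(lambda x: x + 1, 1)
    assert await future == 2
```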