Improve CI stability #5022

Merged · 43 commits · Jul 14, 2021

Conversation

@crusaderky (Collaborator) commented on Jul 2, 2021

Closes #5045
Supersedes #4228

What changes

  • Drastically reduced CI failure rates
  • Extensive review of all timeouts
  • Statistical review of flaky tests
  • Generate and publish JUnit reports for later ingestion by your choice of aggregator
  • On Linux and MacOS, pytest-timeout will now kill the individual offending test instead of the whole suite
  • pytest-timeout should kick in far less frequently now
  • gen_test and gen_cluster no longer accept timeout=None (which meant letting pytest-timeout kill everything off); see the sketch after this list
  • Hopefully increased mamba resilience to transient network issues
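
As a rough illustration of the flaky-test and timeout points above (a sketch using the distributed.utils_test and pytest-rerunfailures APIs; the test body, rerun count, and timeout value are illustrative and not taken from this PR):

import pytest

from distributed.utils_test import gen_cluster


# Known offenders get an explicit flaky marker (pytest-rerunfailures)...
@pytest.mark.flaky(reruns=10, reruns_delay=5)
# ...and every test has a finite timeout; timeout=None is no longer accepted.
@gen_cluster(client=True, timeout=60)
async def test_example(c, s, a, b):
    result = await c.submit(lambda x: x + 1, 1)
    assert result == 2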

Known issues

  • The JUnit report does not highlight a flaky test that failed multiple times before finally succeeding. It should, however, be straightforward to write a bot that bulk-fetches and parses the console logs (see the sketch after this list).
  • The console log does not show any information beyond a laconic "RERUN" when a flaky test fails and then succeeds later on.
  • On some tests (marked as xfail), you either get a success on the first go or 10 failures in a row; there is no middle ground. Why this happens completely escapes me. I've already verified that gen_cluster regenerates all objects from scratch between retries. For other tests, the flaky decorator works as intended and you'll see 1-2 failures before a success.
  • Occasionally, mamba fails on Windows with the error message "The process cannot access the file because it is being used by another process". It looks like a straightforward collision in random temporary file names. See upstream issue mamba-org/mamba#1058.
  • Occasionally, mamba fails with network errors. I can't say whether further condarc tweaking or upstream patches can fix the issue, or whether the server actually goes silent for several minutes at a time. I suspect the former, but the logs are unhelpful and the problem doesn't occur frequently enough to shake it down properly.
  • Occasionally, a whole test suite gets stuck for 5 hours and then gets killed. Other times, it gets killed with SIGKILL after a few minutes. Other times still, the whole Docker image goes down. There are no logs whatsoever when this happens and I can see no way to debug it.
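
A minimal sketch of the log-parsing bot mentioned in the first bullet above; it simply replicates, in Python, the grep/awk pipeline shown later in this thread, run against a directory of unpacked GitHub Actions log archives (the "logs" directory and "12_Test.txt" file name are taken from that pipeline and may need adjusting):

from collections import Counter
from pathlib import Path

# Count how many times each test was rerun, mirroring:
#   grep -h RERUN */12_Test.txt | awk '{print $2}' | sort | uniq -c
counts = Counter()
for log in Path("logs").glob("**/12_Test.txt"):
    for line in log.read_text(errors="ignore").splitlines():
        if "RERUN" in line:
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1

for test, n in sorted(counts.items()):
    print(f"{n:7d} {test}")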

@crusaderky (Collaborator, Author):

Initial findings

  • test_progress.py::test_AllProgress_lost_key failed 8 times out of 169, all on Linux
  • test_client_executor.py::test_map failed 8 times out of 169, all on MacOS
  • test_adaptive.py::test_target_duration caused pytest-timeout to kill off the test suite, 10 times out of 179 (0 on Linux, 8 on Windows, 2 on MacOS)
  • test_variable.py::test_race caused pytest-timeout to kill off the test suite, once out of 170
  • 11 assorted other tests failed 15 times in total over 169 runs

Statistical analysis

  • 39 out of 180 test suites (22%) failed.
  • The current chance of a PR getting a green light (all 9 test suites passing at the same time) is ~7% (see the estimate below).
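
(My reading of how this figure is derived, not spelled out in the thread: treat the nine suite outcomes of a CI run as independent and multiply their success rates. Because failures concentrate on particular OSes, the product sits below the uniform-rate approximation.)

$$P(\text{green light}) = \prod_{i=1}^{9} (1 - p_i) \;\le\; (1 - \bar p)^{9} \approx (1 - 0.22)^{9} \approx 0.11$$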

Details

https://drive.google.com/file/d/1ZqAKzU-KM4u7YInytcrd_Xn5ns1etdRa/view?usp=sharing

cc @mrocklin @jrbourbeau

@crusaderky (Collaborator, Author):

4 out of 180 runs randomly failed on mamba today, all on Windows.
None failed on the previous round.

@crusaderky (Collaborator, Author):

New run results

  • 3 times out of 180, mamba randomly crashed
  • 1 time out of 177, the test suite got completely stuck for 6 hours and was not rescued by pytest-timeout
  • 4 times out of 176, test_adaptive.py::test_target_duration caused pytest-timeout to kill off the whole suite. This was previously thought to be fixed.
  • 4 assorted other tests failed once each over 172 runs. Of these, one had already been observed in the previous run (but was not expected to be fixed).
  • 12 out of 180 test suites (7%) failed.
  • The current chance of a PR getting a green light (all 9 test suites passing at the same time) is ~52%.

remote_backoff_factor: 20
remote_connect_timeout_secs: 20.0
remote_max_retries: 10
remote_read_timeout_secs: 60.0
@crusaderky (Collaborator, Author):

In case of random failures, this changes from 3 retries over 3 seconds to 10 retries over 200 seconds. It also doubles the connection timeout.

@@ -488,7 +488,7 @@ def check(counter, blanks):
start = time()
while a.data or b.data:
await asyncio.sleep(0.01)
assert time() < start + 30
assert time() < start + 10, (dict(a.data), dict(b.data))
@crusaderky (Collaborator, Author):

This test randomly failed twice over 360 runs (the Actor was still in Worker.data after 10 seconds). Why this happened is out of scope here.

What is relevant to this PR, however, is that both times the flaky decorator reran the test 10 times, and it failed 10 times in a row. Why this happens completely baffles me. I've already tested that c, s, a, b are regenerated from scratch at every rerun.

Reviewer (Member):

Might be related to #4937

@mrocklin (Member):

That's a lot of green check marks

@crusaderky (Collaborator, Author) commented on Jul 13, 2021:

Stress tests completed with 6 failures out of 360 runs:
https://github.com/dask/distributed/actions/runs/1022407845
https://github.com/crusaderky/distributed/actions/runs/1022407764

Detailed test analysis:
https://drive.google.com/file/d/1ZqAKzU-KM4u7YInytcrd_Xn5ns1etdRa/view?usp=sharing

This PR is now ready for review and merge. Please read the updated description at the top.
CC @mrocklin @fjetter @ian-r-rose @jrbourbeau

crusaderky changed the title from "[DO NOT MERGE] Improve CI stability" to "Improve CI stability" on Jul 13, 2021
crusaderky marked this pull request as ready for review on Jul 13, 2021, 09:02
@fjetter (Member) left a comment:

I'm a bit worried about the retry-everything strategy, for a few reasons. First, there is the overall test runtime, in particular for genuinely failing logic, and its impact on the repo as a whole. As already commented, this may be mitigated by global GH Actions timeouts, but the issue will still be there.

What worries me a bit more is that this effectively masks all of our issues. I would really prefer sticking with individually marked flaky reruns, where we can make a judgement call on whether a test is safe to retry or whether a failure might be related to an underlying issue.

Visibility is a big issue here. If there were a bot or something reporting on the number of retries, that might be a compromise.

Do you have numbers on how bad it is with and without this global retry, to get a feeling for what we're talking about here?

pytest distributed -m "not avoid_ci" --runslow

pytest distributed -m "not avoid_ci" --runslow \
--reruns 10 --reruns-delay 10 \
Reviewer (Member):

This concerns me a little bit. Consider an actually broken CI run that fails every test: that would let a single job run for ~10 hours (assuming one full test suite takes about an hour). At the very least we should then shorten the GH Actions timeout so we don't block CI for the entire org if something like this happens; see https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

I would much rather stick to the selective rerunning.

Are these reruns marked in the reports generated here, or will we be entirely blind if this happens (other than by parsing logs, of course)?

@crusaderky (Collaborator, Author):

The default GH Actions timeout is 5h. I've now shortened it to 3h, hoping that's enough for when Windows decides to be slow...

@crusaderky (Collaborator, Author):

Are these reruns marked in the reports generated here or will we be entirely blind if this happens?

Alas, you have to parse the logs. It's the same as before; it just applies to all tests instead of an ever-increasing selection.

@pytest.mark.asyncio
async def test_restart(cleanup):
# Regression test for https://github.com/dask/distributed/issues/3062
@gen_test(timeout=60)
Reviewer (Member):

The slow marker should probably still be here, shouldn't it? An increased timeout sounds like slow.

@crusaderky (Collaborator, Author):

The increased timeout is just for exceptional cases on Windows, where CI frequently finds itself gasping for air. On a typical laptop, this test completes in 1.5s.

@@ -384,7 +384,7 @@ async def instantiate(self, comm=None) -> Status:
raise
return result

async def restart(self, comm=None, timeout=2, executor_wait=True):
async def restart(self, comm=None, timeout=20, executor_wait=True):
Reviewer (Member):

While I doubt this will be critical, this changes the behaviour for users. I would expect restart to not be called very frequently in CI. Any reason to increase this in the code itself and not the test?

@crusaderky (Collaborator, Author) commented on Jul 13, 2021:

This timeout is effectively ignored by users: in a typical non-test situation the value actually used comes from the chain below (sketched in code right after):

distributed.comm.timeouts.connect (was 10s; now raised to 20s) [EDIT: 30s]
-> Client.restart without arguments
-> Scheduler.restart
-> multiplied by 0.8 in Scheduler.restart before being passed down to Nanny
-> Nanny.restart
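
For reference, a small sketch of that chain in code (my reading of the quote above; the config key and the 0.8 factor come from the discussion, the rest is illustrative):

import dask
import distributed  # noqa: F401  -- registers the distributed config defaults
from dask.utils import parse_timedelta

# distributed.comm.timeouts.connect, e.g. "30s" after this PR
connect = parse_timedelta(dask.config.get("distributed.comm.timeouts.connect"))

# Client.restart() -> Scheduler.restart() scales the timeout by 0.8 before
# handing it to Nanny.restart(), so the hard-coded default of
# Nanny.restart(timeout=20) is effectively bypassed in normal usage.
nanny_restart_timeout = connect * 0.8
print(nanny_restart_timeout)  # 24.0 with a 30s connect timeout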

@@ -176,7 +176,7 @@ distributed:
threads: 0 # Threads to use. 0 for single-threaded, -1 to infer from cpu count.

timeouts:
connect: 10s # time before connecting fails
connect: 20s # time before connecting fails
Reviewer (Member):

There has been a PR open for ages to increase this: #4228

@crusaderky (Collaborator, Author):

Bumped to 30s
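
(For completeness, not part of this PR: the same setting can also be overridden per process with standard dask config calls, equivalent to editing distributed.yaml as in the diff above.)

import dask

# Raise the comm connect timeout for this process only; "30s" matches the
# value this PR eventually settled on.
dask.config.set({"distributed.comm.timeouts.connect": "30s"})
print(dask.config.get("distributed.comm.timeouts.connect"))  # "30s"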

@crusaderky (Collaborator, Author):

Do you have numbers on how bad it is with and without this global retry, to get a feeling for what we're talking about here?

  1. click on "Details" on any of the tests above
  2. click on top-right gear icon
  3. click on "Download log archive"
  4. unpack zip file
  5. grep -h RERUN */12_Test.txt | awk '{print $2}' | sort | uniq -c

Aggregated output from the last stress test (360 runs):

      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
     12 distributed/deploy/tests/test_adaptive.py::test_target_duration
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
      3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
      9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
     79 distributed/tests/test_actor.py::test_compute
     32 distributed/tests/test_client_executor.py::test_map
      1 distributed/tests/test_client_executor.py::test_shutdown
      3 distributed/tests/test_client.py::test_client_timeout
      1 distributed/tests/test_client.py::test_performance_report
     12 distributed/tests/test_client.py::test_profile
      1 distributed/tests/test_client.py::test_secede_balances
      1 distributed/tests/test_core.py::test_ticks
      5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      8 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
     13 distributed/tests/test_scheduler.py::test_memory
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
     77 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      2 distributed/tests/test_variable.py::test_race
     10 distributed/tests/test_worker.py::test_statistical_profiling
      1 distributed/tests/test_worker.py::test_worker_fds

Some of the above are known offenders. Others I've never seen before; this is after several days of non-stop stress testing without the blanket rerun flag!

@crusaderky (Collaborator, Author):

Breakdown of retries by OS:

# Linux
$ grep -h RERUN logs_*/*ubuntu*/12_Test.txt | awk '{print $2}' | sort | uniq -c
      3 distributed/deploy/tests/test_adaptive.py::test_target_duration
      3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
      9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
     17 distributed/tests/test_actor.py::test_compute
      5 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
     13 distributed/tests/test_scheduler.py::test_memory
      8 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      1 distributed/tests/test_variable.py::test_race
      1 distributed/tests/test_worker.py::test_worker_fds

# Windows
$ grep -h RERUN logs_*/*windows*/12_Test.txt | awk '{print $2}' | sort | uniq -c
     13 distributed/tests/test_actor.py::test_compute
      1 distributed/tests/test_client.py::test_performance_report
     12 distributed/tests/test_client.py::test_profile
      1 distributed/tests/test_core.py::test_ticks
      1 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
     35 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      1 distributed/tests/test_variable.py::test_race
      3 distributed/tests/test_worker.py::test_statistical_profiling

# MacOS
$ grep -h RERUN logs_*/*macos*/12_Test.txt | awk '{print $2}' | sort | uniq -c
      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
      9 distributed/deploy/tests/test_adaptive.py::test_target_duration
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
     49 distributed/tests/test_actor.py::test_compute
     32 distributed/tests/test_client_executor.py::test_map
      1 distributed/tests/test_client_executor.py::test_shutdown
      3 distributed/tests/test_client.py::test_client_timeout
      1 distributed/tests/test_client.py::test_secede_balances
      5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      2 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
     34 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      7 distributed/tests/test_worker.py::test_statistical_profiling

@fjetter (Member) left a comment:

Great work!

@crusaderky (Collaborator, Author):

Marked as flaky:

         12 distributed/deploy/tests/test_adaptive.py::test_target_duration
          3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
          9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
         79 distributed/tests/test_actor.py::test_compute
         32 distributed/tests/test_client_executor.py::test_map
         12 distributed/tests/test_client.py::test_profile
          5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
          8 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
         13 distributed/tests/test_scheduler.py::test_memory
         77 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
         10 distributed/tests/test_worker.py::test_statistical_profiling

Tweaked; hopefully these will no longer fail:

      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      3 distributed/tests/test_client.py::test_client_timeout

NOT marked as flaky. Random failures like those observed in the past WILL show up again, though there is no way to predict whether they will hit the same tests or different ones:

      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
      1 distributed/tests/test_client_executor.py::test_shutdown
      1 distributed/tests/test_client.py::test_performance_report
      1 distributed/tests/test_client.py::test_secede_balances
      1 distributed/tests/test_core.py::test_ticks
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
      2 distributed/tests/test_variable.py::test_race
      1 distributed/tests/test_worker.py::test_worker_fds

crusaderky closed this on Jul 14, 2021
crusaderky reopened this on Jul 14, 2021
Merging this pull request may close the following issue:

distributed/tests/test_scheduler.py::test_memory failing on main