Improve CI stability #5022

Merged · 43 commits · Jul 14, 2021

Conversation

@crusaderky (Collaborator) commented on Jul 2, 2021

Closes #5045
Supersedes #4228

What changes

  • Drastically reduced CI failure rates
  • Extensive review of all timeouts
  • Statistical review of flaky tests
  • Generate and publish JUnit reports for later ingestion by your choice of aggregator
  • On Linux and MacOS, pytest-timeout will now kill the individual offending test instead of the whole suite
  • pytest-timeout should kick in far less frequently now
  • gen_test and gen_cluster no longer accept timeout=None (which meant letting pytest-timeout kill everything off); see the sketch after this list
  • Hopefully increased mamba resilience to transient network issues
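
As a rough illustration of the flaky-test and timeout points above (a sketch using the distributed.utils_test and pytest-rerunfailures APIs; the test body, rerun count, and timeout value are illustrative and not taken from this PR):

import pytest

from distributed.utils_test import gen_cluster


# Known offenders get an explicit flaky marker (pytest-rerunfailures)...
@pytest.mark.flaky(reruns=10, reruns_delay=5)
# ...and every test has a finite timeout; timeout=None is no longer accepted.
@gen_cluster(client=True, timeout=60)
async def test_example(c, s, a, b):
    result = await c.submit(lambda x: x + 1, 1)
    assert result == 2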

Known issues

  • The JUnit report does not highlight a flaky test that failed multiple times before finally succeeding. It should, however, be straightforward to write a bot that bulk-fetches and parses the console logs (see the sketch after this list).
  • The console log does not show any information beyond a laconic "RERUN" when a flaky test fails and then succeeds later on.
  • On some tests (marked as xfail), you either get a success on the first go or 10 failures in a row; there is no middle ground. Why this happens completely escapes me. I've already verified that gen_cluster regenerates all objects from scratch between retries. For other tests, the flaky decorator works as intended and you'll see 1-2 failures before a success.
  • Occasionally, mamba fails on Windows with the error message "The process cannot access the file because it is being used by another process". It looks like a straightforward collision in random temporary file names. See upstream issue mamba-org/mamba#1058.
  • Occasionally, mamba fails with network errors. I can't say whether further condarc tweaking or upstream patches can fix the issue, or whether the server actually goes silent for several minutes at a time. I suspect the former, but the logs are unhelpful and the problem doesn't occur frequently enough to shake it down properly.
  • Occasionally, a whole test suite gets stuck for 5 hours and then gets killed. Other times, it gets killed with SIGKILL after a few minutes. Other times still, the whole Docker image goes down. There are no logs whatsoever when this happens and I can see no way to debug it.
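
A minimal sketch of the log-parsing bot mentioned in the first bullet above; it simply replicates, in Python, the grep/awk pipeline shown later in this thread, run against a directory of unpacked GitHub Actions log archives (the "logs" directory and "12_Test.txt" file name are taken from that pipeline and may need adjusting):

from collections import Counter
from pathlib import Path

# Count how many times each test was rerun, mirroring:
#   grep -h RERUN */12_Test.txt | awk '{print $2}' | sort | uniq -c
counts = Counter()
for log in Path("logs").glob("**/12_Test.txt"):
    for line in log.read_text(errors="ignore").splitlines():
        if "RERUN" in line:
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1

for test, n in sorted(counts.items()):
    print(f"{n:7d} {test}")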

@crusaderky (Collaborator, Author):

Initial findings

  • test_progress.py::test_AllProgress_lost_key failed 8 times out of 169, all on Linux
  • test_client_executor.py::test_map failed 8 times out of 169, all on MacOS
  • test_adaptive.py::test_target_duration caused pytest-timeout to kill off the test suite, 10 times out of 179 (0 on Linux, 8 on Windows, 2 on MacOS)
  • test_variable.py::test_race caused pytest-timeout to kill off the test suite, once out of 170
  • 11 assorted other tests failed 15 times in total over 169 runs

Statistical analysis

  • 39 out of 180 test suites (22%) failed.
  • The current chance of a PR getting a green light (all 9 test suites passing at the same time) is ~7% (see the estimate below).
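
(My reading of how this figure is derived, not spelled out in the thread: treat the nine suite outcomes of a CI run as independent and multiply their success rates. Because failures concentrate on particular OSes, the product sits below the uniform-rate approximation.)

$$P(\text{green light}) = \prod_{i=1}^{9} (1 - p_i) \;\le\; (1 - \bar p)^{9} \approx (1 - 0.22)^{9} \approx 0.11$$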

Details

https://drive.google.com/file/d/1ZqAKzU-KM4u7YInytcrd_Xn5ns1etdRa/view?usp=sharing

cc @mrocklin @jrbourbeau

@crusaderky (Collaborator, Author):

4 out of 180 runs randomly failed on mamba today, all on Windows.
None failed on the previous round.

@crusaderky (Collaborator, Author):

New run results

  • 3 times out of 180, mamba randomly crashed
  • 1 time out of 177, the test suite got completely stuck for 6 hours and was not rescued by pytest-timeout
  • 4 times out of 176, test_adaptive.py::test_target_duration caused pytest-timeout to kill off the whole suite. This was previously thought to be fixed.
  • 4 assorted other tests failed once each over 172 runs. Of these, one had already been observed in the previous run (but was not expected to be fixed).
  • 12 out of 180 test suites (7%) failed.
  • The current chance of a PR getting a green light (all 9 test suites passing at the same time) is ~52%.

remote_backoff_factor: 20
remote_connect_timeout_secs: 20.0
remote_max_retries: 10
remote_read_timeout_secs: 60.0
@crusaderky (Collaborator, Author):

In case of random failures, this changes from 3 retries over 3 seconds to 10 retries over 200 seconds. It also doubles the connection timeout.

@@ -488,7 +488,7 @@ def check(counter, blanks):
start = time()
while a.data or b.data:
await asyncio.sleep(0.01)
assert time() < start + 30
assert time() < start + 10, (dict(a.data), dict(b.data))
@crusaderky (Collaborator, Author):

This test randomly failed twice over 360 runs (the Actor was still in Worker.data after 10 seconds). Why this happened is out of scope here.

What is relevant to this PR, however, is that both times the flaky decorator reran the test 10 times, and it failed 10 times in a row. Why this happens completely baffles me. I've already tested that c, s, a, b are regenerated from scratch at every rerun.

Reviewer (Member):

Might be related to #4937

@mrocklin (Member):

That's a lot of green check marks

@crusaderky (Collaborator, Author) commented on Jul 13, 2021:

Stress tests completed with 6 failures out of 360 runs:
https://github.com/dask/distributed/actions/runs/1022407845
https://github.com/crusaderky/distributed/actions/runs/1022407764

Detailed test analysis:
https://drive.google.com/file/d/1ZqAKzU-KM4u7YInytcrd_Xn5ns1etdRa/view?usp=sharing

This PR is now ready for review and merge. Please read the updated description at the top.
CC @mrocklin @fjetter @ian-r-rose @jrbourbeau

crusaderky changed the title from "[DO NOT MERGE] Improve CI stability" to "Improve CI stability" on Jul 13, 2021
crusaderky marked this pull request as ready for review on Jul 13, 2021, 09:02
@fjetter (Member) left a comment:

I'm a bit worried about the retry-everything strategy, for a few reasons. First, there is the overall test runtime, in particular for genuinely failing logic, and its impact on the repo as a whole. As already commented, this may be mitigated by global GH Actions timeouts, but the issue will still be there.

What worries me a bit more is that this effectively masks all of our issues. I would really prefer sticking with individually marked flaky reruns, where we can make a judgement call on whether a test is safe to retry or whether a failure might be related to an underlying issue.

Visibility is a big issue here. If there were a bot or something reporting on the number of retries, that might be a compromise.

Do you have numbers on how bad it is with and without this global retry, to get a feeling for what we're talking about here?

pytest distributed -m "not avoid_ci" --runslow

pytest distributed -m "not avoid_ci" --runslow \
--reruns 10 --reruns-delay 10 \
Reviewer (Member):

This concerns me a little bit. Consider an actually broken CI run that fails every test: that would let a single job run for ~10 hours (assuming one full test suite takes about an hour). At the very least we should then shorten the GH Actions timeout so we don't block CI for the entire org if something like this happens; see https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes

I would much rather stick to the selective rerunning.

Are these reruns marked in the reports generated here, or will we be entirely blind if this happens (other than by parsing logs, of course)?

@crusaderky (Collaborator, Author):

The default GH Actions timeout is 5h. I've now shortened it to 3h, hoping that's enough for when Windows decides to be slow...

@crusaderky (Collaborator, Author):

Are these reruns marked in the reports generated here or will we be entirely blind if this happens?

Alas, you have to parse the logs. It's the same as before; it just applies to all tests instead of an ever-increasing selection.

@pytest.mark.asyncio
async def test_restart(cleanup):
# Regression test for https://github.com/dask/distributed/issues/3062
@gen_test(timeout=60)
Reviewer (Member):

The slow marker should probably still be here, shouldn't it? An increased timeout sounds like slow.

@crusaderky (Collaborator, Author):

The increased timeout is just for exceptional cases on Windows, where CI frequently finds itself gasping for air. On a typical laptop, this test completes in 1.5s.

@@ -384,7 +384,7 @@ async def instantiate(self, comm=None) -> Status:
raise
return result

async def restart(self, comm=None, timeout=2, executor_wait=True):
async def restart(self, comm=None, timeout=20, executor_wait=True):
Reviewer (Member):

While I doubt this will be critical, this changes the behaviour for users. I would expect restart to not be called very frequently in CI. Any reason to increase this in the code itself and not the test?

@crusaderky (Collaborator, Author) commented on Jul 13, 2021:

This timeout is effectively ignored by users: in a typical non-test situation the value actually used comes from the chain below (sketched in code right after):

distributed.comm.timeouts.connect (was 10s; now raised to 20s) [EDIT: 30s]
-> Client.restart without arguments
-> Scheduler.restart
-> multiplied by 0.8 in Scheduler.restart before being passed down to Nanny
-> Nanny.restart
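
For reference, a small sketch of that chain in code (my reading of the quote above; the config key and the 0.8 factor come from the discussion, the rest is illustrative):

import dask
import distributed  # noqa: F401  -- registers the distributed config defaults
from dask.utils import parse_timedelta

# distributed.comm.timeouts.connect, e.g. "30s" after this PR
connect = parse_timedelta(dask.config.get("distributed.comm.timeouts.connect"))

# Client.restart() -> Scheduler.restart() scales the timeout by 0.8 before
# handing it to Nanny.restart(), so the hard-coded default of
# Nanny.restart(timeout=20) is effectively bypassed in normal usage.
nanny_restart_timeout = connect * 0.8
print(nanny_restart_timeout)  # 24.0 with a 30s connect timeout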

@@ -176,7 +176,7 @@ distributed:
threads: 0 # Threads to use. 0 for single-threaded, -1 to infer from cpu count.

timeouts:
connect: 10s # time before connecting fails
connect: 20s # time before connecting fails
Reviewer (Member):

There has been a PR open for ages to increase this: #4228

@crusaderky (Collaborator, Author):

Bumped to 30s
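
(For completeness, not part of this PR: the same setting can also be overridden per process with standard dask config calls, equivalent to editing distributed.yaml as in the diff above.)

import dask

# Raise the comm connect timeout for this process only; "30s" matches the
# value this PR eventually settled on.
dask.config.set({"distributed.comm.timeouts.connect": "30s"})
print(dask.config.get("distributed.comm.timeouts.connect"))  # "30s"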

@crusaderky (Collaborator, Author):

Do you have numbers on how bad it is with and without this global retry, to get a feeling for what we're talking about here?

  1. click on "Details" on any of the tests above
  2. click on top-right gear icon
  3. click on "Download log archive"
  4. unpack zip file
  5. grep -h RERUN */12_Test.txt | awk '{print $2}' | sort | uniq -c

Aggregated output from the last stress test (360 runs):

      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
     12 distributed/deploy/tests/test_adaptive.py::test_target_duration
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
      3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
      9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
     79 distributed/tests/test_actor.py::test_compute
     32 distributed/tests/test_client_executor.py::test_map
      1 distributed/tests/test_client_executor.py::test_shutdown
      3 distributed/tests/test_client.py::test_client_timeout
      1 distributed/tests/test_client.py::test_performance_report
     12 distributed/tests/test_client.py::test_profile
      1 distributed/tests/test_client.py::test_secede_balances
      1 distributed/tests/test_core.py::test_ticks
      5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      8 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
     13 distributed/tests/test_scheduler.py::test_memory
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
     77 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      2 distributed/tests/test_variable.py::test_race
     10 distributed/tests/test_worker.py::test_statistical_profiling
      1 distributed/tests/test_worker.py::test_worker_fds

Some of the above are known offenders. Others I've never seen before; this is after several days of non-stop stress testing without the blanket rerun flag!

@crusaderky (Collaborator, Author):

Breakdown of retries by OS:

# Linux
$ grep -h RERUN logs_*/*ubuntu*/12_Test.txt | awk '{print $2}' | sort | uniq -c
      3 distributed/deploy/tests/test_adaptive.py::test_target_duration
      3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
      9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
     17 distributed/tests/test_actor.py::test_compute
      5 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
     13 distributed/tests/test_scheduler.py::test_memory
      8 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      1 distributed/tests/test_variable.py::test_race
      1 distributed/tests/test_worker.py::test_worker_fds

# Windows
$ grep -h RERUN logs_*/*windows*/12_Test.txt | awk '{print $2}' | sort | uniq -c
     13 distributed/tests/test_actor.py::test_compute
      1 distributed/tests/test_client.py::test_performance_report
     12 distributed/tests/test_client.py::test_profile
      1 distributed/tests/test_core.py::test_ticks
      1 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
     35 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      1 distributed/tests/test_variable.py::test_race
      3 distributed/tests/test_worker.py::test_statistical_profiling

# MacOS
$ grep -h RERUN logs_*/*macos*/12_Test.txt | awk '{print $2}' | sort | uniq -c
      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
      9 distributed/deploy/tests/test_adaptive.py::test_target_duration
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
     49 distributed/tests/test_actor.py::test_compute
     32 distributed/tests/test_client_executor.py::test_map
      1 distributed/tests/test_client_executor.py::test_shutdown
      3 distributed/tests/test_client.py::test_client_timeout
      1 distributed/tests/test_client.py::test_secede_balances
      5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      2 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
     34 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
      7 distributed/tests/test_worker.py::test_statistical_profiling

@fjetter (Member) left a comment:

Great work!

@crusaderky (Collaborator, Author):

Marked as flaky:

         12 distributed/deploy/tests/test_adaptive.py::test_target_duration
          3 distributed/diagnostics/tests/test_progress.py::test_AllProgress
          9 distributed/diagnostics/tests/test_progress.py::test_AllProgress_lost_key
         79 distributed/tests/test_actor.py::test_compute
         32 distributed/tests/test_client_executor.py::test_map
         12 distributed/tests/test_client.py::test_profile
          5 distributed/tests/test_failed_workers.py::test_broken_worker_during_computation
          8 distributed/tests/test_failed_workers.py::test_worker_who_has_clears_after_failed_connection
         13 distributed/tests/test_scheduler.py::test_memory
         77 distributed/tests/test_steal.py::test_dont_steal_unknown_functions
         10 distributed/tests/test_worker.py::test_statistical_profiling

Tweaked; hopefully these will no longer fail:

      2 distributed/cli/tests/test_dask_scheduler.py::test_pid_file
      3 distributed/cli/tests/test_dask_scheduler.py::test_interface
      3 distributed/tests/test_client.py::test_client_timeout

NOT marked as flaky. Random failures like those observed in the past WILL show up again, though there is no way to predict whether they will hit the same tests or different ones:

      1 distributed/cli/tests/test_dask_spec.py::test_file
      1 distributed/cli/tests/test_dask_worker.py::test_nanny_worker_port_range
      1 distributed/cli/tests/test_dask_worker.py::test_nprocs_negative
      1 distributed/cli/tests/test_tls_cli.py::test_use_config_file
      1 distributed/dashboard/tests/test_scheduler_bokeh.py::test_task_stream
      2 distributed/deploy/tests/test_adaptive.py::test_adaptive_local_cluster_multi_workers
      1 distributed/deploy/tests/test_slow_adaptive.py::test_startup
      1 distributed/tests/test_client_executor.py::test_shutdown
      1 distributed/tests/test_client.py::test_performance_report
      1 distributed/tests/test_client.py::test_secede_balances
      1 distributed/tests/test_core.py::test_ticks
      1 distributed/tests/test_failed_workers.py::test_forgotten_futures_dont_clean_up_new_futures
      1 distributed/tests/test_ipython.py::test_start_ipython_scheduler
      2 distributed/tests/test_nanny.py::test_num_fds
      1 distributed/tests/test_scheduler.py::test_rebalance_least_recently_inserted_sender_min
      1 distributed/tests/test_semaphore.py::test_oversubscribing_leases
      2 distributed/tests/test_variable.py::test_race
      1 distributed/tests/test_worker.py::test_worker_fds

crusaderky closed this on Jul 14, 2021
crusaderky reopened this on Jul 14, 2021
Merging this pull request may close the following issue:

distributed/tests/test_scheduler.py::test_memory failing on main