
Partition community build for faster CI #10030


Merged: 1 commit merged on Nov 19, 2020


@griggt (Contributor) commented Oct 17, 2020

Partition members were chosen with the goal of minimizing the running time of the longest-running partition.
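One way to pursue that goal automatically is the greedy longest-processing-time-first (LPT) heuristic, sketched here in Python. This is an illustration only, not necessarily how the partitions in this PR were chosen; `partition_lpt` is a hypothetical helper, and the project names and durations are made-up placeholders, not measured community build timings.

```python
import heapq

def partition_lpt(durations, k=2):
    """Greedy LPT heuristic: place each project, longest first, into the
    partition with the smallest total running time so far.

    durations: dict mapping project name -> estimated running time (minutes).
    Returns (partitions, loads): project names per partition and each
    partition's total running time.
    """
    heap = [(0, i) for i in range(k)]  # (current load, partition index)
    partitions = [[] for _ in range(k)]
    # Longest first; ties broken by name so the result is deterministic.
    for name, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)  # currently least-loaded partition
        partitions[i].append(name)
        heapq.heappush(heap, (load + minutes, i))
    loads = [0] * k
    for load, i in heap:
        loads[i] = load
    return partitions, loads

# Hypothetical projects and running times (minutes), for illustration only.
durations = {"p1": 9, "p2": 7, "p3": 6, "p4": 5, "p5": 4, "p6": 3}
partitions, loads = partition_lpt(durations)
print(partitions)  # [['p1', 'p4', 'p6'], ['p2', 'p3', 'p5']]
print(loads)       # [17, 17]
print(max(loads))  # running time of the longest partition: 17
```

LPT is simple and usually well balanced, but it does not guarantee the smallest possible longest-partition time; with only a handful of projects, an exhaustive search over two-way splits is a feasible alternative.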

If this gets merged, the required status checks for pull requests will need updating, as the job named community_build will no longer exist.

Results of some test runs:

The projects in the community build can be split so that testing each partition takes roughly the same time (23-27 minutes). Moving a few short-running projects between the two partitions makes only a noise-level difference.

The real issue preventing big wins here is the wide variance in the running time of the GitHub actions/cache steps (e.g. Cache Ivy, Cache SBT, etc.) embedded in each job. These steps can take anywhere from a few seconds to 20 minutes or more, and their running time seems unpredictable. This issue is not specific to this PR; I have noticed it on other PRs and merges as well. Occasionally these long-running cache restoration steps display a warning/error message such as:

gzip: stdin: unexpected end of file
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Unexpected EOF in archive
/usr/bin/tar: Error is not recoverable: exiting now
Warning: Tar failed with error: The process '/usr/bin/tar' failed with exit code 2

but do not actually set a failure status or abort the job.

Curiously, actions/cache is not supported for scheduled events (it reports: Warning: Event Validation Error: The event type schedule is not supported. Only push, pull_request events are supported at this time.), so those steps do nothing during the nightly scheduled CI.

These cache management steps are, counterintuitively, causing a big performance hit, and the fact that they are not used at all in the nightly run (or in the new Windows CI jobs) suggests they may not be necessary. In my last few test runs I modified ci.yaml to remove them and saw much better results: a full PR/merge CI run in under 30 minutes (see runs 5516, 5517, and 5518).

The table below summarizes the running time of each test run. The Test A and Test B columns give the timings for only the Test step of the corresponding community_build_x job, and the Cache A and Cache B columns give the additional time that job spent on Cache steps. The Workflow column is the running time of the entire workflow and is given only for runs where all jobs were enabled (for the first seven rows of the table, the workflow ran only the new community build jobs, with varying choices of partition members).

| CI # | Test A | Test B | Cache A | Cache B | Workflow | Notes |
|---|---|---|---|---|---|---|
| **Only community_build_x jobs enabled:** | | | | | | |
| 5507 | 26:29 | 21:33 | 5:09 | 20:11 | | Original partition selection |
| 5509 | 27:50 | 21:29 | 4:08 | 20:26 | | (as above) |
| 5510 | 23:16 | 25:02 | 2:33 | 4:59 | | Move algebra, fastparse, ScalaPB to partition B |
| 5511 | Fail* | 23:42 | 2:19 | 2:21 | | (as above) |
| 5512 | 23:20 | 27:33 | 29:19 | 1:33 | | (as above) |
| 5513 | 22:50 | 26:53 | 3:49 | 1:43 | | Move fastparse back to partition A |
| 5514 | 26:08 | 23:03 | 16:54 | 5:00 | | (as above) |
| **All normal CI jobs enabled:** | | | | | | |
| 5515 | 22:58 | 26:22 | 7:33 | 27:09 | 1:00:14 | |
| **All normal CI jobs enabled; steps using actions/cache removed:** | | | | | | |
| 5516 | 23:15 | 29:04 | - | - | 29:42 | |
| 5517 | 27:36 | 23:18 | - | - | 28:20 | |
| 5518 | 26:05 | 26:29 | - | - | 27:08 | |

* Spurious failure
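To see how much the cache steps add per job, the Test and Cache columns can be summed. The Python sketch below does this for three runs from the table; the helper names (`to_seconds`, `fmt`, `job_totals`) are hypothetical, and other job steps (checkout, setup, etc.) are ignored, so the results are lower bounds on each job's running time rather than measured job durations.

```python
def to_seconds(t):
    """Parse an 'm:ss' or 'h:mm:ss' duration string into seconds."""
    secs = 0
    for part in t.split(":"):
        secs = secs * 60 + int(part)
    return secs

def fmt(secs):
    """Format seconds as 'm:ss', or 'h:mm:ss' when an hour or more."""
    m, s = divmod(secs, 60)
    h, m = divmod(m, 60)
    return f"{h}:{m:02d}:{s:02d}" if h else f"{m}:{s:02d}"

def job_totals(test_a, test_b, cache_a, cache_b):
    """Test + Cache time (seconds) for community_build_a and community_build_b."""
    return (to_seconds(test_a) + to_seconds(cache_a),
            to_seconds(test_b) + to_seconds(cache_b))

# (CI #, Test A, Test B, Cache A, Cache B), copied from the table
runs = [
    ("5507", "26:29", "21:33", "5:09", "20:11"),
    ("5512", "23:20", "27:33", "29:19", "1:33"),
    ("5515", "22:58", "26:22", "7:33", "27:09"),
]

for ci, ta, tb, ca, cb in runs:
    a, b = job_totals(ta, tb, ca, cb)
    print(ci, fmt(a), fmt(b))
# 5507 31:38 41:44
# 5512 52:39 29:06
# 5515 30:31 53:31
```

For run 5515, job B reaches about 53:31 in Test plus Cache alone, 27:09 of which is cache steps. This is consistent with that workflow taking 1:00:14, compared with under 30 minutes once the cache steps were removed (runs 5516 to 5518).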

I am reluctant to recommend removing all the actions/cache steps without a better understanding of what's going on and some idea of what benefits, if any, they bring. I also notice that the most recent available version of the action is v2.1.2, which may fix some of these issues.

EDIT: cache issues have been fixed by #10197.

Fixes #9599

@dottybot (Member) left a comment

Hello, and thank you for opening this PR! 🎉

All contributors have signed the CLA, thank you! ❤️

Have an awesome day! ☀️

@griggt force-pushed the partition-community-build branch 6 times, most recently from ab3dcf1 to d0a0348, October 18, 2020 01:29
@griggt marked this pull request as ready for review October 18, 2020 04:27
@griggt force-pushed the partition-community-build branch from d0a0348 to a93a65a, October 27, 2020 07:25
@smarter (Member) left a comment

EDIT: oops, wrong PR


@smarter dismissed their stale review October 27, 2020 18:34

wrong PR

@griggt (Contributor, Author) commented Nov 6, 2020

Will put this on hold while the caching issues are investigated in #10197.

@griggt marked this pull request as draft November 6, 2020 00:35
@griggt force-pushed the partition-community-build branch from a93a65a to 5b29736, November 11, 2020 04:24
@griggt (Contributor, Author) commented Nov 11, 2020

Now that #10197 has been merged, this has been rebased and review comments addressed.

@griggt marked this pull request as ready for review November 11, 2020 05:41
@anatoliykmetyuk merged commit f3bb4e8 into scala:master Nov 19, 2020
@griggt deleted the partition-community-build branch November 21, 2020 19:37

Successfully merging this pull request may close these issues.

Split the community build into two jobs on the CI to speed it up?
4 participants