Partition community build for faster CI #10030
Conversation
Hello, and thank you for opening this PR! 🎉
All contributors have signed the CLA, thank you! ❤️
Have an awesome day! ☀️
Force-pushed from ab3dcf1 to d0a0348
Force-pushed from d0a0348 to a93a65a
EDIT: oops, wrong PR
Will put this on hold while the caching issues are investigated in #10197.
Force-pushed from a93a65a to 5b29736
Now that #10197 has been merged, this has been rebased and review comments addressed.
Partition member selection is driven by the goal of minimizing the running time of the longest-running partition.
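That selection goal is essentially makespan minimization over two partitions. A minimal sketch of one common heuristic for it, greedy longest-processing-time-first assignment, is below; the project names and per-project timings are illustrative placeholders, not the actual community-build data or the method this PR necessarily uses.

```python
# Hypothetical sketch: assign each project (longest first) to whichever
# partition currently has the smaller running total. This greedy LPT
# heuristic tends to keep the longest-running partition short.
# Project names and timings are made up for illustration.

def partition(projects):
    """projects: dict of project name -> estimated test time in minutes.
    Returns (two lists of names, their total times)."""
    parts = [[], []]
    totals = [0.0, 0.0]
    # Longest first: large items placed early are easier to balance around.
    for name, minutes in sorted(projects.items(), key=lambda kv: -kv[1]):
        i = 0 if totals[0] <= totals[1] else 1  # pick the lighter partition
        parts[i].append(name)
        totals[i] += minutes
    return parts, totals

if __name__ == "__main__":
    times = {"projA": 9.0, "projB": 7.5, "projC": 6.0,
             "projD": 4.0, "projE": 3.5}
    parts, totals = partition(times)
    print(parts, totals)
```

With the sample timings above, the two partitions come out at 16.5 and 13.5 minutes, so the makespan is driven by the heavier one.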
If this gets merged, the required status checks for pull requests will need updating, as the job named `community_build` will no longer exist.

Results of some test runs:
The projects in the community build can be partitioned nicely such that the testing of each partition has approximately the same running time (23-27 minutes). Moving some short-running projects between the two partitions is mostly a noise-level change.
The real issue preventing big wins here is the wide variance in the running time of the GitHub `actions/[email protected]` steps (e.g. `Cache Ivy`, `Cache SBT`, etc.) that are embedded in each job. These steps can run for as little as a few seconds or for 20 minutes or more, and their running time seems to be unpredictable. This issue is not particular to this PR; I have noticed it happening on other PRs and merges. Occasionally these long-running cache restoration steps will display a warning/error message, but they do not actually set a failure status or abort the job.
Curiously, `actions/[email protected]` is not supported for scheduled events (the log says: `Warning: Event Validation Error: The event type schedule is not supported. Only push, pull_request events are supported at this time.`), and those steps do not perform any action during the nightly scheduled CI. These cache management steps are, counterintuitively, causing a big performance hit, and the fact that they are not used at all in the nightly run (nor in the new Windows CI jobs) suggests they may not be necessary. In my last few test runs I modified `ci.yaml` to remove them, and saw much better results: a full PR/merge CI run in under 30 minutes (see runs 5516, 5517, 5518).

The table below summarizes the running time of each test run. The `Test A` and `Test B` columns are the timings for only the `Test` step of the corresponding community_build_x job, and the `Cache A` and `Cache B` columns indicate the additional time that job spent on `Cache` steps. The `Workflow` column is the running time for the entire workflow, and is only given for those runs where all jobs were enabled (for the first seven rows of the table, the workflow only ran the new community build jobs, with varying choices for the partition members).

* Spurious failure
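Since the cache action rejects the `schedule` event anyway, an alternative to removing the steps outright would be to skip them on scheduled runs via an `if` condition. A hypothetical workflow fragment, for illustration only (the step name, paths, keys, and action version are assumptions, not taken from the actual `ci.yaml`):

```yaml
# Illustrative sketch: skip cache restoration on nightly scheduled runs,
# where the action reports "event type schedule is not supported".
- name: Cache Ivy
  if: github.event_name != 'schedule'
  uses: actions/cache@v2
  with:
    path: ~/.ivy2/cache
    key: ${{ runner.os }}-ivy-${{ hashFiles('**/build.sbt') }}
    restore-keys: ${{ runner.os }}-ivy-
```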
I am cautiously reluctant to recommend removing all the `actions/[email protected]` steps without at least a better understanding of what's going on and some idea of what benefits, if any, they bring. I also notice that the most current available version of the action is `v2.1.2`, which may fix some of these issues.

EDIT: the cache issues have been fixed by #10197.
Fixes #9599