Fix #10273: Use Docker bind mounts for CI caching #10197
Conversation
Hello, and thank you for opening this PR! 🎉
All contributors have signed the CLA, thank you! ❤️
Have an awesome day! ☀️
I think it's also worth trying a single cache step that contains all the directories we want to cache, instead of multiple steps.
It is indeed possible to specify multiple paths to cache after upgrading to v2 of the cache action. I shied away from that approach after some initial testing locally, finding that although cache retrieval is much improved in v2, it will still occasionally hang for 10+ minutes while downloading the cache tarball from GitHub. It seemed that the larger the tarball was, the greater the likelihood that the download would get stuck.

Also, I reworked the cache keys so that jobs with different dependencies will use different cache ids. The Coursier cache for the community build is much larger and different than the cache for the compiler test job, for example, and with these changes, they will be stored separately. Using a single cache id (as was done before) for jobs that have very different dependencies results in a suboptimal cache for some of the jobs, as the way the cache action seems to work is that the cache for a given id will only be created once, and by the job that happens to finish first. So if we use a shared cache for all of the jobs, only the dependencies of whichever job happens to finish first actually end up in the cache.

I'm rambling off the cuff, my apologies. I will work on a better (and hopefully clearer) write-up later tonight after some further test runs and looking over my notes. But my local testing over the past few days leads me to be optimistic that we should see a marked improvement in CI performance (at least on Linux -- the Windows bottleneck is a different story of course).
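For reference, a rough sketch of what a multi-path cache step with a more refined, per-job key could look like under v2 (the paths and key names here are illustrative assumptions, not the exact configuration used in this PR):

```yaml
      - name: Cache dependencies (illustrative sketch)
        uses: actions/cache@v2
        with:
          # v2 accepts several paths in a single cache step
          path: |
            ~/.sbt
            ~/.ivy2/cache
            ~/.cache/coursier
          # per-job key so that e.g. the community build and the compiler
          # tests do not end up sharing a single oversized cache entry
          key: ${{ runner.os }}-compiler-tests-${{ hashFiles('**/build.sbt') }}
```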
That's weird, but I don't think it's actually downloading things from GitHub: we run the GitHub runners on our own machines, and the caches get created on these machines, so all it should be doing is copying that cache into the docker container created for the current job and unpacking it there. If that takes 10 minutes, something is definitely wrong.
The caches are stored on GitHub servers, even for self-hosted runners. This really surprised me; I thought it worked the way you describe. The reason the cache steps right now are occasionally so slow is that the tarball download gets stuck, or proceeds at a glacial pace, as the GitHub cache endpoint stops sending data. There are several issues on the GitHub Actions issue tracker confirming this is the problem, and the fix is included in the v2 upgrade. I saw the same thing when I set up self-hosted runners locally and had all sorts of network/disk/etc. monitors running.

The v1 cache action actually did not delete the local tarballs, so that may be what you're seeing. The v2 cache action displays download progress, so when it does get stuck, it shows up in the logs.

Oh, and there is a 5GB per-repository quota and a 7-day limit for the caches before they get evicted.
Ah yes, probably. So if there's no way to bypass this behavior, maybe we should get rid of actions/cache and write our own action to cache things locally, if that doesn't exist yet?
Are you volunteering? :-) I had the same thought many days ago when I started looking into this, but didn't want to invest an (even larger) chunk of time and looked for a simpler (pre-made) solution. After digging through the issue tracker though, I'm surprised the thing works at all. It seems to be far from what I would call "mature".
I think I might have seen that someone cobbled something together that uses AWS S3 for storage. I'll try to dig that up as an additional point of reference.
I can't figure out if there's even a way to bind a volume in the container to use as a cache; maybe we could run
The GitHub Actions docs (https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idcontainer) seem to indicate that it is possible to pass docker options and additional volumes to bind as container parameters in the workflow file. A community support message (https://github.community/t/docker-action-cant-set-docker-options/17343) from January indicates that the 'volumes' option may not (yet?) work, but that it was possible to pass the volume bind options directly to docker using the 'options' parameter. Certainly not the most elegant solution, having to cut/paste the raw volume bind options everywhere, but it might work.
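To make that workaround concrete, a minimal sketch of what passing raw bind flags through the container `options` parameter might look like (the volume names here are hypothetical, purely for illustration):

```yaml
  test:
    runs-on: [self-hosted, Linux]
    container:
      image: lampepfl/dotty:2020-04-24
      # hypothetical workaround: pass raw docker flags via 'options',
      # duplicated in every job, in case the 'volumes' key does not work
      options: >-
        -v ci-cache-sbt:/root/.sbt
        -v ci-cache-ivy:/root/.ivy2/cache
        -v ci-cache-general:/root/.cache
```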
Initial testing is looking good. I ran the workflow 12 times tonight, often 2 or 3 instances concurrently. Cache restoration (download and extraction of the tarball from GitHub) is now consistently varying between 10 and 60 seconds depending on the size of the tarball. I have yet to see the cache download stall out or fail. Overall workflow run time is typically 35 minutes or less.

Still to do:

- The existing nightly jobs have caching steps defined in the workflow file, but they are basically NOPs since the v1 cache action does not support running on 'schedule' events (this restriction is lifted in v2).

To me this already looks like a big improvement over the status quo, and at the least will buy us time to explore alternative solutions that don't require downloading the cache tarballs from GitHub on every run.
I performed a quick experiment:

```diff
   test:
     runs-on: [self-hosted, Linux]
-    container: lampepfl/dotty:2020-04-24
+    container:
+      image: lampepfl/dotty:2020-04-24
+      volumes:
+        - ci-cache-sbt:/root/.sbt
+        - ci-cache-ivy:/root/.ivy2/cache
+        - ci-cache-general:/root/.cache
```

which was successful and demonstrated that it is possible to use Docker named volumes to create persistent, per-host, locally stored caches, should we want to proceed in that direction (or some variant thereof).

@smarter what do you think?
Nice! So here the volumes are Docker named volumes that get created automatically on the host and shared between all of its runners?
Exactly.
Indeed. I actually did the proof-of-concept test by opening a PR on one of my own repos without having created the docker volumes, precisely to test that this functionality works. Of course I cannot delete any volumes that are created on the EPFL runners, but I presume that you or someone else can take care of any cleanup that becomes necessary.
Yeah, I thought that might be the case. I'll play around with it a bit more on my end by adding more local runners and try to root out any concurrency issues.
I'll do some more digging on this too.
Well, unfortunately it didn't take long to demonstrate that this doesn't seem to be the case. It failed the first test.

I set up two self-hosted runners and created fresh, empty docker volumes shared between them for the caches, as discussed above. The workflow consisted of two copies of the same job, run concurrently on the two runners.

Neither job was able to finish loading the sbt project definition before failing. It looks like the initial sbt startup and caching (in ~/.sbt and ~/.ivy2) was OK (serialized by a lock file?), but once coursier got involved things went sideways. Several of the jars in the coursier cache ended up corrupt and failed checksums (oversized, having been written to from both runners concurrently, it seems), and there were many other errors as well.

The logs from each runner:

I repeated the test and it failed once again, in a similar yet possibly slightly different manner.
I found that it is possible to create docker mounts that are isolated per-runner, but it is a touch hacky:

```yaml
  test:
    runs-on: [self-hosted, Linux]
    container:
      image: lampepfl/dotty:2020-04-24
      volumes:
        - ${{ github.workspace }}/../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../cache/general:/root/.cache
```

where the `${{ github.workspace }}` expression expands to a work directory that is unique to each runner, which results in new directories being created on the host and used as the source for bind mounts under that runner's work area.

One potential downside of this approach is that the cache directory is also visible inside the container, alongside the workspace. I'm also not sure how much this approach relies on implementation-defined behavior that may be subject to change. There don't seem to be other options as far as context variables go that allow distinguishing one runner from another in the workflow file; it would be nice if there were a dedicated context value for that.

In very limited testing, however, this solution does work so far, and provides a locally stored, isolated, per-runner, persistent cache.
Not a big deal, but perhaps we can create these directories two levels up instead.
One minor issue with using bind mounts rather than named volumes is that they obscure the existing contents of the target directory in the container. This means that the `/root/.sbt/repositories` file provided by the Docker image is no longer visible once we bind mount on top of `/root/.sbt`.

I have worked around this for now by adding a copy of the repositories file to the git repo and copying it into place during the workflow (an alternative might be to point sbt at the file via a Java property instead).

Other than that, testing went well. I will write up a summary for review so we can decide on a path forward.
One problem with that is that Java properties are not propagated when spawning processes, and we spawn sbt processes in the community build, for example. So the solution based on copying the repositories file from the git repo actually sounds pretty good to me; it also makes it easier to tweak the repositories file if needed without having to regenerate the docker image.
Okay, I will leave it as is then. We should probably delete the repositories file from the Docker image if we merge it into the repo here, to avoid any confusion about which one is in use.
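For concreteness, a minimal sketch of the kind of workflow step this workaround implies (the source path below is a hypothetical placeholder, not necessarily where the file lives in this PR):

```yaml
      # Restore the sbt repositories file that the bind mount over /root/.sbt
      # hides; 'project/repositories' is a hypothetical in-repo location.
      - name: Reinstate sbt repositories file
        run: cp project/repositories /root/.sbt/repositories
```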
Great to hear that, but by the way, have you checked whether this worked for the Windows jobs too?
I have not. Are the Windows jobs running in a Docker container? I don't see one defined in the workflow file.
Oh, I guess it runs directly on the host, so there's no special cache handling needed: https://github.com/lampepfl/dotty/blob/ca216a851037369fd83fd1bc31f3331a1a84424d/.github/workflows/ci.yaml#L119-L134
Sounds like a potential security vulnerability in the making, but I think that's out of scope for this PR.
We're aware, but GitHub Actions apparently doesn't support Windows containers, so we just took some remedial steps (the machine is in its own network and can't access anything else).
Solutions explored in this PR, and some pros/cons of each:

1. Upgrade to GitHub actions/cache@v2 + use more refined hash keys

This newer version of actions/cache has improvements to mitigate the known issues with cache restore being intermittently slow and occasionally failing, as well as other improvements, such as being able to use the cache for scheduled jobs. Refining the set of cache keys in use permits smaller, more focused per-job caches, and helps in caching more artifacts than we do currently.

Pros:

Cons:

2. Eliminate actions/cache, instead use Docker named volume mounts (per-host local cache)

This involves creating Docker named volumes for each cache on each runner host, and mounting them at the appropriate locations in the container filesystem. The cache is stored locally on each host, and shared among all of its runners.

Pros:

Cons:

This option turns out to be a non-starter, as it was discovered experimentally that concurrent updates to the coursier cache may result in corruption.

3. Eliminate actions/cache, instead use Docker bind mounts (per-runner local cache)

Similar to solution 2 above, but using bind mounts rather than named volumes, and isolating the caches on a per-runner rather than per-host basis. We use bind mounts rather than named volumes due to limitations around distinguishing runners in the GitHub Actions workflow file.

Pros:

Cons:

Sizes of the local persistent caches after population via testing (may grow after other jobs run, such as publish/release):
Aside: should we be caching the contents of `~/.mill` as well?
Is there anything that gets put in .mill and not in .cache/mill these days? |
After running community_build_a:
Not sure what's in there. |
Seems to be a cache for compiled ammonite scripts (in our case, probably just from build files in community-build projects using mill), not worth caching. |
This enables a per-runner, local, persistent cache for CI.
Docker bind mounts obscure the existing contents of a directory, so the /root/.sbt/repositories file from the Docker image is no longer visible after we bind mount on top of /root/.sbt
Many thanks for investigating and fixing this!
This file has been incorporated into the Dotty repository by scala/scala3#10197.
This file has been incorporated into the Dotty repository and GitHub Actions workflow by scala/scala3#10197.
I opened lampepfl/dotty-ci#18 to remove the repositories file.
Issues with GitHub Actions caching are causing significant intermittent slowdowns in CI (and also some spurious failures, I believe). Let's see what we can do about that.

The issues have been found to be caused by shortcomings in the implementation of the GitHub `actions/cache@v1` action (and its associated backend service), and have been experienced by other users as well. A number of tickets have been filed on their issue tracker related to the problems we have been experiencing.

These issues have been mostly mitigated in the newer `actions/cache@v2`, but extensive testing has uncovered that they are not completely eliminated, and occasionally cache restoration will take 20+ minutes.

This investigation also revealed that we did not necessarily realize that `actions/cache` stores its cache data on GitHub servers and must re-download it on every run, even when using self-hosted runners, which is less than ideal for our use case.

The solution proposed here is to drop the usage of `actions/cache` altogether, and instead use per-runner, locally stored, persistent caches, implemented using Docker bind mounts.

Fixes #10273
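As a rough illustration of the proposed shape (a simplified sketch based on the configuration discussed in the review thread above; the exact host paths and jobs in the final diff may differ):

```yaml
  test:
    runs-on: [self-hosted, Linux]
    container:
      image: lampepfl/dotty:2020-04-24
      volumes:
        # per-runner host directories bind-mounted over the usual
        # dependency cache locations inside the container
        - ${{ github.workspace }}/../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../cache/general:/root/.cache
```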