
Fix #10273: Use Docker bind mounts for CI caching #10197


Merged
merged 3 commits into scala:master from ci-caching
Nov 10, 2020

Conversation

griggt
Contributor

@griggt griggt commented Nov 5, 2020

Issues with GitHub Actions caching are causing intermittent but significant slowdowns in CI (and, I believe, some spurious failures as well). Let's see what we can do about that.


The issues have been found to be caused by shortcomings in the implementation of the GitHub actions/cache@v1 action (and its associated backend service), and have been experienced by other users as well.

A number of tickets related to the problems we have been experiencing have been filed on their issue tracker.

These issues have been mostly mitigated in the newer actions/cache@v2, but extensive testing has uncovered that they are not completely eliminated, and occasionally cache restoration will take 20+ minutes.

This investigation also revealed something we had not fully appreciated: actions/cache stores its cache data on GitHub servers and must re-download it on every run, even when using self-hosted runners, which is less than ideal for our use case.

The solution proposed here is to drop the usage of actions/cache altogether, and instead use per-runner, locally stored, persistent caches, implemented using Docker bind mounts.

Fixes #10273

Member

@dottybot dottybot left a comment

Hello, and thank you for opening this PR! 🎉

All contributors have signed the CLA, thank you! ❤️

Have an awesome day! ☀️

@smarter
Member

smarter commented Nov 6, 2020

I think it's also worth trying having a single cache step that contains all the directories we want to cache instead of multiple steps.

@griggt
Contributor Author

griggt commented Nov 6, 2020

It is indeed possible to specify multiple paths to cache after upgrading to v2 of the cache action.
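For reference, a single combined step with actions/cache@v2 would look roughly like the sketch below (the paths, key, and restore-keys here are illustrative only, not the ones proposed in this PR):

    - uses: actions/cache@v2
      with:
        # v2 accepts multiple paths in a single step.
        path: |
          ~/.sbt
          ~/.ivy2/cache
          ~/.cache/coursier
        # Keying on the job id keeps caches for jobs with different
        # dependencies (e.g. test vs. community_build) separate.
        key: ${{ runner.os }}-${{ github.job }}-${{ hashFiles('**/build.sbt') }}
        restore-keys: |
          ${{ runner.os }}-${{ github.job }}-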

I shied away from that approach after some initial local testing, having found that although cache retrieval is much improved in v2, it will still occasionally hang for 10+ minutes while downloading the cache tarball from GitHub. It seemed that the larger the tarball, the greater the likelihood that the download would get stuck.

Also, I reworked the cache keys so that jobs with different dependencies will use different cache ids. The Coursier cache for the community build, for example, is much larger than and different from the cache for the compiler test job; with these changes, they will be stored separately. Using a single cache id (as was done before) for jobs that have very different dependencies results in a suboptimal cache for some of the jobs, because the cache action appears to create the cache for a given id only once, by whichever job happens to finish first. So if we use a shared cache for test and community_build and the test job finishes first, it will be the one to create the cache entry; the community_build job, by virtue of finishing later, will have its cache discarded rather than merged into the shared cache id, and subsequent runs of community_build will use the test job cache, which is missing a lot.

I'm rambling off the cuff, my apologies. I will work on a better (and hopefully clearer) write up later tonight after some further test runs and looking over my notes. But my local testing over the past few days leads me to be optimistic that we should see a marked improvement in CI performance (at least on Linux -- the Windows bottleneck is a different story of course).

@smarter
Member

smarter commented Nov 6, 2020

it will still occasionally hang for 10+ minutes while downloading the cache tarball from GitHub.

That's weird, but I don't think it's actually downloading things from GitHub: we run the GitHub runners on our own machines, and the caches get created on these machines, so all it should be doing is copying that cache into the docker container created for the current job and unpacking it there; if that takes 10 minutes, something is definitely wrong.

@griggt
Contributor Author

griggt commented Nov 6, 2020

The caches are stored on GitHub servers, even for self-hosted runners. This really surprised me. I thought it worked the way you describe.

The reason the cache steps are occasionally so slow right now is that the tarball download gets stuck, or proceeds at a glacial pace, because the GitHub cache endpoint stops sending data. There are several issues on the GitHub Actions issue tracker confirming this behavior and noting that the fix is included in the v2 upgrade, and I saw the same thing when I set up self-hosted runners locally with all sorts of network/disk/etc. monitors running.

The v1 cache actually did not delete the local tarballs so that may be what you're seeing.

The v2 cache action displays download progress, so when it does get stuck, it shows up in the logs.

Oh, and there is a 5GB per-repository quota and 7-day limit for the caches before they get evicted.

@smarter smarter closed this Nov 6, 2020
@smarter smarter reopened this Nov 6, 2020
@smarter
Member

smarter commented Nov 6, 2020

The v1 cache actually did not delete the local tarballs so that may be what you're seeing.

Ah yes probably. So if there's no way to bypass this behavior, maybe we should get rid of actions/cache and write our own action to cache things locally if that doesn't exist yet?

@griggt
Contributor Author

griggt commented Nov 6, 2020

Are you volunteering? :-)

I had the same thought many days ago when I started looking into this, but didn't want to invest an (even larger) chunk of time and looked for a simpler (pre-made) solution.

After digging through the issue tracker though, I'm surprised the thing works at all. It seems to be far from what I would call "mature".

@griggt
Contributor Author

griggt commented Nov 6, 2020

I think I might have seen that someone cobbled something together that uses AWS S3 for storage. I'll try to dig that up as an additional point of reference.

@smarter
Member

smarter commented Nov 6, 2020

I can't figure out if there's even a way to bind a volume in the container to use as a cache; maybe we could run rsyncd on the server, pass the local IP to the container, and connect to that?

@griggt griggt force-pushed the ci-caching branch 5 times, most recently from edbb82c to 7a9cd68 on November 6, 2020 04:31
@griggt
Contributor Author

griggt commented Nov 6, 2020

The GitHub Actions docs (https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idcontainer) seem to indicate that it is possible to pass docker options and additional volumes to bind as container parameters in the workflow file.

A community support message (https://github.community/t/docker-action-cant-set-docker-options/17343) from January indicates that the 'volumes' option may not (yet?) work, but that it was possible to pass the volume bind options directly to docker using the 'options' parameter.

Certainly not the most elegant solution, having to cut/paste the raw volume bind options everywhere, but it might work.
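A hedged sketch of how that might look in the workflow file, going by that community thread (the host-side cache path below is purely hypothetical):

    container:
      image: lampepfl/dotty:2020-04-24
      # Workaround if the 'volumes' key is not honored: pass the bind mount
      # directly to docker via the raw 'options' parameter.
      options: --volume /home/runner/ci-cache/sbt:/root/.sbt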

@griggt griggt force-pushed the ci-caching branch 3 times, most recently from dc9fc87 to 2f24952 on November 6, 2020 06:34
@griggt
Contributor Author

griggt commented Nov 6, 2020

Initial testing is looking good. I ran the workflow 12 times tonight, often 2 or 3 instances concurrently.

Cache restoration (download and extraction of the tarball from GitHub) now consistently takes between 10 and 60 seconds, depending on the size of the tarball. I have yet to see the cache download stall out or fail. Overall workflow run time is typically 35 minutes or less.

Still to do:

  • address caching for nightly scheduled jobs & release jobs
  • coherent write-up of the issues, changes, and cache management strategy
  • clean up and ready for review/merge

The existing nightly jobs have caching steps defined in the workflow file but they are basically NOPs since the v1 cache action does not support running on 'schedule' events (this restriction is lifted in v2).
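For context, the nightly jobs are driven by a workflow schedule trigger along these lines (the cron expression below is illustrative, not necessarily the project's actual schedule):

    on:
      schedule:
        # Runs triggered by this event were skipped by actions/cache@v1;
        # v2 caches them like any other run.
        - cron: '0 3 * * *'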

To me this already looks like a big improvement over the status quo, and at the least will buy us time to explore alternative solutions that don't require downloading the cache tarballs from GitHub on every run.

@griggt
Contributor Author

griggt commented Nov 7, 2020

if there's even a way to bind a volume in the container to use as a cache

I performed a quick experiment:

   test:
     runs-on: [self-hosted, Linux]
-    container: lampepfl/dotty:2020-04-24
+    container:
+      image: lampepfl/dotty:2020-04-24
+      volumes:
+        - ci-cache-sbt:/root/.sbt
+        - ci-cache-ivy:/root/.ivy2/cache
+        - ci-cache-general:/root/.cache

which was successful and demonstrated that it is possible to use Docker named volumes to create persistent, per-host, locally stored caches, should we want to proceed in that direction (or some variant thereof).

@smarter what do you think?

@smarter
Member

smarter commented Nov 8, 2020

Nice! So here ci-cache-sbt would be the name of a docker volume created on the host, right? If that works, that sounds great. One thing I worry about is concurrent modifications to the cache, since we have multiple runners on the same machine, but I don't see a way to access the name of the runner from the workflow config file, and sbt/coursier are supposed to be able to handle concurrent accesses correctly, so we could just try it and see if it's a problem in practice. It seems that docker will automatically create a named volume on the host if it doesn't exist, so it sounds like you could experiment with this in this PR without any config change to the hosts?

@griggt
Contributor Author

griggt commented Nov 8, 2020

So here ci-cache-sbt would be the name of a docker volume created on the host, right?

Exactly.

It seems that docker will automatically create a named volume on the host if it doesn't exist so it sounds like you could experiment with this in this PR without any config change to the hosts?

Indeed. I actually did the proof-of-concept test by opening a PR on one of my own repos without having created the docker volumes, precisely to test that this functionality works.

Of course I cannot delete any volumes that are created on the EPFL runners, but I presume that you or someone else can take care of any cleanup that becomes necessary.

One thing I worry about is concurrent modifications to the cache since we have multiple runners on the same machine

Yeah, I thought that might be the case. I'll play around with it a bit more on my end by adding more local runners and try to root out any concurrency issues.

I don't see a way to access the name of the runner from the workflow config file

I'll do some more digging on this too.

@griggt
Contributor Author

griggt commented Nov 8, 2020

sbt/coursier are supposed to be able to handle concurrent accesses correctly

Well, unfortunately it didn't take long to demonstrate that this doesn't seem to be the case. It failed the first test.

I set up two self-hosted runners and created fresh, empty docker volumes shared between them for the caches, as discussed above. The workflow consisted of two copies of the test job, one for each runner.

Neither job was able to finish loading the sbt project definition before failing.

It looks like the initial sbt startup and caching (in ~/.sbt and ~/.ivy2) was ok (serialized by a lock file?), but once coursier got involved things went sideways. Several of the jars in the coursier cache ended up corrupt and failed checksums (oversize, having been written to from both runners concurrently it seems), and there were many other errors as well.

The logs from each runner:

runner-1.txt
runner-2.txt

I repeated the test and it failed once again, in a similar yet possibly slightly different manner.

@griggt
Contributor Author

griggt commented Nov 8, 2020

I found that it is possible to create docker mounts that are isolated per-runner, but it is a touch hacky:

  test:
    runs-on: [self-hosted, Linux]
    container:
      image: lampepfl/dotty:2020-04-24
      volumes:
        - ${{ github.workspace }}/../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../cache/general:/root/.cache

where ${{ github.workspace }} expands to:

/home/drone/github-actions-runners/runner-2/_work/dotty/dotty
                                   ^^^^^^^^
                                   differs for each runner

which results in new directories being created on the host and used as the source for bind mounts at:

/home/drone/github-actions-runners/runner-2/_work/dotty/cache/sbt
/home/drone/github-actions-runners/runner-2/_work/dotty/cache/ivy
/home/drone/github-actions-runners/runner-2/_work/dotty/cache/general

The ../ is necessary to avoid the cache landing in the root directory of the checked-out repo.

One potential downside of this approach is that the cache dir is also visible in the container at /__w/dotty/cache, since /home/drone/github-actions-runners/runner-2/_work is automatically bind mounted at /__w.

I'm also not sure how much this approach relies on implementation-defined behavior that may be subject to change.

There don't seem to be any other context variables that allow distinguishing one runner from another in the workflow file. It would be nice if there were a ${{ runner.name }}, but even if there were, for some unknown reason the entire runner context does not seem to be in scope in the container volumes section of the workflow file, so it wouldn't do much good anyway.

In very limited testing, however, this solution does work so far, and provides locally stored, isolated per-runner, persistent cache.

@smarter
Member

smarter commented Nov 8, 2020

One potential downside of this approach is that the cache dir is also visible in the container at /__w/dotty/cache, since /home/drone/github-actions-runners/runner-2/_work is automatically bind mounted at /__w.

Not a big deal, but perhaps we can create these directories two levels up, in ${{ github.workspace }}/../../ ? Also, since those aren't docker named volumes, I imagine I have to go mkdir them?
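If we went that route, the volumes section would presumably become something like this (sketch only; whether the extra ../ resolves cleanly depends on the runner's work directory layout):

      volumes:
        - ${{ github.workspace }}/../../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../../cache/general:/root/.cache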

@griggt griggt force-pushed the ci-caching branch 2 times, most recently from 534ce6a to c31d202 on November 9, 2020 04:22
@griggt
Contributor Author

griggt commented Nov 9, 2020

One minor issue with using bind mounts rather than named volumes is that they obscure the existing contents of the target directory in the container. This means that the /root/.sbt/repositories file (for the artifactory proxy) from the Docker image is no longer visible since we mount the cache on top of /root/.sbt.

I have worked around this for now by adding a copy of the repositories file to the PR and a step in every CI job to copy it into place. However I think a cleaner solution would be to update the docker image to store the repositories file in a different location and point to it using -Dsbt.repository.config in the global sbt config file of the image. I can open a PR in lampepfl/dotty-ci to do this if so desired.
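Concretely, the workaround is a small step added to each job, along these lines (the in-repo location of the repositories file shown here is illustrative):

    - name: Add SBT proxy repositories
      # The bind mount hides the image's /root/.sbt contents, so copy the
      # repositories file from the checked-out repo back into place.
      run: cp -vf .github/workflows/repositories /root/.sbt/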

Other than that, testing went well. I will write up a summary for review so we can decide on a path forward.

@smarter
Member

smarter commented Nov 9, 2020

However I think a cleaner solution would be to update the docker image to store the repositories file in a different location and point to it using -Dsbt.repository.config in the global sbt config file of the image.

One problem with that is that Java properties are not propagated when spawning processes, and we spawn sbt processes in the community build, for example. So the solution based on copying the repositories file from the git repo actually sounds pretty good to me; it also makes it easier to tweak the repositories file if needed without having to regenerate the docker image.

@griggt
Contributor Author

griggt commented Nov 9, 2020

So the solution based on copying the repositories file from the git repo actually sounds pretty good to me, it also makes it easier to tweak the repositories file if needed without having to regenerate the docker image.

Okay, will leave it as is then. We should probably delete the repositories file from the docker image if we merge it into the repo here, to avoid any confusion about which one is in use.

@smarter
Member

smarter commented Nov 9, 2020

Other than that, testing went well. I will write up a summary for review so we can decide on a path forward.

Great to hear that, but by the way, have you checked whether this worked for the Windows jobs too?

@griggt
Contributor Author

griggt commented Nov 9, 2020

have you checked whether this worked for the Windows jobs too?

I have not. Are the Windows jobs running in a Docker container? I don't see one defined in the workflow file.

@smarter
Member

smarter commented Nov 9, 2020

Oh, I guess it runs directly on the host, so there's no special cache handling needed: https://github.com/lampepfl/dotty/blob/ca216a851037369fd83fd1bc31f3331a1a84424d/.github/workflows/ci.yaml#L119-L134
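In other words, those jobs have no container: key, so they look roughly like the sketch below (job name and labels are illustrative) and their caches simply persist on the host:

    test_windows:
      runs-on: [self-hosted, Windows]
      # No 'container:' entry, so the job runs directly on the Windows host
      # and anything cached there stays in the host user's home directory.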

@griggt
Contributor Author

griggt commented Nov 9, 2020

I guess it runs directly on the host

Sounds like a potential security vulnerability in the making, but I think that's out of scope for this PR.

@smarter
Member

smarter commented Nov 9, 2020

Sounds like a potential security vulnerability in the making

We're aware, but GitHub Actions apparently doesn't support Windows containers, so we just took some remedial steps (the machine is on its own network and can't access anything else).

@griggt
Contributor Author

griggt commented Nov 10, 2020

Solutions explored in this PR, and some pros/cons of each:

1. Upgrade to GitHub actions/cache@v2 + use more refined hash keys

This newer version of actions/cache mitigates the known issues with cache restores being intermittently slow and occasionally failing, and adds other improvements, such as the ability to use the cache for scheduled jobs. Refining the set of cache keys in use permits smaller, more focused per-job caches, and helps in caching more artifacts than we do currently.

Pros:

  • Least radical change compared to the status quo.
  • All runners have the same view of cache.
  • Hierarchical cache (feature branches and PRs have their own cache scope, etc.)

Cons:

  • All cache data is still stored on GitHub servers and cache tarballs must be downloaded at the start of each job that uses cache. Each tarball is several hundred MB.
  • The improvements in actions/cache@v2 don't completely solve the issues; GitHub cache endpoints still occasionally stop sending data, delaying cache restoration, sometimes by 20+ minutes.
  • Quota of 5GB cache per repository.
  • Cache is evicted after 7 days of non-access.
  • No mechanism to inspect which caches currently exist on GitHub servers, nor their size or contents.
  • Caches are read-only and can only be "updated" by using a new cache key, effectively creating a new cache. This makes it difficult to create caches that can be shared among jobs that may have different dependencies with a large common subset.
  • Management of cache keys to properly capture distinct caches and keep cache tarballs reasonably sized can be tricky.

2. Eliminate actions/cache, instead use Docker named volume mounts (per-host local cache)

This involves creating Docker named volumes for each cache on each runner host, and mounting them at the appropriate locations in the container filesystem. The cache is stored locally on each host, and shared among all of its runners.

Pros:

  • No network traffic to/from GitHub to save/restore cache.
  • No CPU/IO overhead to save/restore cache, since it is persistent.
  • All runners on a host see cache updates simultaneously.
  • Easy to implement in the workflow file.

Cons:

  • Cache updates on one host will not be visible to the others.
  • There does not appear to be a way to create per-runner volume names in the workflow file, so this option is limited to per-host caching, which may have concurrency issues.
  • PRs could pollute the cache (mostly an issue of wasted storage, I suppose)

This option turns out to be a non-starter, as it was discovered experimentally that concurrent updates to the coursier cache can result in corruption.

3. Eliminate actions/cache, instead use Docker bind mounts (per-runner local cache)

Similar to solution 2 above, but using bind mounts rather than named volumes, and isolating the caches on a per-runner rather than per-host basis. We use bind mounts rather than named volumes due to limitations around distinguishing runners in the GitHub actions workflow file.

Pros:

  • No network traffic to/from GitHub to save/restore cache.
  • No CPU/IO overhead to save/restore cache, since it is persistent.
  • Concurrency issues are eliminated compared to solution 2.
  • Easy to implement in the workflow file.

Cons:

  • Additional storage needed.
  • Cache updates on one runner will not be visible to the others.
  • Bind mounts obscure the existing contents of the target directory (minor issue w/workaround).
  • PRs could pollute the cache (mostly an issue of wasted storage, I suppose)
  • Makes a few (reasonable) assumptions about the GitHub Actions runner implementation.

Sizes of the local persistent caches after population via testing (may grow after other jobs run such as publish/release):

Cache          Size
sbt            ~166 MB
ivy            ~80 MB
coursier/mill  ~1.3 GB

@griggt
Contributor Author

griggt commented Nov 10, 2020

Aside: should we be caching the contents of ~/.mill? We aren't currently...

@smarter
Member

smarter commented Nov 10, 2020

Is there anything that gets put in .mill and not in .cache/mill these days?

@griggt
Contributor Author

griggt commented Nov 10, 2020

After running community_build_a:

$ du -hd2 /root
36M	/root/.ivy2/local
80M	/root/.ivy2/cache
115M	/root/.ivy2
1.1G	/root/.cache/coursier
8.0K	/root/.cache/JNA
1.1G	/root/.cache
6.0M	/root/.mill/ammonite
6.0M	/root/.mill
97M	/root/.sbt/boot
1.5M	/root/.sbt/1.0
68M	/root/.sbt/0.13
166M	/root/.sbt
1.4G	/root

Not sure what's in there.

@smarter
Member

smarter commented Nov 10, 2020

Seems to be a cache for compiled ammonite scripts (in our case, probably just from build files in community-build projects using mill), so not worth caching.

This enables per-runner, local, persistent cache for CI.
Docker bind mounts obscure the existing contents of a directory,
so the /root/.sbt/repositories file from the Docker image is no
longer visible after we bind mount on top of /root/.sbt
@griggt griggt changed the title Investigate CI caching issues Fix #10273: use Docker bind mounts for CI caching Nov 10, 2020
@griggt griggt changed the title Fix #10273: use Docker bind mounts for CI caching Fix #10273: Use Docker bind mounts for CI caching Nov 10, 2020
@griggt griggt marked this pull request as ready for review November 10, 2020 23:03
Member

@smarter smarter left a comment

Many thanks for investigating and fixing this!

@smarter smarter merged commit 306e4fc into scala:master Nov 10, 2020
@griggt griggt deleted the ci-caching branch November 10, 2020 23:44
griggt added a commit to griggt/dotty-ci that referenced this pull request Nov 11, 2020
This file has been incorporated into the Dotty repository
by scala/scala3#10197.
griggt added a commit to griggt/dotty-ci that referenced this pull request Nov 11, 2020
This file has been incorporated into the Dotty repository
and GitHub Actions workflow by scala/scala3#10197.
@griggt
Contributor Author

griggt commented Nov 11, 2020

I opened lampepfl/dotty-ci#18 to remove the repositories file.

romanowski pushed a commit to romanowski/dotty that referenced this pull request Nov 11, 2020
Successfully merging this pull request may close these issues.

GitHub Actions caching causing slowdown and spurious failures in CI