
Fix #10273: Use Docker bind mounts for CI caching #10197


Merged
merged 3 commits into scala:master from ci-caching
Nov 10, 2020

Conversation

griggt
Contributor

@griggt griggt commented Nov 5, 2020

Issues with GitHub Actions caching are causing intermittent but significant slowdowns in CI (and, I believe, some spurious failures as well). Let's see what we can do about that.


The issues have been found to be caused by shortcomings in the implementation of the GitHub actions/cache@v1 action (and its associated backend service), and have been experienced by other users as well.

A number of tickets related to the problems we have been experiencing have been filed on their issue tracker.

These issues have been mostly mitigated in the newer actions/cache@v2, but extensive testing has uncovered that they are not completely eliminated, and occasionally cache restoration will take 20+ minutes.

This investigation also revealed something we had not fully appreciated: actions/cache stores its cache data on GitHub servers and must re-download it on every run, even when using self-hosted runners, which is less than ideal for our use case.

The solution proposed here is to drop the usage of actions/cache altogether, and instead use per-runner, locally stored, persistent caches, implemented using Docker bind mounts.

Fixes #10273

Member

@dottybot dottybot left a comment

Hello, and thank you for opening this PR! 🎉

All contributors have signed the CLA, thank you! ❤️

Have an awesome day! ☀️

@smarter
Member

smarter commented Nov 6, 2020

I think it's also worth trying having a single cache step that contains all the directories we want to cache instead of multiple steps.

@griggt
Contributor Author

griggt commented Nov 6, 2020

It is indeed possible to specify multiple paths to cache after upgrading to v2 of the cache action.
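For reference, a single combined step with actions/cache@v2 would look roughly like the sketch below (the paths, key, and restore-keys here are illustrative only, not the ones proposed in this PR):

    - uses: actions/cache@v2
      with:
        # v2 accepts multiple paths in a single step.
        path: |
          ~/.sbt
          ~/.ivy2/cache
          ~/.cache/coursier
        # Keying on the job id keeps caches for jobs with different
        # dependencies (e.g. test vs. community_build) separate.
        key: ${{ runner.os }}-${{ github.job }}-${{ hashFiles('**/build.sbt') }}
        restore-keys: |
          ${{ runner.os }}-${{ github.job }}-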

I shied away from that approach after some initial local testing, having found that although cache retrieval is much improved in v2, it will still occasionally hang for 10+ minutes while downloading the cache tarball from GitHub. It seemed that the larger the tarball, the greater the likelihood that the download would get stuck.

Also, I reworked the cache keys so that jobs with different dependencies will use different cache ids. The Coursier cache for the community build, for example, is much larger than and different from the cache for the compiler test job; with these changes, they will be stored separately. Using a single cache id (as was done before) for jobs that have very different dependencies results in a suboptimal cache for some of the jobs, because the cache action appears to create the cache for a given id only once, by whichever job happens to finish first. So if we use a shared cache for test and community_build and the test job finishes first, it will be the one to create the cache entry; the community_build job, by virtue of finishing later, will have its cache discarded rather than merged into the shared cache id, and subsequent runs of community_build will use the test job cache, which is missing a lot.

I'm rambling off the cuff, my apologies. I will work on a better (and hopefully clearer) write up later tonight after some further test runs and looking over my notes. But my local testing over the past few days leads me to be optimistic that we should see a marked improvement in CI performance (at least on Linux -- the Windows bottleneck is a different story of course).

@smarter
Member

smarter commented Nov 6, 2020

it will still occasionally hang for 10+ minutes while downloading the cache tarball from GitHub.

That's weird, but I don't think it's actually downloading things from GitHub: we run the GitHub runners on our own machines, and the caches get created on these machines, so all it should be doing is copying that cache into the docker container created for the current job and unpacking it there; if that takes 10 minutes, something is definitely wrong.

@griggt
Contributor Author

griggt commented Nov 6, 2020

The caches are stored on GitHub servers, even for self-hosted runners. This really surprised me. I thought it worked the way you describe.

The reason the cache steps are occasionally so slow right now is that the tarball download gets stuck, or proceeds at a glacial pace, because the GitHub cache endpoint stops sending data. There are several issues on the GitHub Actions issue tracker confirming this behavior and noting that the fix is included in the v2 upgrade, and I saw the same thing when I set up self-hosted runners locally with all sorts of network/disk/etc. monitors running.

The v1 cache actually did not delete the local tarballs so that may be what you're seeing.

The v2 cache action displays download progress, so when it does get stuck, it shows up in the logs.

Oh, and there is a 5GB per-repository quota and 7-day limit for the caches before they get evicted.

@smarter smarter closed this Nov 6, 2020
@smarter smarter reopened this Nov 6, 2020
@smarter
Member

smarter commented Nov 6, 2020

The v1 cache actually did not delete the local tarballs so that may be what you're seeing.

Ah yes probably. So if there's no way to bypass this behavior, maybe we should get rid of actions/cache and write our own action to cache things locally if that doesn't exist yet?

@griggt
Contributor Author

griggt commented Nov 6, 2020

Are you volunteering? :-)

I had the same thought many days ago when I started looking into this, but didn't want to invest an (even larger) chunk of time and looked for a simpler (pre-made) solution.

After digging through the issue tracker though, I'm surprised the thing works at all. It seems to be far from what I would call "mature".

@griggt
Contributor Author

griggt commented Nov 6, 2020

I think I might have seen that someone cobbled something together that uses AWS S3 for storage. I'll try to dig that up as an additional point of reference.

@smarter
Member

smarter commented Nov 6, 2020

I can't figure out if there's even a way to bind a volume in the container to use as a cache; maybe we could run rsyncd on the server, pass the local IP to the container, and connect to that?

@griggt griggt force-pushed the ci-caching branch 5 times, most recently from edbb82c to 7a9cd68 on November 6, 2020 04:31
@griggt
Contributor Author

griggt commented Nov 6, 2020

The GitHub Actions docs (https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idcontainer) seem to indicate that it is possible to pass docker options and additional volumes to bind as container parameters in the workflow file.

A community support message (https://github.community/t/docker-action-cant-set-docker-options/17343) from January indicates that the 'volumes' option may not (yet?) work, but that it was possible to pass the volume bind options directly to docker using the 'options' parameter.

Certainly not the most elegant solution, having to cut/paste the raw volume bind options everywhere, but it might work.
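A hedged sketch of how that might look in the workflow file, going by that community thread (the host-side cache path below is purely hypothetical):

    container:
      image: lampepfl/dotty:2020-04-24
      # Workaround if the 'volumes' key is not honored: pass the bind mount
      # directly to docker via the raw 'options' parameter.
      options: --volume /home/runner/ci-cache/sbt:/root/.sbt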

@griggt griggt force-pushed the ci-caching branch 3 times, most recently from dc9fc87 to 2f24952 on November 6, 2020 06:34
@griggt
Contributor Author

griggt commented Nov 6, 2020

Initial testing is looking good. I ran the workflow 12 times tonight, often 2 or 3 instances concurrently.

Cache restoration (download and extraction of the tarball from GitHub) now consistently takes between 10 and 60 seconds, depending on the size of the tarball. I have yet to see the cache download stall out or fail. Overall workflow run time is typically 35 minutes or less.

Still to do:

  • address caching for nightly scheduled jobs & release jobs
  • coherent write-up of the issues, changes, and cache management strategy
  • clean up and ready for review/merge

The existing nightly jobs have caching steps defined in the workflow file but they are basically NOPs since the v1 cache action does not support running on 'schedule' events (this restriction is lifted in v2).
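For context, the nightly jobs are driven by a workflow schedule trigger along these lines (the cron expression below is illustrative, not necessarily the project's actual schedule):

    on:
      schedule:
        # Runs triggered by this event were skipped by actions/cache@v1;
        # v2 caches them like any other run.
        - cron: '0 3 * * *'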

To me this already looks like a big improvement over the status quo, and at the least will buy us time to explore alternative solutions that don't require downloading the cache tarballs from GitHub on every run.

@griggt
Contributor Author

griggt commented Nov 7, 2020

if there's even a way to bind a volume in the container to use as a cache

I performed a quick experiment:

   test:
     runs-on: [self-hosted, Linux]
-    container: lampepfl/dotty:2020-04-24
+    container:
+      image: lampepfl/dotty:2020-04-24
+      volumes:
+        - ci-cache-sbt:/root/.sbt
+        - ci-cache-ivy:/root/.ivy2/cache
+        - ci-cache-general:/root/.cache

which was successful and demonstrated that it is possible to use Docker named volumes to create persistent, per-host, locally stored caches, should we want to proceed in that direction (or some variant thereof).

@smarter what do you think?

@smarter
Member

smarter commented Nov 8, 2020

Nice! So here ci-cache-sbt would be the name of a docker volume created on the host, right? If that works, that sounds great. One thing I worry about is concurrent modifications to the cache, since we have multiple runners on the same machine, but I don't see a way to access the name of the runner from the workflow config file, and sbt/coursier are supposed to be able to handle concurrent accesses correctly, so we could just try it and see if it's a problem in practice. It seems that docker will automatically create a named volume on the host if it doesn't exist, so it sounds like you could experiment with this in this PR without any config change to the hosts?

@griggt
Contributor Author

griggt commented Nov 8, 2020

So here ci-cache-sbt would be the name of a docker volume created on the host, right?

Exactly.

It seems that docker will automatically create a named volume on the host if it doesn't exist so it sounds like you could experiment with this in this PR without any config change to the hosts?

Indeed. I actually did the proof-of-concept test by opening a PR on one of my own repos without having created the docker volumes, precisely to test that this functionality works.

Of course I cannot delete any volumes that are created on the EPFL runners, but I presume that you or someone else can take care of any cleanup that becomes necessary.

One thing I worry about is concurrent modifications to the cache since we have multiple runners on the same machine

Yeah, I thought that might be the case. I'll play around with it a bit more on my end by adding more local runners and try to root out any concurrency issues.

I don't see a way to access the name of the runner from the workflow config file

I'll do some more digging on this too.

@griggt
Contributor Author

griggt commented Nov 8, 2020

sbt/coursier are supposed to be able to handle concurrent accesses correctly

Well, unfortunately it didn't take long to demonstrate that this doesn't seem to be the case. It failed the first test.

I set up two self-hosted runners and created fresh, empty docker volumes shared between them for the caches, as discussed above. The workflow consisted of two copies of the test job, one for each runner.

Neither job was able to finish loading the sbt project definition before failing.

It looks like the initial sbt startup and caching (in ~/.sbt and ~/.ivy2) was ok (serialized by a lock file?), but once coursier got involved things went sideways. Several of the jars in the coursier cache ended up corrupt and failed checksums (oversize, having been written to from both runners concurrently it seems), and there were many other errors as well.

The logs from each runner:

runner-1.txt
runner-2.txt

I repeated the test and it failed once again, in a similar yet possibly slightly different manner.

@griggt
Contributor Author

griggt commented Nov 8, 2020

I found that it is possible to create docker mounts that are isolated per-runner, but it is a touch hacky:

  test:
    runs-on: [self-hosted, Linux]
    container:
      image: lampepfl/dotty:2020-04-24
      volumes:
        - ${{ github.workspace }}/../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../cache/general:/root/.cache

where ${{ github.workspace }} expands to:

/home/drone/github-actions-runners/runner-2/_work/dotty/dotty
                                   ^^^^^^^^
                                   differs for each runner

which results in new directories being created on the host and used as the source for bind mounts at:

/home/drone/github-actions-runners/runner-2/_work/dotty/cache/sbt
/home/drone/github-actions-runners/runner-2/_work/dotty/cache/ivy
/home/drone/github-actions-runners/runner-2/_work/dotty/cache/general

The ../ is necessary to avoid the cache landing in the root directory of the checked-out repo.

One potential downside of this approach is that the cache dir is also visible in the container at /__w/dotty/cache, since /home/drone/github-actions-runners/runner-2/_work is automatically bind mounted at /__w.

I'm also not sure how much this approach relies on implementation-defined behavior that may be subject to change.

There don't seem to be any other context variables that allow distinguishing one runner from another in the workflow file. It would be nice if there were a ${{ runner.name }}, but even if there were, for some unknown reason the entire runner context does not seem to be in scope in the container volumes section of the workflow file, so it wouldn't do much good anyway.

In very limited testing, however, this solution does work so far, and provides locally stored, isolated per-runner, persistent cache.

@smarter
Member

smarter commented Nov 8, 2020

One potential downside of this approach is that the cache dir is also visible in the container at /__w/dotty/cache, since /home/drone/github-actions-runners/runner-2/_work is automatically bind mounted at /__w.

Not a big deal, but perhaps we can create these directories two levels up, in ${{ github.workspace }}/../../ ? Also, since those aren't docker named volumes, I imagine I have to go mkdir them?
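If we went that route, the volumes section would presumably become something like this (sketch only; whether the extra ../ resolves cleanly depends on the runner's work directory layout):

      volumes:
        - ${{ github.workspace }}/../../cache/sbt:/root/.sbt
        - ${{ github.workspace }}/../../cache/ivy:/root/.ivy2/cache
        - ${{ github.workspace }}/../../cache/general:/root/.cache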

@griggt griggt force-pushed the ci-caching branch 2 times, most recently from 534ce6a to c31d202 on November 9, 2020 04:22
@griggt
Contributor Author

griggt commented Nov 9, 2020

One minor issue with using bind mounts rather than named volumes is that they obscure the existing contents of the target directory in the container. This means that the /root/.sbt/repositories file (for the artifactory proxy) from the Docker image is no longer visible since we mount the cache on top of /root/.sbt.

I have worked around this for now by adding a copy of the repositories file to the PR and a step in every CI job to copy it into place. However I think a cleaner solution would be to update the docker image to store the repositories file in a different location and point to it using -Dsbt.repository.config in the global sbt config file of the image. I can open a PR in lampepfl/dotty-ci to do this if so desired.
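Concretely, the workaround is a small step added to each job, along these lines (the in-repo location of the repositories file shown here is illustrative):

    - name: Add SBT proxy repositories
      # The bind mount hides the image's /root/.sbt contents, so copy the
      # repositories file from the checked-out repo back into place.
      run: cp -vf .github/workflows/repositories /root/.sbt/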

Other than that, testing went well. I will write up a summary for review so we can decide on a path forward.

@smarter
Member

smarter commented Nov 9, 2020

However I think a cleaner solution would be to update the docker image to store the repositories file in a different location and point to it using -Dsbt.repository.config in the global sbt config file of the image.

One problem with that is that Java properties are not propagated when spawning processes, and we spawn sbt processes in the community build, for example. So the solution based on copying the repositories file from the git repo actually sounds pretty good to me; it also makes it easier to tweak the repositories file if needed without having to regenerate the docker image.

@griggt
Contributor Author

griggt commented Nov 9, 2020

So the solution based on copying the repositories file from the git repo actually sounds pretty good to me, it also makes it easier to tweak the repositories file if needed without having to regenerate the docker image.

Okay, will leave it as is then. We should probably delete the repositories file from the docker image if we merge it into the repo here, to avoid any confusion about which one is in use.

@smarter
Member

smarter commented Nov 9, 2020

Other than that, testing went well. I will write up a summary for review so we can decide on a path forward.

Great to hear that, but by the way, have you checked whether this worked for the Windows jobs too?

@griggt
Contributor Author

griggt commented Nov 9, 2020

have you checked whether this worked for the Windows jobs too?

I have not. Are the Windows jobs running in a Docker container? I don't see one defined in the workflow file.

@smarter
Member

smarter commented Nov 9, 2020

Oh, I guess it runs directly on the host, so there's no special cache handling needed: https://github.com/lampepfl/dotty/blob/ca216a851037369fd83fd1bc31f3331a1a84424d/.github/workflows/ci.yaml#L119-L134
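In other words, those jobs have no container: key, so they look roughly like the sketch below (job name and labels are illustrative) and their caches simply persist on the host:

    test_windows:
      runs-on: [self-hosted, Windows]
      # No 'container:' entry, so the job runs directly on the Windows host
      # and anything cached there stays in the host user's home directory.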

@griggt
Contributor Author

griggt commented Nov 9, 2020

I guess it runs directly on the host

Sounds like a potential security vulnerability in the making, but I think that's out of scope for this PR.

@smarter
Member

smarter commented Nov 9, 2020

Sounds like a potential security vulnerability in the making

We're aware, but GitHub Actions apparently doesn't support Windows containers, so we just took some remedial steps (the machine is on its own network and can't access anything else).

@griggt
Contributor Author

griggt commented Nov 10, 2020

Solutions explored in this PR, and some pros/cons of each:

1. Upgrade to GitHub actions/cache@v2 + use more refined hash keys

This newer version of actions/cache mitigates the known issues with cache restores being intermittently slow and occasionally failing, and adds other improvements, such as the ability to use the cache for scheduled jobs. Refining the set of cache keys in use permits smaller, more focused per-job caches, and helps in caching more artifacts than we do currently.

Pros:

  • Least radical change compared to the status quo.
  • All runners have the same view of cache.
  • Hierarchical cache (feature branches and PRs have their own cache scope, etc.)

Cons:

  • All cache data is still stored on GitHub servers and cache tarballs must be downloaded at the start of each job that uses cache. Each tarball is several hundred MB.
  • The improvements in actions/cache@v2 don't completely solve the issues; GitHub cache endpoints still occasionally stop sending data, delaying cache restoration, sometimes by 20+ minutes.
  • Quota of 5GB cache per repository.
  • Cache is evicted after 7 days of non-access.
  • No mechanism to inspect which caches currently exist on GitHub servers, nor their size or contents.
  • Caches are read-only and can only be "updated" by using a new cache key, effectively creating a new cache. This makes it difficult to create caches that can be shared among jobs that may have different dependencies with a large common subset.
  • Management of cache keys to properly capture distinct caches and keep cache tarballs reasonably sized can be tricky.

2. Eliminate actions/cache, instead use Docker named volume mounts (per-host local cache)

This involves creating Docker named volumes for each cache on each runner host, and mounting them at the appropriate locations in the container filesystem. The cache is stored locally on each host, and shared among all of its runners.

Pros:

  • No network traffic to/from GitHub to save/restore cache.
  • No CPU/IO overhead to save/restore cache, since it is persistent.
  • All runners on a host see cache updates simultaneously.
  • Easy to implement in the workflow file.

Cons:

  • Cache updates on one host will not be visible to the others.
  • There does not appear to be a way to create per-runner volume names in the workflow file, so this option is limited to per-host caching, which may have concurrency issues.
  • PRs could pollute the cache (mostly an issue of wasted storage, I suppose)

This option turns out to be a non-starter, as it was discovered experimentally that concurrent updates to the coursier cache can result in corruption.

3. Eliminate actions/cache, instead use Docker bind mounts (per-runner local cache)

Similar to solution 2 above, but using bind mounts rather than named volumes, and isolating the caches on a per-runner rather than per-host basis. We use bind mounts rather than named volumes due to limitations around distinguishing runners in the GitHub actions workflow file.

Pros:

  • No network traffic to/from GitHub to save/restore cache.
  • No CPU/IO overhead to save/restore cache, since it is persistent.
  • Concurrency issues are eliminated compared to solution 2.
  • Easy to implement in the workflow file.

Cons:

  • Additional storage needed.
  • Cache updates on one runner will not be visible to the others.
  • Bind mounts obscure the existing contents of the target directory (minor issue w/workaround).
  • PRs could pollute the cache (mostly an issue of wasted storage, I suppose)
  • Makes a few (reasonable) assumptions about the GitHub Actions runner implementation.

Sizes of the local persistent caches after population via testing (may grow after other jobs run such as publish/release):

Cache          Size
sbt            ~166 MB
ivy            ~80 MB
coursier/mill  ~1.3 GB

@griggt
Contributor Author

griggt commented Nov 10, 2020

Aside: should we be caching the contents of ~/.mill? We aren't currently...

@smarter
Member

smarter commented Nov 10, 2020

Is there anything that gets put in .mill and not in .cache/mill these days?

@griggt
Contributor Author

griggt commented Nov 10, 2020

After running community_build_a:

$ du -hd2 /root
36M	/root/.ivy2/local
80M	/root/.ivy2/cache
115M	/root/.ivy2
1.1G	/root/.cache/coursier
8.0K	/root/.cache/JNA
1.1G	/root/.cache
6.0M	/root/.mill/ammonite
6.0M	/root/.mill
97M	/root/.sbt/boot
1.5M	/root/.sbt/1.0
68M	/root/.sbt/0.13
166M	/root/.sbt
1.4G	/root

Not sure what's in there.

@smarter
Member

smarter commented Nov 10, 2020

Seems to be a cache for compiled ammonite scripts (in our case, probably just from build files in community-build projects using mill), so not worth caching.

This enables per-runner, local, persistent cache for CI.
Docker bind mounts obscure the existing contents of a directory,
so the /root/.sbt/repositories file from the Docker image is no
longer visible after we bind mount on top of /root/.sbt
@griggt griggt changed the title Investigate CI caching issues Fix #10273: use Docker bind mounts for CI caching Nov 10, 2020
@griggt griggt changed the title Fix #10273: use Docker bind mounts for CI caching Fix #10273: Use Docker bind mounts for CI caching Nov 10, 2020
@griggt griggt marked this pull request as ready for review November 10, 2020 23:03
Member

@smarter smarter left a comment

Many thanks for investigating and fixing this!

@smarter smarter merged commit 306e4fc into scala:master Nov 10, 2020
@griggt griggt deleted the ci-caching branch November 10, 2020 23:44
griggt added a commit to griggt/dotty-ci that referenced this pull request Nov 11, 2020
This file has been incorporated into the Dotty repository
by scala/scala3#10197.
griggt added a commit to griggt/dotty-ci that referenced this pull request Nov 11, 2020
This file has been incorporated into the Dotty repository
and GitHub Actions workflow by scala/scala3#10197.
@griggt
Contributor Author

griggt commented Nov 11, 2020

I opened lampepfl/dotty-ci#18 to remove the repositories file.

romanowski pushed a commit to romanowski/dotty that referenced this pull request Nov 11, 2020
Successfully merging this pull request may close these issues.

GitHub Actions caching causing slowdown and spurious failures in CI