# CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

## Overview

**CRFS** is a read-only FUSE filesystem that lets you mount a
container image, served directly from a container registry (such as
[gcr.io](https://gcr.io/)), without pulling it all locally first.

## Background

Starting a container should be fast. Currently, however, starting a
container in many environments requires doing a `pull` operation from
a container registry to read the entire container image from the
registry and write the entire container image to the local machine's
disk. It's pretty silly (and wasteful) that a read operation becomes a
write operation. For small containers, this problem is rarely noticed.
For larger containers, though, the pull operation quickly becomes the
slowest part of launching a container, especially on a cold node.
Contrast this with launching a VM on major cloud providers: even with
a VM image that's hundreds of gigabytes, the VM boots in seconds.
That's because the hypervisors' block devices are reading from the
network on demand. The cloud providers all have great internal
networks. Why aren't we using those great internal networks to read
our container images on demand?

## Why does Go want this?

Go's continuous build system tests Go on [many operating systems and
architectures](https://build.golang.org/), using a mix of containers
(mostly for Linux) and VMs (for other operating systems). We
prioritize fast builds, targeting 5 minute turnaround for pre-submit
tests when testing new changes. For isolation and other reasons, we
run all our containers in single-use fresh VMs. Generally our
containers do start quickly, but some of our containers are very large
and take a long time to start. To work around that, we've automated
the creation of VM images where our heavy containers are pre-pulled.
This is all a silly workaround. It'd be much better if we could just
read the bytes over the network from the right place, without all
the hoops.

## Tar files

One reason that reading the bytes directly from the source on demand
is somewhat non-trivial is that container images are, somewhat
regrettably, represented by *tar.gz* files, and tar files are
unindexed, and gzip streams are not seekable. This means that trying
to read 1KB out of a file named `/var/lib/foo/data` still involves
pulling hundreds of gigabytes to uncompress the stream and decode the
entire tar file until you find the entry you're looking for. You can't
look it up by its path name.
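
To make that cost concrete, here's a minimal Go sketch (illustration
only, not CRFS code) of what serving one small file from a plain
*tar.gz* layer requires: decompressing the whole stream and walking
every entry that comes before the one you want.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"os"
)

// extractOne shows why plain tar.gz is a poor random-access format: to
// serve one entry we must decompress and walk every entry before it.
func extractOne(r io.Reader, name string, w io.Writer) error {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer zr.Close()
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return fmt.Errorf("%s not found in archive", name)
		}
		if err != nil {
			return err
		}
		if hdr.Name == name {
			_, err := io.Copy(w, tr) // finally: the bytes we wanted
			return err
		}
	}
}

func main() {
	f, err := os.Open("layer.tar.gz") // hypothetical multi-GB layer
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := extractOne(f, "var/lib/foo/data", os.Stdout); err != nil {
		panic(err)
	}
}
```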

## Introducing Stargz

Fortunately, we can fix the fact that *tar.gz* files are unindexed and
unseekable, while still making the file a valid *tar.gz* file, by
taking advantage of two properties of tar files and gzip compression:
you can concatenate tar files together to make a valid tar file, and
you can concatenate multiple gzip streams together and have a valid
gzip stream.
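
The gzip half of that claim is easy to demonstrate. A tiny
self-contained Go sketch: two independently compressed gzip streams,
simply concatenated, still decode as one valid gzip stream (Go's
`gzip.Reader` reads multi-member streams by default, and `gunzip`
behaves the same way).

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gz compresses p into a standalone gzip stream.
func gz(p []byte) []byte {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(p)
	zw.Close()
	return buf.Bytes()
}

func main() {
	// Two independent gzip streams, naively concatenated...
	cat := append(gz([]byte("hello, ")), gz([]byte("world\n"))...)

	// ...still decode as a single valid gzip stream.
	zr, err := gzip.NewReader(bytes.NewReader(cat))
	if err != nil {
		panic(err)
	}
	out, _ := io.ReadAll(zr)
	fmt.Print(string(out)) // prints "hello, world"
}
```

Concatenating tar files has one extra wrinkle: the end-of-archive
blocks must be written only once, at the very end, which the layout
sketch in the next section accounts for.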

We introduce a format, **Stargz**, a **S**eekable
**tar.gz** format that's still a valid tar.gz file for everything else
that's unaware of these details.

In summary:

* The traditional `*.tar.gz` format is: `GZIP(TAR(file1 + file2 + file3))`
* Stargz's format is: `GZIP(TAR(file1)) + GZIP(TAR(file2)) + GZIP(TAR(file3_chunk1)) + GZIP(TAR(file3_chunk2)) + GZIP(TAR(index of earlier files in magic file))`, where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall **stargz** file (see the sketch below).

This makes images a few percent larger (due to more gzip headers and
loss of compression context between files), but it's plenty
acceptable.
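
To make that layout concrete, here's a simplified, hypothetical Go
sketch of writing such a stream. It is not the actual stargz writer in
this repo: the index shape, field names, and the `stargz.index.json`
entry name are illustrative assumptions, and large-file chunking is
omitted. Each file becomes its own `GZIP(TAR(...))` member, and a
final member carries a JSON index of where each member starts.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
)

// indexEntry records where a file's GZIP(TAR(...)) member begins in the
// overall stream. (A real TOC would also carry sizes, chunk offsets,
// modes, and so on. Error handling is elided throughout this sketch.)
type indexEntry struct {
	Name   string `json:"name"`
	Offset int64  `json:"offset"`
}

// appendMember appends one GZIP(TAR(single file)) member to out. We Flush
// the tar writer instead of Closing it so that no end-of-archive blocks
// land mid-stream; the concatenation stays one valid tar.
func appendMember(out *bytes.Buffer, name string, data []byte) {
	zw := gzip.NewWriter(out)
	tw := tar.NewWriter(zw)
	tw.WriteHeader(&tar.Header{Name: name, Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(data))})
	tw.Write(data)
	tw.Flush() // pad to the tar block boundary, but no end-of-archive yet
	zw.Close()
}

func main() {
	var out bytes.Buffer
	var index []indexEntry

	files := []struct {
		name string
		data []byte
	}{
		{"etc/hello.txt", []byte("hello\n")},
		{"usr/share/big.bin", bytes.Repeat([]byte{0xAB}, 1<<20)},
	}
	for _, f := range files {
		index = append(index, indexEntry{Name: f.name, Offset: int64(out.Len())})
		appendMember(&out, f.name, f.data)
	}

	// Final member: the index itself, stored as an ordinary tar entry so
	// that unaware readers still see a plain, valid tar.gz.
	toc, _ := json.Marshal(index)
	zw := gzip.NewWriter(&out)
	tw := tar.NewWriter(zw)
	tw.WriteHeader(&tar.Header{Name: "stargz.index.json", Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(toc))})
	tw.Write(toc)
	tw.Close() // now write the tar end-of-archive blocks
	zw.Close()

	fmt.Printf("stargz-like blob: %d bytes, %d indexed files\n", out.Len(), len(index))
}
```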

## Converting images

If you're using `docker push` to push to a registry, you can't use
CRFS to mount the image. Maybe one day `docker push` will push
*stargz* files (or something with similar properties) by default, but
not yet. So for now we need to convert the storage image layers from
*tar.gz* into *stargz*. There is a tool that does that. **TODO: examples**

## Operation

When mounting an image, the FUSE filesystem does a couple of Docker
Registry HTTP API requests to the container registry to get the
metadata for the container and all its layers.
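
For illustration, here's a minimal Go sketch of the manifest request
against the Docker Registry v2 HTTP API. Auth/token handling is
omitted and the image name is only an example; the manifest is what
tells us each layer's digest and size.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// manifest mirrors the fields of a Docker Registry v2 image manifest
// that lazy mounting cares about: the layer digests and sizes.
type manifest struct {
	Layers []struct {
		MediaType string `json:"mediaType"`
		Size      int64  `json:"size"`
		Digest    string `json:"digest"`
	} `json:"layers"`
}

// fetchManifest asks the registry for an image's manifest. Token
// exchange is omitted; some registries serve public images without it.
func fetchManifest(registry, repo, ref string) (*manifest, error) {
	url := fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, repo, ref)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("manifest fetch: %s", res.Status)
	}
	var m manifest
	if err := json.NewDecoder(res.Body).Decode(&m); err != nil {
		return nil, err
	}
	return &m, nil
}

func main() {
	m, err := fetchManifest("gcr.io", "your-proj/container", "v2stargz") // hypothetical image
	if err != nil {
		panic(err)
	}
	for _, l := range m.Layers {
		fmt.Println(l.Digest, l.Size)
	}
}
```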

It then does HTTP Range requests to read just the **stargz** index out
of the end of each of the layers. The index is stored similarly to how
the ZIP format's TOC is stored: a pointer to the index sits at the
very end of the file. Generally it takes 1 HTTP request to read the
index, but no more than 2. In any case, we're assuming a fast network
(GCE VMs to gcr.io, or similar) with low latency to the container
registry. Each layer needs these 1 or 2 HTTP requests, but they can
all be done in parallel.
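
A sketch of that index read, assuming the registry (or the CDN it
redirects to) honors suffix `Range` requests; the digest shown is just
a placeholder, and in practice it comes from the image manifest.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchTail reads the last n bytes of a layer blob with a single HTTP
// suffix Range request ("bytes=-n"), which is enough to locate and read
// an index stored at the end of the layer.
func fetchTail(registry, repo, digest string, n int64) ([]byte, error) {
	url := fmt.Sprintf("https://%s/v2/%s/blobs/%s", registry, repo, digest)
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=-%d", n))
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	// 206 Partial Content means the server honored the Range request.
	if res.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range request: %s", res.Status)
	}
	return io.ReadAll(res.Body)
}

func main() {
	// Placeholder digest for illustration.
	tail, err := fetchTail("gcr.io", "your-proj/container", "sha256:deadbeef", 1<<20)
	if err != nil {
		panic(err)
	}
	fmt.Println("fetched", len(tail), "bytes from the end of the layer")
}
```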

From that, we keep the index in memory, so `readdir`, `stat`, and
friends are all served from memory. For reading data, the index
contains the offset of each file's `GZIP(TAR(file data))` range within
the overall *stargz* file. To make it possible to efficiently read a
small amount of data from large files, there can actually be multiple
**stargz** index entries for a large file (e.g. a new gzip stream
every 16MB of a large file).
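
A rough sketch of what that in-memory lookup could look like. The
field names and chunking details are assumptions for illustration, not
the actual CRFS structures: given a read at some file offset, find the
gzip member holding that chunk, which can then be fetched with a
single Range request.

```go
package main

import (
	"fmt"
	"sort"
)

// chunk describes one GZIP(TAR(...)) member of a large file: where that
// member starts in the stargz blob and which byte range of the file it holds.
type chunk struct {
	GzipOffset int64 // offset of this chunk's gzip member within the stargz blob
	FileOffset int64 // offset of this chunk's data within the file
	ChunkSize  int64
}

// fileIndex is the in-memory index for one file, kept sorted by FileOffset.
type fileIndex struct {
	Name   string
	Chunks []chunk
}

// chunkFor returns the chunk containing byte off of the file, i.e. which
// gzip member a Range request must fetch to serve a read at that offset.
func (f *fileIndex) chunkFor(off int64) (chunk, bool) {
	i := sort.Search(len(f.Chunks), func(i int) bool {
		return f.Chunks[i].FileOffset+f.Chunks[i].ChunkSize > off
	})
	if i == len(f.Chunks) || off < f.Chunks[i].FileOffset {
		return chunk{}, false
	}
	return f.Chunks[i], true
}

func main() {
	// A 48 MB file split into 16 MB chunks, each its own gzip member.
	f := &fileIndex{
		Name: "usr/share/big.bin",
		Chunks: []chunk{
			{GzipOffset: 1_000, FileOffset: 0, ChunkSize: 16 << 20},
			{GzipOffset: 9_000_000, FileOffset: 16 << 20, ChunkSize: 16 << 20},
			{GzipOffset: 18_000_000, FileOffset: 32 << 20, ChunkSize: 16 << 20},
		},
	}
	c, _ := f.chunkFor(20 << 20) // a read at file offset 20 MB...
	// ...needs only the second gzip member, fetched by HTTP Range request.
	fmt.Printf("fetch blob bytes starting at offset %d\n", c.GzipOffset)
}
```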

## Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only
*stargz* layers, but it will stop short of trying to unify a writable
filesystem layer atop. For that, you can just use the traditional
Linux filesystems.
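
For illustration, a tiny Go sketch of the usual overlay-style lookup
rule that such a unification implies (the general rule, not
necessarily CRFS's exact implementation): the top-most layer
containing a path wins, and an OCI-style `.wh.` whiteout entry hides
the path in all lower layers.

```go
package main

import (
	"fmt"
	"path"
)

// layer maps cleaned file paths to contents; a real layer would be backed
// by a stargz index rather than a map. Layers are ordered top-most first.
type layer map[string][]byte

// lookup resolves a path across read-only layers the way overlay-style
// filesystems do: the top-most layer that has the file wins, and a
// ".wh." whiteout entry hides the file in all lower layers.
func lookup(layers []layer, name string) ([]byte, bool) {
	whiteout := path.Join(path.Dir(name), ".wh."+path.Base(name))
	for _, l := range layers {
		if data, ok := l[name]; ok {
			return data, true
		}
		if _, ok := l[whiteout]; ok {
			return nil, false // deleted in this layer; stop searching
		}
	}
	return nil, false
}

func main() {
	layers := []layer{
		{"etc/app.conf": []byte("new config\n")}, // upper layer
		{"etc/app.conf": []byte("old config\n")}, // lower layer
	}
	data, ok := lookup(layers, "etc/app.conf")
	fmt.Println(ok, string(data)) // true "new config"
}
```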

## Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole
scheme natively, but in the meantime a workaround is that when
converting an image into *stargz* format, the converter tool can also
produce an image variant that only has metadata (environment,
entrypoints, etc.) and no file contents. Then you can bind mount in the
contents from the CRFS FUSE filesystem.

That is, the conversion tool can do:

**Input**: `gcr.io/your-proj/container:v2`

**Output**: `gcr.io/your-proj/container:v2meta` + `gcr.io/your-proj/container:v2stargz`

What you actually run on Docker or Kubernetes then is the `v2meta`
version, so your container host's `docker pull` or equivalent only
pulls a few KB. The gigabytes of remaining data are read lazily via
CRFS from the `v2stargz` layer directly from the container registry.

## Status

WIP. Enough parts are implemented & tested for me to realize this
isn't crazy. I'm publishing this document first for discussion while I
finish things up. Maybe somebody will point me to an existing
implementation, which would be great.

## Discussion

See https://github.com/golang/go/issues/30829