
Commit 2deca50

crfs: start of a README / design doc of sorts
Updates golang/go#30829
Change-Id: I8790dfcd30e3fb4d68b6e4cb9f8baf44c45d2cd6
Reviewed-on: https://go-review.googlesource.com/c/build/+/167392
Reviewed-by: Brad Fitzpatrick <[email protected]>


crfs/README.md

# CRFS: Container Registry Filesystem

Discussion: https://github.com/golang/go/issues/30829

## Overview
**CRFS** is a read-only FUSE filesystem that lets you mount a
container image, served directly from a container registry (such as
[gcr.io](https://gcr.io/)), without pulling it all locally first.

## Background
Starting a container should be fast. Currently, however, starting a
container in many environments requires doing a `pull` operation from
a container registry to read the entire container image from the
registry and write the entire container image to the local machine's
disk. It's pretty silly (and wasteful) that a read operation becomes a
write operation. For small containers, this problem is rarely noticed.
For larger containers, though, the pull operation quickly becomes the
slowest part of launching a container, especially on a cold node.
Contrast this with launching a VM on major cloud providers: even with
a VM image that's hundreds of gigabytes, the VM boots in seconds.
That's because the hypervisors' block devices are reading from the
network on demand. The cloud providers all have great internal
networks. Why aren't we using those great internal networks to read
our container images on demand?

## Why does Go want this?
Go's continuous build system tests Go on [many operating systems and
architectures](https://build.golang.org/), using a mix of containers
(mostly for Linux) and VMs (for other operating systems). We
prioritize fast builds, targeting a 5-minute turnaround for pre-submit
tests when testing new changes. For isolation and other reasons, we
run all our containers in single-use fresh VMs. Generally our
containers do start quickly, but some of our containers are very large
and take a long time to start. To work around that, we've automated
the creation of VM images where our heavy containers are pre-pulled.
This is all a silly workaround. It'd be much better if we could just
read the bytes over the network from the right place, without all
the hoops.

## Tar files
One reason that reading the bytes directly from the source on demand
is somewhat non-trivial is that container images are, somewhat
regrettably, represented by *tar.gz* files, and tar files are
unindexed, and gzip streams are not seekable. This means that trying
to read 1KB out of a file named `/var/lib/foo/data` still involves
pulling hundreds of gigabytes, uncompressing the stream, and decoding
the entire tar file until you find the entry you're looking for. You
can't look it up by its path name.

## Introducing Stargz
Fortunately, we can fix the fact that *tar.gz* files are unindexed and
unseekable, while still keeping the file a valid *tar.gz* file, by
taking advantage of a property of both tar files and gzip compression:
you can concatenate tar files together to make a valid tar file, and
you can concatenate multiple gzip streams together and have a valid
gzip stream.

We introduce a format, **Stargz**, a **S**eekable
**tar.gz** format that's still a valid tar.gz file for everything else
that's unaware of these details.

In summary:

* The traditional `*.tar.gz` format is: `GZIP(TAR(file1 + file2 + file3))`
* Stargz's format is: `GZIP(TAR(file1)) + GZIP(TAR(file2)) + GZIP(TAR(file3_chunk1)) + GZIP(TAR(file3_chunk2)) + GZIP(TAR(index of earlier files in magic file))`, where the trailing ZIP-like index contains offsets for each file/chunk's GZIP header in the overall **stargz** file.

This makes images a few percent larger (due to more gzip headers and
loss of compression context between files), but it's plenty
acceptable.
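
To see why the concatenation trick works, here's a minimal,
self-contained Go sketch (illustrative only, not CRFS's actual
writer) that builds `GZIP(TAR(file1)) + GZIP(TAR(file2))` and reads
the result back as a single ordinary *tar.gz*. The key detail is that
only the last per-file tar stream carries tar's end-of-archive
marker:

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// addFile appends one GZIP(TAR(file)) stream to w. Only the final
// file's tar writer is Closed (emitting tar's end-of-archive marker);
// earlier ones are merely Flushed, so the concatenated fragments form
// one valid tar stream.
func addFile(w io.Writer, name, data string, last bool) error {
	zw := gzip.NewWriter(w)
	tw := tar.NewWriter(zw)
	hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(data))}
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	if _, err := io.WriteString(tw, data); err != nil {
		return err
	}
	if last {
		if err := tw.Close(); err != nil {
			return err
		}
	} else if err := tw.Flush(); err != nil { // pad, but no end-of-archive marker
		return err
	}
	return zw.Close() // end this gzip stream; the next file starts a new one
}

func main() {
	var buf bytes.Buffer
	if err := addFile(&buf, "file1", "hello", false); err != nil {
		log.Fatal(err)
	}
	if err := addFile(&buf, "file2", "world", true); err != nil {
		log.Fatal(err)
	}

	// A stock gzip reader transparently reads concatenated gzip
	// streams, so the result is still a plain, valid tar.gz.
	zr, err := gzip.NewReader(&buf)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(hdr.Name) // prints file1, then file2
	}
}
```

This sketch omits the trailing index; a real *stargz* writer would
also record where each per-file gzip stream starts.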
## Converting images

If you're using `docker push` to push to a registry, you can't use
CRFS to mount the image. Maybe one day `docker push` will push
*stargz* files (or something with similar properties) by default, but
not yet. So for now we need to convert the storage image layers from
*tar.gz* into *stargz*. There is a tool that does that. **TODO: examples**
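
Pending those examples, here's a rough, hypothetical Go sketch of the
core transformation such a tool performs, re-encoding a *tar.gz*
stream with one gzip stream per tar entry; chunking of large files,
the trailing index, and the final end-of-archive marker are all
omitted:

```go
import (
	"archive/tar"
	"compress/gzip"
	"io"
)

// convert re-encodes a tar.gz stream as a stargz-style stream:
// one gzip member per tar entry. Chunking of large files and the
// trailing index are omitted from this sketch.
func convert(in io.Reader, out io.Writer) error {
	zr, err := gzip.NewReader(in)
	if err != nil {
		return err
	}
	tr := tar.NewReader(zr)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // a real converter would now append the index
		}
		if err != nil {
			return err
		}
		zw := gzip.NewWriter(out)
		tw := tar.NewWriter(zw)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, tr); err != nil {
			return err
		}
		if err := tw.Flush(); err != nil { // no end-of-archive marker between entries
			return err
		}
		if err := zw.Close(); err != nil {
			return err
		}
	}
	return nil
}
```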
## Operation

When mounting an image, the FUSE filesystem makes a couple of Docker
Registry HTTP API requests to the container registry to get the
metadata for the container and all its layers.

It then does HTTP Range requests to read just the **stargz** index out
of the end of each of the layers. The index is stored similarly to how
the ZIP format's TOC is stored, with a pointer to the index at the
very end of the file. Generally it takes 1 HTTP request to read the
index, but no more than 2. In any case, we're assuming a fast network
(GCE VMs to gcr.io, or similar) with low latency to the container
registry. Each layer needs these 1 or 2 HTTP requests, but they can
all be done in parallel.
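
For illustration, a ranged read of a layer's tail could look like the
following Go sketch (the layer URL, footer size, and registry auth
are hand-waved assumptions here):

```go
import (
	"fmt"
	"io"
	"net/http"
)

// fetchTail fetches the final n bytes of the layer blob at url with
// an HTTP Range request, which is enough to locate and read the
// stargz index without downloading the whole layer. Registry auth is
// omitted from this sketch.
func fetchTail(url string, n int64) ([]byte, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=-%d", n)) // suffix range: last n bytes
	res, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	if res.StatusCode != http.StatusPartialContent {
		return nil, fmt.Errorf("range request: unexpected status %v", res.Status)
	}
	return io.ReadAll(res.Body)
}
```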
Once read, the index is kept in memory, so `readdir`, `stat`, and
friends are all served from memory. For reading data, the index
contains the offset of each file's `GZIP(TAR(file data))` range within
the overall *stargz* file. To make it possible to efficiently read a
small amount of data from large files, there can actually be multiple
**stargz** index entries for large files (e.g. a new gzip stream every
16MB of a large file).
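
The field names below are illustrative, not the actual stargz TOC
schema, but an in-memory index along these lines serves both the
metadata calls and chunked reads:

```go
// indexEntry is an illustrative shape for one in-memory index record;
// large files get one entry per chunk (e.g. per 16MB gzip stream).
type indexEntry struct {
	Name        string // path within the layer, e.g. "var/lib/foo/data"
	Offset      int64  // where this chunk's gzip stream begins in the stargz blob
	ChunkOffset int64  // this chunk's starting offset within the file itself
	ChunkSize   int64  // bytes of file data in this chunk
}

// chunkFor picks the entry covering byte off of the named file, so a
// small read of a huge file fetches only one gzip stream.
func chunkFor(entries map[string][]indexEntry, name string, off int64) *indexEntry {
	for i, e := range entries[name] {
		if off >= e.ChunkOffset && off < e.ChunkOffset+e.ChunkSize {
			return &entries[name][i]
		}
	}
	return nil
}
```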
## Union/overlay filesystems

CRFS can do the aufs/overlay2-ish unification of multiple read-only
*stargz* layers, but it will stop short of trying to unify a writable
filesystem layer atop. For that, you can just use the traditional
Linux filesystems.
## Using with Docker, without modifying Docker

Ideally container runtimes would support something like this whole
scheme natively, but in the meantime a workaround is that when
converting an image into *stargz* format, the converter tool can also
produce an image variant that only has metadata (environment,
entrypoints, etc) and no file contents. Then you can bind mount in the
contents from the CRFS FUSE filesystem.

That is, the convert tool can do:

**Input**: `gcr.io/your-proj/container:v2`

**Output**: `gcr.io/your-proj/container:v2meta` + `gcr.io/your-proj/container:v2stargz`

What you actually run on Docker or Kubernetes then is the `v2meta`
version, so your container host's `docker pull` or equivalent only
pulls a few KB. The gigabytes of remaining data are read lazily via
CRFS from the `v2stargz` layer directly from the container registry.
## Status

WIP. Enough parts are implemented & tested for me to realize this
isn't crazy. I'm publishing this document first for discussion while I
finish things up. Maybe somebody will point me to an existing
implementation, which would be great.

## Discussion

See https://github.com/golang/go/issues/30829
