In 2016 when we released Docker for Mac (now Docker Desktop), we had a problem. The disk space used by Docker for Mac grew and grew, up to around 64GiB. At the time this was a huge amount of disk space, especially given the limited disk space on typical Mac laptops. Worse still, the disk space used never shrank. No matter how many containers and images were deleted, the disk space used increased relentlessly. Read on to understand how we got here and how we fixed it!
Docker for Mac runs Linux containers on the Mac. However Linux containers need the Linux syscall interface, not the macOS syscall interface. Therefore Docker for Mac includes a small, hidden Linux virtual machine. The virtualized Linux kernel provides the syscall interface for containers, while Docker for Mac provides the virtual disk, virtual network, and other services underneath Linux.
First let's investigate what happens when the developer modifies files inside a container. The following command creates a container, creates a file inside it, and then removes the whole container (using --rm):
% docker run --rm alpine touch /foo
To discover more about where the files are created, we can start another container with an interactive shell:
% docker run --rm -it alpine sh
/ # mount
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/112/fs:/var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/1/fs,upperdir=/var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/113/fs,workdir=/var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/113/work)
...
According to mount, the root filesystem of the container is an overlay filesystem. An overlay allows filesystems to be layered, so that writes go to the "upperdir", while reads fall through to the "lowerdir". overlay is great for containers because a base image (in this case alpine) can be shared -- read only -- as the "lowerdir", while each container gets to make its own changes via a custom writable layer on top. Without overlay, every container start would have to physically copy the base image, which would be slow and consume a lot of disk space unnecessarily.

In this example, the container's writes go to the "upperdir", which is /var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/113/fs.
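To make the layering concrete, here is a minimal sketch in C of how such an overlay mount is assembled with the Linux mount(2) call. The /tmp paths are made up for illustration; containerd manages its own snapshot directories:

/* Sketch: assemble an overlay mount like the one above.
 * The four directories must already exist; run as root. */
#include <stdio.h>
#include <sys/mount.h>

int main(void) {
    const char *opts = "lowerdir=/tmp/lower,"   /* read-only base layer */
                       "upperdir=/tmp/upper,"   /* writable layer on top */
                       "workdir=/tmp/work";     /* scratch dir overlayfs requires */
    if (mount("overlay", "/tmp/merged", "overlay", 0, opts) != 0) {
        perror("mount");
        return 1;
    }
    return 0;
}

After this, writes under /tmp/merged land in /tmp/upper while /tmp/lower stays read-only -- the same trick the snapshotter plays for every container.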
What type of filesystem is that? To find the answer we need to break into the layer beneath the container, and see what the Docker engine itself sees, by running:
% docker run --rm --privileged --pid=host -it justincormack/nsenter1
This command uses the "host" pid namespace (so it can see all processes, not only the processes inside the container) and uses nsenter to enter all the namespaces of pid 1, the root process in the VM. Now we can dig further:
sh-5.2# df /var/lib/desktop-containerd/daemon/io.containerd.snapshotter.v1.overlayfs/snapshots/113/fs
Filesystem      1K-blocks     Used Available Use% Mounted on
/dev/sdd       1055762868 10479984 991579412   2% /var/lib
sh-5.2# mount | grep "/var/lib "
/dev/vda on /var/lib type ext4 (rw,relatime)
We can see that the container's writes are written to an ext4 filesystem on a disk called /dev/vda, where /dev/vda is a virtual disk provided by Docker Desktop.
Now we are getting close to seeing the problem: when a user creates and then removes a file, the result is a series of virtual disk reads and writes inside Docker Desktop. The Mac never sees the Linux file creations or deletions; it only sees reads and writes of disk blocks. How are these reads and writes implemented?
The simplest way to implement a virtual disk of size 64GiB (the Docker for Mac default) is to create a single 64GiB file on the Mac, where each block is stored sequentially. Today this is actually sensible, because modern filesystems like APFS support "sparse files" (or "thin provisioning") which only allocate real storage when it is needed. However back in 2016 the most common filesystem on the Mac was HFS+, which lacks this feature. If we had used a raw file on HFS+ then Docker would always have consumed 64GiB on the host, no matter how much data was actually stored in the containers! Obviously we didn't do that, but what did we do instead?
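As a rough illustration (not Docker's actual code), a program can give a file a huge logical size with ftruncate and then compare logical and physical sizes. On a filesystem with sparse-file support such as APFS, the physical size stays near zero until blocks are written:

/* Sketch: a 64GiB "virtual disk" file that starts physically empty,
 * assuming the filesystem supports sparse files. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void) {
    off_t size = 64LL * 1024 * 1024 * 1024;   /* 64 GiB logical size */
    int fd = open("disk.raw", O_CREAT | O_RDWR, 0644);
    if (fd < 0 || ftruncate(fd, size) != 0) { perror("disk.raw"); return 1; }

    struct stat st;
    fstat(fd, &st);
    printf("logical:  %lld bytes\n", (long long)st.st_size);
    printf("physical: %lld bytes\n", (long long)st.st_blocks * 512);  /* st_blocks counts 512-byte units */
    close(fd);
    return 0;
}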
To simulate sparse files on HFS+, we used a disk format called qcow2, from the QEMU project. Docker for Mac used lots of libraries from the Mirage unikernel project, so we used their OCaml qcow2 implementation and linked it directly into hyperkit, the code which implements the virtual disk and network devices. Now when Docker for Mac started up, a tiny empty qcow2 file was created (and not a 64GiB monster!). Much better! However as time went on, the file slowly grew. The file on the Mac never shrank, even if containers were deleted. Docker for Mac was only seeing reads and writes; it could not see "file deletes".
It turns out that modern SSD disks have a similar problem. Inside SSDs there are blocks of flash memory, much larger than a typical filesystem block. When a filesystem writes to a block which already contains data, the SSD must read the current block, apply the update, and write the whole block back. If the OS doesn't actually care about the other data in the block then the SSD has to do a lot of extra work for nothing. The TRIM command is a way for the OS to tell the SSD that a portion of the data is no longer needed. When a file is deleted, the OS will TRIM the disk region where the data was stored. The SSD can then mark these areas as free, and know that it can ignore or reuse them in the future.
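To give a flavour of what a discard looks like at the API level, here is a rough sketch using Linux's BLKDISCARD ioctl on a block device. The device path and byte range are arbitrary examples; in practice a filesystem issues the equivalent itself when files are deleted (for instance via ext4's discard mount option or a periodic fstrim):

/* Sketch: tell the device that a byte range no longer holds live data. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKDISCARD */

int main(void) {
    int fd = open("/dev/vda", O_WRONLY);
    if (fd < 0) { perror("open"); return 1; }
    uint64_t range[2] = { 1 << 20, 4 << 20 };  /* { offset, length } in bytes */
    if (ioctl(fd, BLKDISCARD, range) != 0) { perror("BLKDISCARD"); return 1; }
    close(fd);
    return 0;
}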
We added TRIM support to hyperkit but we still needed to actually do something when the TRIM requests were received.
The qcow2 format makes it fairly easy to mark a block as free, but this doesn't fix the bug by itself: the file on the Mac stays the same size, because the free space inside the qcow2 file is invisible to HFS+. To shrink the file we needed to move blocks from the end of the file into the holes created by TRIM. Our initial implementation was a simple defrag which ran at startup, but this meant that the file would only shrink when Docker Desktop was restarted. To fully fix the bug we needed an online defrag which could run while Docker Desktop was running.
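The reason marking a block free is cheap comes from qcow2's two-level mapping: a guest offset is resolved through an L1 table and then an L2 table, and a zero L2 entry simply means "unallocated". A sketch of the index arithmetic, assuming the default 64KiB cluster size:

/* Sketch of qcow2's two-level lookup. Freeing a cluster amounts to
 * zeroing its 8-byte L2 entry; unallocated clusters read as zeros
 * and consume no file space. */
#include <stdint.h>

#define CLUSTER_BITS 16
#define CLUSTER_SIZE (1u << CLUSTER_BITS)   /* 64 KiB */
#define L2_ENTRIES   (CLUSTER_SIZE / 8)     /* 8-byte entries per L2 table */

static void qcow2_index(uint64_t guest_off, uint64_t *l1, uint64_t *l2) {
    uint64_t cluster = guest_off >> CLUSTER_BITS;
    *l1 = cluster / L2_ENTRIES;   /* which L2 table */
    *l2 = cluster % L2_ENTRIES;   /* which entry within that table */
}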
We implemented an online defrag which ran continuously in the background, keeping track of which regions of the file were free and opportunistically copying blocks from later in the file to fill the holes. After moving the last block in the file it could ftruncate the file to the new size. However correctness is hard to achieve. For example, the defrag had to be careful to abandon any copies should the originals be modified concurrently. Furthermore it had to be safe even if the host crashed during a move; consider what would happen if a block was copied and then a pointer updated to reference it. At one point there would be two write requests in the hardware disk controller's queue, and for performance reasons these writes could be reordered. In the worst case, if the power failed between the two writes, we could discover that the pointer change was committed but the block copy was lost, leaving the pointer pointing at garbage. The defrag was careful to use primitives like F_FULLFSYNC to ensure that the writes were durable, but without unacceptably reducing performance.
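In outline, each move followed a strict copy-then-sync-then-link discipline. The sketch below is simplified C with a hypothetical pointer layout -- the real implementation was OCaml inside hyperkit -- but the ordering is the point:

/* Sketch of one crash-safe block move: make the copy durable before
 * any pointer references it. ptr_off and new_entry stand in for the
 * real qcow2 table entry being rewritten. */
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int move_block(int fd, off_t src, off_t dst, off_t ptr_off,
               uint64_t new_entry, char *buf, size_t block_size) {
    if (pread(fd, buf, block_size, src) != (ssize_t)block_size) return -1;
    if (pwrite(fd, buf, block_size, dst) != (ssize_t)block_size) return -1;
    /* Barrier: on macOS fsync() may leave data in the drive's cache,
     * so F_FULLFSYNC forces it all the way to stable storage. */
    if (fcntl(fd, F_FULLFSYNC) == -1) return -1;
    /* Only now publish the new location. A crash before this write
     * leaves the old pointer intact; the copy is merely wasted work. */
    if (pwrite(fd, &new_entry, sizeof new_entry, ptr_off)
            != (ssize_t)sizeof new_entry) return -1;
    return fcntl(fd, F_FULLFSYNC) == -1 ? -1 : 0;
}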
The result looked somewhat like this:
[Animation: the online defrag in action, with blocks marked Free, Allocated, and Moving, and a marker for the Highest Written Block.]