Wednesday, May 20, 2015

Containers are hot. Everyone loves them. Developers love the ease of creating a "bundle" of something that users can consume; DevOps and information-technology departments love the ease of management and deployment. To a large degree, containers entered the spotlight when Docker changed the application-development industry on the server side in a way that resembles how the iPhone changed the client application landscape.
The word "container" is not just used for applications, though; it is also used to describe a technology that can run a piece of software in an isolated way. Such containers are about using control groups to manage resources and kernel namespaces to limit the visibility and reach of your container app. For the typical LWN reader, this is likely what one thinks about when encountering the word "container."

Many people who advocate for containers start by saying that virtual machines are expensive and slow to start, and that containers provide a more efficient alternative. The usual counterpoint is about how secure kernel containers really are against adversarial users with an arsenal of exploits in their pockets. Reasonable people can argue for hours on this topic, but the reality is that quite a few potential users of containers see this as a showstopper. There are many efforts underway to improve the security of containers and namespaces in both open-source projects and startup companies.

We (the Intel Clear Containers group) are taking a little bit of a different tack on the security of containers by going back to the basic question: how expensive is virtual-machine technology, really? Performance in this regard is primarily measured using two metrics: startup time and memory overhead. The first is about how quickly your data center can respond to an incoming request (say a user logs into your email system); the second is about how many containers you can pack on a single server.

We set out to build a system (which we call "Clear Containers") where one can use the isolation of virtual-machine technology along with the deployment benefits of containers. As part of this, we let go of the "machine" notion traditionally associated with virtual machines; we're not going to pretend to be a standard PC that is compatible with just about any OS on the planet.

To provide a preview of the results: we can launch such a secured container that uses virtualization technology in under 150 milliseconds, and the per-container memory overhead is roughly 18 to 20MB (this means you can run over 3500 of these on a server with 128GB of RAM). While this is not quite as fast as the fastest Docker startup using kernel namespaces, for many applications this is likely going to be good enough. And we aren't finished optimizing yet.
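As a rough sanity check on those density figures, the arithmetic can be sketched as follows. This assumes binary units (MiB and GiB) and treats the per-container overhead as the only memory cost, which a real host would not satisfy; the function name is illustrative.

```python
# Back-of-the-envelope density estimate from the figures quoted above.
# Assumption: "MB" and "GB" are read as binary units (MiB, GiB).
MIB = 1024 ** 2
GIB = 1024 ** 3


def max_containers(host_ram_bytes: int, per_container_bytes: int) -> int:
    """Upper bound on container count if overhead were the only memory cost."""
    return host_ram_bytes // per_container_bytes


host_ram = 128 * GIB
for overhead_mib in (18, 20):
    bound = max_containers(host_ram, overhead_mib * MIB)
    print(f"{overhead_mib} MiB overhead -> at most {bound} containers")
```

The quoted "over 3500" sits well below this upper bound, which presumably leaves room for the host itself and for the memory the applications inside each container actually use.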

So how did we do this?

Hypervisor

With KVM as the hypervisor of choice, we looked at the QEMU layer. QEMU is great for running Windows or legacy Linux guests, but that flexibility comes at a hefty price. Not only does all of the emulation consume memory, it also requires some form of low-level firmware in the guest. All of this adds quite a bit to virtual-machine startup times (500 to 700 milliseconds is not unusual).

However, we have the kvmtool mini-hypervisor at our disposal (LWN has covered kvmtool in the past). With kvmtool, we no longer need a BIOS or UEFI; instead we can jump directly into the Linux kernel. Kvmtool is not cost-free, of course; starting kvmtool and creating the CPU contexts takes approximately 30 milliseconds. We have enhanced kvmtool to support execute-in-place on the kernel to avoid having to decompress the kernel image; we just mmap() the vmlinux file and jump into it, saving both memory and time.

Kernel

A Linux kernel boots pretty fast. On a real machine, most of the boot time in the kernel is spent initializing some piece of hardware. However, in a virtual machine, none of these hardware delays are there—it's all fake, after all—and, in practice, one uses only the virtio class of devices that are pretty much free to set up. We had to optimize away a few early-boot CPU initialization delays; but otherwise, booting a kernel in a virtual-machine context takes about 32 milliseconds, with a lot of room left for optimization.

We also had to fix several bugs in the kernel. Some fixes are upstream already and others will go upstream in the coming weeks.

User space

In 2008 we talked about the 5-second boot at the Plumbers Conference, and, since then, many things have changed—with systemd being at the top of the list. Systemd makes it trivial to create a user space environment that boots quickly. I would love to write a long essay here about how we had to optimize user space, but the reality is—with some minor tweaks and just putting the OS together properly—user space boots pretty quickly (less than 75 milliseconds) already. (When recording bootcharts at high resolution sampling, it's a little more, but that's all measurement overhead.)
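To see how the stage timings quoted so far relate to the 150-millisecond figure, here is a small tally. It assumes the three stages run strictly one after another and uses the user-space number as an upper bound; the dictionary keys are descriptive labels, not names from any tool.

```python
# Startup budget assembled from the timings quoted in the text.
# Assumption: the stages are sequential and nothing else contributes.
BOOT_STAGES_MS = {
    "kvmtool startup and CPU contexts": 30,  # "approximately 30 milliseconds"
    "kernel boot in the guest": 32,          # "about 32 milliseconds"
    "user space (upper bound)": 75,          # "less than 75 milliseconds"
}
TARGET_MS = 150  # "under 150 milliseconds"


def total_boot_ms(stages: dict) -> int:
    """Sum the per-stage timings in milliseconds."""
    return sum(stages.values())


total = total_boot_ms(BOOT_STAGES_MS)
print(f"total: {total} ms, headroom: {TARGET_MS - total} ms")
```

Under these assumptions the stages add up to 137 milliseconds, which is consistent with the "under 150 milliseconds" launch time.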

Memory consumption

A key feature to help with memory consumption is DAX, which the 4.0 kernel now supports in the ext4 filesystem. If your storage is visible as regular memory to the host CPU, DAX enables the system to do execute-in-place of files stored there. In other words, when using DAX, you bypass the page cache and virtual-memory subsystem completely. For applications that use mmap(), this means a true zero-copy approach, and for code that uses the read() system call (or equivalent) you will have only one copy of the data. DAX was originally designed for fast flash-like storage that shows up as memory to the CPU; but in a virtual-machine environment, this type of storage is easy to emulate. All we need to do on the host is map the disk image file into the guest's physical memory, and use a small device driver in the guest kernel that exposes this memory region to the kernel as a DAX-ready block device.

What this DAX solution provides is a zero-copy, no-memory-cost solution for getting all operating-system code and data into the guest's user space. In addition, when the MAP_PRIVATE flag is used in the hypervisor, the storage becomes copy-on-write for free; writes in the guest to the filesystem are not persistent, so they will go away when the guest container terminates. This MAP_PRIVATE solution makes it trivial to share the same disk image between all the containers, and also means that even if one container is compromised and mucks with the operating-system image, these changes do not persist in future containers.
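The copy-on-write behavior of a private mapping is easy to observe from user space. The sketch below is not the hypervisor code; it uses a throwaway temporary file as a stand-in for the disk image and Python's mmap.ACCESS_COPY, which on Unix corresponds to a MAP_PRIVATE mapping. The function name is made up for the illustration.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_untouched() -> bool:
    """Write through a private mapping and confirm the file does not change."""
    original = b"pristine operating-system image\n"
    fd, path = tempfile.mkstemp()  # stand-in for a shared disk image
    try:
        with os.fdopen(fd, "r+b") as image:
            image.write(original)
            image.flush()
            # ACCESS_COPY requests a private, copy-on-write mapping
            # (MAP_PRIVATE on Unix): writes land in private pages only.
            with mmap.mmap(image.fileno(), 0, access=mmap.ACCESS_COPY) as view:
                view[:8] = b"tampered"
                changed_in_memory = view[:8] == b"tampered"
            image.seek(0)
            unchanged_on_disk = image.read() == original
        return changed_in_memory and unchanged_on_disk
    finally:
        os.unlink(path)


print(private_write_leaves_file_untouched())
```

The write is visible through the mapping but never reaches the file, which is the property that lets many containers share one image safely.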

A second key feature to reduce memory cost is kernel same-page merging (KSM) on the host. KSM is a way to deduplicate memory within and between processes and KVM guests.
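KSM only scans memory that a process has explicitly marked as mergeable with madvise(MADV_MERGEABLE). As a minimal sketch of that opt-in step (not the hypervisor's own code), the following fills an anonymous region with identical pages and asks the kernel to consider it for merging. It assumes Linux and Python 3.8 or later, and returns False wherever the hint is unavailable; the function name is illustrative. Whether pages are actually merged also depends on KSM being enabled on the host through /sys/kernel/mm/ksm/run.

```python
import mmap

PAGE = mmap.PAGESIZE


def request_ksm_merge(num_pages: int = 16) -> bool:
    """Mark a region of identical pages as mergeable; True if the hint was accepted."""
    region = mmap.mmap(-1, num_pages * PAGE)  # anonymous, page-aligned memory
    try:
        region.write(b"\x42" * (num_pages * PAGE))  # every page has the same content
        advice = getattr(mmap, "MADV_MERGEABLE", None)
        if advice is None or not hasattr(region, "madvise"):
            return False  # platform or Python version without this hint
        try:
            region.madvise(advice)  # applies to the whole mapping by default
        except OSError:
            return False  # e.g. a kernel built without KSM support
        return True
    finally:
        region.close()


print(request_ksm_merge())
```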

Finally, we optimized our core user space for minimal memory consumption. This mostly consists of calling the glibc malloc_trim() function at the end of the initialization of resident daemons, causing them to give back to the kernel any malloc() buffers that glibc held onto. Glibc by default implements a type of hysteresis where it holds on to some amount of freed memory as an optimization in case memory is needed again soon.
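For illustration, the call itself can be exercised from Python through ctypes. This is only a sketch of invoking malloc_trim(), not the change made to the daemons, and it assumes a glibc-based system; on other C libraries the helper (a hypothetical name) returns None. Python's own allocator manages most small objects separately, so the effect in this process is likely to be small.

```python
import ctypes
import ctypes.util


def trim_heap():
    """Ask glibc to return free heap memory to the kernel.

    Returns malloc_trim()'s result (1 if memory was released, 0 if not),
    or None when the function is not available on this platform.
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        libc = ctypes.CDLL(libc_name)
        malloc_trim = libc.malloc_trim  # glibc-specific symbol
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return malloc_trim(0)  # pad of 0: keep no extra slack at the top of the heap


print(trim_heap())
```

A long-running daemon would make this call once, after its initialization has finished and its startup allocations have been freed.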

Next steps

We have this working as a proof of concept with rkt (implementing the appc spec that LWN wrote about recently). Once this work is a bit more mature, we will investigate adding support into Docker as well. More information on how to get started and get code can be found at, which we will update as we make progress with our integration and optimization efforts.

from lizard's ghost
