Linux Kernel Summit: Containers Update
Containers are a significant new development activity in the Linux kernel today, and the work has drawn together a wide variety of independent implementations (currently out of kernel) working towards a common goal. Representing the broader community for the Kernel Summit update were Paul Menage of Google and Eric Biederman, known previously for his work on kexec and kdump, amongst other things.
The discussion started with some highlights about containers, including the widely agreed-upon concept of resource namespaces. Namespaces provide a means of grouping various resources, such as process (task) IDs, IPC objects, network connections, etc., into discrete "containers" which can be isolated from other namespaces. Ideally, applications running in a container will see only the tasks and activities related to that application in their view of the world. These individual namespaces can also have their resources controlled independently, and one container's resource usage should not impact the resource usage of another container.
Today, the UTS (utsname) and System V IPC namespace isolation mechanisms are in mainline. The pid namespace is being tested in the -mm tree and is one of the more complex namespaces to isolate. The networking namespace work has complete prototypes, but there are still discussions in progress on how this might go forward in the Linux networking space. Some have suggested that the current layer 2 and layer 3 isolation mechanisms are overkill and that netlink offers a mechanism for doing much of this today. The jury is still out on the details of how this will make it to mainline, but hopefully these will be addressed over the next couple of months. Other resources which still need work include time (virtual time, primarily, I believe) and devices, such as ptys.
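To make the isolation idea above a little more concrete, here is a toy model in Python. It is not kernel code and does not touch any real namespace interface; the class name PidNamespace, the container names, and the command names are all made up for illustration. It only shows the property described above: each namespace numbers its own tasks and sees nothing from its neighbours.

```python
from dataclasses import dataclass, field


@dataclass
class PidNamespace:
    """Toy stand-in for a pid namespace (hypothetical, not a kernel structure)."""
    name: str
    tasks: dict = field(default_factory=dict)  # local pid -> command name
    next_pid: int = 1

    def spawn(self, command: str) -> int:
        """Register a task and return its pid as seen inside this namespace."""
        pid = self.next_pid
        self.tasks[pid] = command
        self.next_pid += 1
        return pid

    def visible_tasks(self) -> list:
        """The ps-like view a process inside this namespace would get."""
        return sorted(self.tasks.items())


# Two independent "containers"; both start numbering their tasks at pid 1.
web = PidNamespace("web")
db = PidNamespace("db")
web.spawn("init")
web.spawn("httpd")
db.spawn("init")
db.spawn("postgres")

if __name__ == "__main__":
    print(web.visible_tasks())
    print(db.visible_tasks())
```

Both containers have a pid 1 and a pid 2, and neither view contains the other's tasks, which is roughly what an application inside a real pid namespace would expect to see.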
There is also an open question as to whether there are enough CLONE_ bits for the clone() system call, given all these new inheritance properties that need to be specified when a new process or task is created.
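A rough sketch of why the bit budget is a concern: the flag values below are the ones from the kernel's linux/sched.h, but the 32-bit flag word is my assumption for the purposes of the arithmetic, and the helper names are hypothetical.

```python
# Values as defined in the kernel's <linux/sched.h>.
CSIGNAL = 0x000000FF       # low byte: signal sent to the parent when the child exits
CLONE_NEWNS = 0x00020000   # new mount namespace
CLONE_NEWUTS = 0x04000000  # new UTS (utsname) namespace
CLONE_NEWIPC = 0x08000000  # new System V IPC namespace

FLAG_WORD_BITS = 32  # assumption: the flags are treated as a 32-bit word

NAMESPACE_FLAGS = {
    "CLONE_NEWNS": CLONE_NEWNS,
    "CLONE_NEWUTS": CLONE_NEWUTS,
    "CLONE_NEWIPC": CLONE_NEWIPC,
}


def bits_available_for_flags() -> int:
    """Bits left for CLONE_ flags once the exit-signal byte is set aside."""
    return FLAG_WORD_BITS - bin(CSIGNAL).count("1")


def describe(flags: int) -> list:
    """Names of the namespace flags set in a clone() flag word."""
    return [name for name, bit in NAMESPACE_FLAGS.items() if flags & bit]


if __name__ == "__main__":
    print(bits_available_for_flags())
    print(describe(CLONE_NEWUTS | CLONE_NEWIPC))
```

Every new namespace type that wants its own inheritance control consumes one of those remaining bits, alongside all the pre-existing CLONE_ flags.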
Paul talked a bit about the renaming of his current work to Control Groups, primarily because his use of the term "containers" overlapped with many of the terms used by other containers groups and both teams were getting confused. He covered CFS and the ability to apply CPU weights to arbitrary groups of processes; cpusets and some of the rework he has had to do (Andrew Morton volunteered Paul to be the cpusets maintainer, since it appears Paul Jackson has taken a sabbatical from Linux kernel development); the memory controller work that Balbir Singh is doing; and the problems with the task freezer for freezing and unfreezing tasks. He also mentioned an NSProxy as a way to tie namespaces to control groups, and talked about how aggregated limits and controls for swap, disk IO, dirty pages, and network restrictions could be done.
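As a back-of-the-envelope illustration of what CPU weights on groups mean, the sketch below computes proportional shares. This is my own simplification, assuming every group always has runnable work; the function name and the group names are hypothetical and this is not how the scheduler is implemented.

```python
def cpu_shares(weights: dict) -> dict:
    """Fraction of CPU time each group gets, in proportion to its weight.

    Assumes every group is continuously runnable; an idle group's share
    would in practice be redistributed among the busy ones.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive weight")
    return {group: weight / total for group, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical groups: interactive work is weighted 3x over batch work.
    print(cpu_shares({"batch": 1, "interactive": 3}))
```

The point is that the weight is attached to the group, so the split between groups stays the same no matter how many processes each group contains.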
Paul also addressed the common question: "Well, why not just use existing interfaces?" The usual candidates are setrlimit, which can only restrict tasks to simple numerical limits, has no generic support for aggregate limits, and is only settable on the current process; and the uid/gid/pgrp/session concepts, which have historical semantics that cannot be co-opted, can (generally) only be set on the current process, and can't be set to arbitrary values. In other words, they don't have much value in allowing system administrators to group and manage processes and their resource consumption.
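The setrlimit limitation is easy to see from user space. The sketch below uses Python's standard resource module, which wraps getrlimit and setrlimit; the helper name set_soft_limit is mine. Note that the call takes a resource and a pair of numbers, acts on the calling process, and has no notion of a group of tasks sharing one budget.

```python
import resource


def set_soft_limit(which: int, new_soft: int) -> tuple:
    """Set the soft limit for the *calling* process and return (soft, hard).

    The requested value is clamped to the hard limit, since an unprivileged
    process may not raise its soft limit above it.
    """
    _, hard = resource.getrlimit(which)
    if hard != resource.RLIM_INFINITY:
        if new_soft == resource.RLIM_INFINITY or new_soft > hard:
            new_soft = hard
    resource.setrlimit(which, (new_soft, hard))
    return resource.getrlimit(which)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Re-apply the current value: a no-op that still exercises the interface.
    print(set_soft_limit(resource.RLIMIT_NOFILE, soft))
```

Each process carries its own copy of the limit, so two processes capped at N open files can together use 2N; an aggregate cap across a set of processes is exactly what this interface cannot express.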
Additional benefits include the fact that control groups can be nested; they form a strict hierarchy. Their semantics when nested will depend on the specific resource controller used. Paul pointed out that while the framework itself has no real measurable performance impact, various resource controllers may trade off throughput for quality of service.
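Since nesting semantics are left to each controller, the following is only one possible interpretation, sketched in Python: a charge against a group must also fit within the limits of all of its ancestors. The ControlGroup class, its methods, and the numbers are hypothetical and do not reflect any actual controller's accounting.

```python
class ControlGroup:
    """Toy hierarchical group with an optional aggregate limit (hypothetical)."""

    def __init__(self, name, limit=None, parent=None):
        self.name = name
        self.limit = limit    # None means "no limit at this level"
        self.parent = parent
        self.usage = 0

    def _chain(self):
        """Yield this group and then each ancestor up to the root."""
        group = self
        while group is not None:
            yield group
            group = group.parent

    def try_charge(self, amount: int) -> bool:
        """Charge this group and all ancestors, or charge nothing at all."""
        chain = list(self._chain())
        for group in chain:
            if group.limit is not None and group.usage + amount > group.limit:
                return False
        for group in chain:
            group.usage += amount
        return True


if __name__ == "__main__":
    root = ControlGroup("root", limit=100)
    a = ControlGroup("a", limit=80, parent=root)
    b = ControlGroup("b", limit=80, parent=root)
    print(a.try_charge(70), b.try_charge(40), b.try_charge(30))
```

Here group b is refused a charge of 40 even though it is well under its own limit, because the parent's aggregate limit would be exceeded. A different controller could just as reasonably choose other rules, for example proportional sharing between siblings.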
When people asked about the size of the overall namespace code, the answer was that mostly existing lines of code are modified, with a minor addition of code on top of the modifications.
When people asked how close containers are to being able to replicate the functionality of Solaris Zones, the answer was that the most significant missing component today is the networking code. Once that is worked out, the differences in capability between Zones and Containers are relatively small. And chances are that Containers will provide some flexibility above what Zones provides in the foreseeable future.