Namespaces and control groups (cgroups) are responsible for the magic behind Linux containers. Support for namespaces first appeared in the 2.4.19 kernel (mount point/file system isolation). Mainline kernels have since grown to six namespace types (mount, PID, user, UTS, IPC and network), and more recent releases add cgroup and time namespaces on top of those. From the kernel’s perspective, a container is just another process with its own set of resources: file descriptors, process address space and processor state. For instance, the master and worker processes of a containerized nginx web server show up on the host with ordinary (external) PIDs:
From the output of the ps command it’s hard to tell containerized processes apart from other processes running on the host. We can pass another option to the ps command so that it also shows the cgroup to which each process is bound.
The output is much more verbose now, but we can see that the processes are attached to cgroups of the docker hierarchy, and so we can assume they are running inside a container.
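The same check can be done programmatically by reading /proc/[pid]/cgroup, whose lines have the form hierarchy-ID:controller-list:cgroup-path. The following is a minimal sketch, not part of the original code: the CgroupEntry type, the function names and the "path mentions docker" heuristic are assumptions made for illustration.

```rust
use std::fs;

/// One record of /proc/<pid>/cgroup: "hierarchy-ID:controller-list:cgroup-path".
#[derive(Debug, PartialEq)]
struct CgroupEntry {
    hierarchy_id: u32,
    controllers: Vec<String>,
    path: String,
}

/// Parses a single line; returns None if the line is malformed.
fn parse_cgroup_line(line: &str) -> Option<CgroupEntry> {
    // The path may itself contain ':', so split into at most three fields.
    let mut fields = line.splitn(3, ':');
    let hierarchy_id = fields.next()?.parse().ok()?;
    let controllers = fields
        .next()?
        .split(',')
        .filter(|c| !c.is_empty())
        .map(str::to_string)
        .collect();
    let path = fields.next()?.to_string();
    Some(CgroupEntry { hierarchy_id, controllers, path })
}

/// Heuristic (an assumption of this sketch): treat the process as running
/// under Docker if any of its cgroup paths mentions "docker".
fn looks_like_docker(entries: &[CgroupEntry]) -> bool {
    entries.iter().any(|e| e.path.contains("docker"))
}

fn main() {
    // Inspect the current process; substitute a numeric PID for "self"
    // to inspect another one.
    let text = fs::read_to_string("/proc/self/cgroup").unwrap_or_default();
    let entries: Vec<CgroupEntry> = text.lines().filter_map(parse_cgroup_line).collect();
    for e in &entries {
        println!("{} [{}] {}", e.hierarchy_id, e.controllers.join(","), e.path);
    }
    println!("docker cgroup: {}", looks_like_docker(&entries));
}
```

The exact cgroup path depends on the container runtime and cgroup driver, so the substring match is only a rough indicator.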
Surprisingly, the life of a container starts with a call to the clone system call, which creates a new process descriptor. The newborn process can share a number of resources with its parent, depending on the value of the flags argument. Typically, child and parent share the same physical memory pages until one of them writes to a page, at which point the kernel copies that page into the address space of the writing process (an optimization commonly known as copy-on-write, or CoW). Besides memory, they often share the file descriptor table and file system information. However, the child process may ask for separate system resources, including isolated namespaces, by providing one or more of the following flags: CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWNET. For each flag a new namespace is created and the child process becomes a member of that namespace. The process perceives its own instance of the system resource, visible only to members of the same namespace. This is analogous to how the kernel, through processor virtualization and virtual memory, gives a process the illusion that it is the only one running on the system (when it is actually sharing CPU cycles and physical memory with other processes).
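To make the role of the flags argument concrete, here is a hedged sketch of a clone call. To stay dependency-free it declares the C functions and constants itself instead of using a crate; the constant values come from the Linux headers, SIGCHLD = 17 assumes x86 or ARM, and the names child_main and child_is_pid_one are hypothetical. The child simply reports whether it sees itself as PID 1, which only happens when CLONE_NEWPID is requested.

```rust
use std::io;
use std::os::raw::{c_int, c_void};

// Values from the Linux UAPI headers.
const CLONE_NEWNS: c_int = 0x0002_0000;
const CLONE_NEWIPC: c_int = 0x0800_0000;
const CLONE_NEWPID: c_int = 0x2000_0000;
const SIGCHLD: c_int = 17; // assumption: x86/ARM; a few architectures differ

extern "C" {
    fn clone(
        cb: extern "C" fn(*mut c_void) -> c_int,
        stack: *mut c_void,
        flags: c_int,
        arg: *mut c_void,
        ...
    ) -> c_int;
    fn waitpid(pid: c_int, status: *mut c_int, options: c_int) -> c_int;
    fn getpid() -> c_int;
}

/// Runs in the child; its return value becomes the child's exit code:
/// 1 if the child sees itself as PID 1, 0 otherwise.
extern "C" fn child_main(_arg: *mut c_void) -> c_int {
    (unsafe { getpid() } == 1) as c_int
}

/// Clones a child with the given namespace flags and reports whether the
/// child observed itself as PID 1.
fn child_is_pid_one(ns_flags: c_int) -> io::Result<bool> {
    const STACK_SIZE: usize = 1024 * 1024;
    let mut stack = vec![0u8; STACK_SIZE];
    // The stack grows downward, so clone expects a pointer to its top,
    // aligned here to 16 bytes.
    let top = unsafe { stack.as_mut_ptr().add(STACK_SIZE) };
    let top = ((top as usize) & !0xf) as *mut c_void;

    // SIGCHLD in the low byte makes the child waitable like a forked one.
    let pid = unsafe { clone(child_main, top, ns_flags | SIGCHLD, std::ptr::null_mut()) };
    if pid < 0 {
        return Err(io::Error::last_os_error());
    }
    let mut status: c_int = 0;
    if unsafe { waitpid(pid, &mut status, 0) } < 0 {
        return Err(io::Error::last_os_error());
    }
    if status & 0x7f != 0 {
        // Equivalent of !WIFEXITED(status).
        return Err(io::Error::new(io::ErrorKind::Other, "child did not exit normally"));
    }
    Ok((status >> 8) & 0xff == 1) // WEXITSTATUS(status) == 1
}

fn main() {
    match child_is_pid_one(CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC) {
        Ok(is_init) => println!("child saw itself as PID 1: {}", is_init),
        Err(e) => eprintln!("clone failed (new namespaces usually need CAP_SYS_ADMIN): {}", e),
    }
}
```

Passing 0 for the namespace flags degrades the call to an ordinary fork-like clone, which is a convenient way to exercise the plumbing without privileges.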
A process can also call the unshare system call to create new namespaces for itself, or join an existing namespace via the setns syscall. The latter needs a file descriptor that identifies the namespace to join (the process can obtain one by opening the corresponding file under /proc/[pid]/ns).
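The entries under /proc/[pid]/ns are symbolic links whose targets look like mnt:[4026531840]; two processes are in the same namespace when the inode number in brackets matches. The sketch below lists those links for a process. The function names are hypothetical, and the code only reads the links; it does not call setns.

```rust
use std::fs;
use std::io;

/// Parses a namespace link target such as "mnt:[4026531840]" into the
/// namespace type and its inode number.
fn parse_ns_link(target: &str) -> Option<(String, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode = rest.strip_suffix(']')?.parse().ok()?;
    Some((kind.to_string(), inode))
}

/// Lists the namespaces of a process; `pid` is "self" or a numeric PID.
fn namespaces_of(pid: &str) -> io::Result<Vec<(String, u64)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(format!("/proc/{}/ns", pid))? {
        let target = fs::read_link(entry?.path())?;
        if let Some(ns) = parse_ns_link(&target.to_string_lossy()) {
            found.push(ns);
        }
    }
    found.sort();
    Ok(found)
}

fn main() {
    match namespaces_of("self") {
        Ok(list) => {
            for (kind, inode) in list {
                println!("{:<20} {}", kind, inode);
            }
        }
        Err(e) => eprintln!("cannot read /proc/self/ns: {}", e),
    }
}
```

Comparing the output for a host process and a containerized process shows which namespaces the two actually share.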
File system isolation
A mount namespace isolates the set of mount points seen by a group of processes, so those processes have a completely independent view of the file system hierarchy. The mount points are visible only to the processes of the same mount namespace and don’t propagate to other mount namespaces, which lets a process have its own rootfs. To create a new mount namespace, we pass the CLONE_NEWNS flag to the clone syscall. In the example above, we allocate some memory for the child’s stack and pass the callback function that gets executed in the context of the child process. If the call to clone is successful, we’ll have a child process attached to a brand new mount namespace.
However, because the child process inherits a copy of the parent’s mount namespace, a call to pivot_root is required to change the root file system of the process. The pivot_root syscall requires the new root to be a mount point that does not reside on the same mount as the current root. Bind-mounting the new root directory onto itself by calling the mount function with the MS_BIND flag satisfies that requirement.
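The sequence can be sketched as follows. This is an illustration under several assumptions rather than the original code: it uses unshare instead of clone to keep the example short, declares the C functions directly, hard-codes the pivot_root syscall number for x86_64 and aarch64 only, and uses the pivot_root(".", ".") idiom described in the pivot_root(2) manual page. The function name enter_rootfs is hypothetical, and the call needs CAP_SYS_ADMIN to succeed.

```rust
use std::ffi::CString;
use std::io;
use std::os::raw::{c_char, c_int, c_long, c_ulong, c_void};
use std::ptr;

// Values from the Linux headers.
const CLONE_NEWNS: c_int = 0x0002_0000;
const MS_BIND: c_ulong = 4096;
const MS_REC: c_ulong = 16384;
const MS_PRIVATE: c_ulong = 1 << 18;
const MNT_DETACH: c_int = 2;
// glibc has no pivot_root wrapper; the syscall number is per-architecture.
#[cfg(target_arch = "x86_64")]
const SYS_PIVOT_ROOT: c_long = 155;
#[cfg(target_arch = "aarch64")]
const SYS_PIVOT_ROOT: c_long = 41;

extern "C" {
    fn unshare(flags: c_int) -> c_int;
    fn mount(src: *const c_char, target: *const c_char, fstype: *const c_char,
             flags: c_ulong, data: *const c_void) -> c_int;
    fn umount2(target: *const c_char, flags: c_int) -> c_int;
    fn chdir(path: *const c_char) -> c_int;
    fn syscall(number: c_long, ...) -> c_long;
}

/// Converts a C-style return value into an io::Result using errno.
fn check(ret: c_long) -> io::Result<()> {
    if ret < 0 { Err(io::Error::last_os_error()) } else { Ok(()) }
}

/// Moves the calling process into a new mount namespace rooted at `new_root`.
fn enter_rootfs(new_root: &str) -> io::Result<()> {
    let root = CString::new("/")?;
    let dot = CString::new(".")?;
    let new_root = CString::new(new_root)?;
    unsafe {
        check(unshare(CLONE_NEWNS) as c_long)?;
        // Same effect as `mount --make-rprivate /`: stop mount events from
        // propagating back to the parent namespace.
        check(mount(ptr::null(), root.as_ptr(), ptr::null(),
                    MS_REC | MS_PRIVATE, ptr::null()) as c_long)?;
        // Bind-mount the directory onto itself so that it becomes a mount point.
        check(mount(new_root.as_ptr(), new_root.as_ptr(), ptr::null(),
                    MS_BIND | MS_REC, ptr::null()) as c_long)?;
        check(chdir(new_root.as_ptr()) as c_long)?;
        // Stack the old root on top of the new one, then detach it.
        check(syscall(SYS_PIVOT_ROOT, dot.as_ptr(), dot.as_ptr()))?;
        check(umount2(dot.as_ptr(), MNT_DETACH) as c_long)?;
        check(chdir(root.as_ptr()) as c_long)?;
    }
    Ok(())
}

fn main() {
    // "/tmp/rootfs" is a placeholder path for an unpacked root file system.
    if let Err(e) = enter_rootfs("/tmp/rootfs") {
        eprintln!("could not switch root: {}", e);
    }
}
```

Each step returns early with the errno-based error, which makes it easy to see whether the failure came from missing privileges, a shared root mount, or a bad path.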
NOTE: pivot_root may also fail if the root file system is mounted as shared. To work around that, run this command: mount --make-rprivate /
PID and IPC namespaces
Recall that the containerized nginx instance exposes its external PIDs on the host system. Besides that, every process inside the container has an internal PID. The first process inside the container gets the internal PID 1 and acts as the init process (it waits for and reaps orphaned child processes). Because each PID namespace has its own PID number space, processes in different PID namespaces can have the same PID. A PID namespace is created by passing the CLONE_NEWPID flag to the clone or unshare system calls.
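On kernels 4.1 and later, the mapping between external and internal PIDs is visible in the NSpid field of /proc/[pid]/status, which lists the PID of the process in each nested PID namespace, outermost first. A small sketch of reading it follows; the function name parse_nspid is hypothetical.

```rust
use std::fs;

/// Extracts the NSpid field from the contents of /proc/<pid>/status:
/// the PID of the process in each nested PID namespace, outermost first.
fn parse_nspid(status: &str) -> Option<Vec<u32>> {
    let line = status.lines().find(|l| l.starts_with("NSpid:"))?;
    line["NSpid:".len()..]
        .split_whitespace()
        .map(|p| p.parse().ok())
        .collect()
}

fn main() {
    // Inspect the current process; substitute a numeric PID for "self".
    let status = fs::read_to_string("/proc/self/status").unwrap_or_default();
    match parse_nspid(&status) {
        Some(pids) => println!("PID per namespace level: {:?}", pids),
        None => println!("NSpid not available on this kernel"),
    }
}
```

For a process running as init inside a container, the last value in the list would be 1 while the first is its PID on the host.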
An IPC namespace (as its name implies) isolates interprocess communication mechanisms such as POSIX message queues and System V IPC objects. Passing the CLONE_NEWIPC flag to the clone system call creates an isolated IPC namespace.
Here is the full source code that illustrates the creation of new mount, PID and IPC namespaces. Note that the code has been reduced to the minimum for simplicity. It uses the libc crate to invoke the system calls through the standard C library. If you are looking for a higher-level abstraction, check out nix.