Linux Container Internals (Part I)

Namespaces and control groups (cgroups) are responsible for the magic behind Linux containers. Namespace support first appeared in the 2.4.19 kernel (mount point/file system isolation), and contemporary mainline kernels provide six different types of namespace abstractions. From the kernel’s perspective, a container is just another process with its own set of resources: file descriptors, process address space and processor state. For instance, the master and worker processes of a containerized nginx web server show up on the host under their external PIDs:

[nedo@archrabbit]$ ps  -eo pid,comm,cmd | grep nginx
4971 nginx           nginx: master process nginx -g daemon off;
4987 nginx           nginx: worker process

From the output of the ps command it’s hard to differentiate containerized processes from other processes running on the host. We can provide another option to the ps command so it also shows the cgroup to which the process is bound.

[nedo@archrabbit]$ ps  -eo pid,comm,cmd,cgroup | grep nginx
4971 nginx           nginx: master process nginx      9:blkio:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
8:net_cls:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
7:devices:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
6:pids:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
5:cpu,cpuacct:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
4:freezer:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
3:cpuset:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
2:memory:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
1:name=systemd:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326
4987 nginx           nginx: worker process       9:blkio:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
8:net_cls:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
7:devices:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
6:pids:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
5:cpu,cpuacct:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
4:freezer:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
3:cpuset:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
2:memory:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326,
1:name=systemd:/docker/91dfd31b99a29c145b2f183970fc9c197261c8381463330aef8c262abe751326

The output is much more verbose now, but we can see that the processes are attached to cgroups of the docker hierarchy, and so we can assume they are running inside a container.
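The same check can be done programmatically. The following is a minimal sketch, assuming each entry has the hierarchy-ID:controller-list:path shape seen in the ps output above (the same layout the kernel exposes in /proc/[pid]/cgroup). The names CgroupEntry, parse_cgroup_line and looks_like_docker are made up for this illustration, and treating a path under /docker/ as "containerized" is only a heuristic.

```rust
use std::fs;

/// One cgroup entry: "hierarchy-ID:controller-list:cgroup-path".
#[derive(Debug, PartialEq)]
struct CgroupEntry {
    hierarchy_id: u32,
    controllers: Vec<String>,
    path: String,
}

/// Parses a single entry such as "2:memory:/docker/<id>".
fn parse_cgroup_line(line: &str) -> Option<CgroupEntry> {
    // Split into at most three parts so that colons inside the path survive.
    let mut parts = line.trim().splitn(3, ':');
    let hierarchy_id: u32 = parts.next()?.parse().ok()?;
    let controllers: Vec<String> = parts
        .next()?
        .split(',')
        .filter(|c| !c.is_empty())
        .map(String::from)
        .collect();
    let path = parts.next()?.to_string();
    Some(CgroupEntry { hierarchy_id: hierarchy_id, controllers: controllers, path: path })
}

/// Heuristic (an assumption): a path under "/docker/" suggests a Docker container.
fn looks_like_docker(entries: &[CgroupEntry]) -> bool {
    entries.iter().any(|e| e.path.starts_with("/docker/"))
}

fn main() {
    match fs::read_to_string("/proc/self/cgroup") {
        Ok(text) => {
            let entries: Vec<CgroupEntry> =
                text.lines().filter_map(parse_cgroup_line).collect();
            println!("attached to a docker cgroup: {}", looks_like_docker(&entries));
        }
        Err(e) => eprintln!("cannot read /proc/self/cgroup: {}", e),
    }
}
```

Run on the host this inspects the program's own cgroups; replacing self with a PID such as 4971 would inspect the nginx master process instead.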

Surprisingly, the life of a container starts with a clone system call, which creates a new process descriptor. Depending on the value of the flags argument, the newborn process can share a number of resources with its parent. Typically, child and parent share the same memory pages until one of them writes to a page, at which point the writing process gets its own private copy of that page (this optimization technique is commonly known as copy-on-write, or CoW). Besides memory, they often share the file descriptor table and file system information. However, the caller can also request separate instances of system resources for the child, including isolated namespaces, by providing one or more of the following flags: CLONE_NEWNS, CLONE_NEWPID, CLONE_NEWUSER, CLONE_NEWUTS, CLONE_NEWIPC, CLONE_NEWNET. For each flag a new namespace is created and the child process becomes a member of that namespace. From the process’s point of view, it has its own instance of the system resource, visible only to members of the same namespace. This is analogous to how the kernel, through processor virtualization and virtual memory, gives each process the illusion that it’s the only one running on the system (when it’s actually sharing CPU cycles and physical memory with other processes).
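Since the namespace flags are plain bits, several of them can be OR-ed together into a single flags argument. Here is a small sketch of that idea. The constants repeat the values from Linux's <sched.h> so that the snippet needs no external crate (the libc crate exports the same names); the requested_namespaces helper and the short namespace labels are made up for this illustration.

```rust
// Namespace flags with the values defined in Linux's <sched.h>.
const CLONE_NEWNS: i32 = 0x0002_0000;
const CLONE_NEWUTS: i32 = 0x0400_0000;
const CLONE_NEWIPC: i32 = 0x0800_0000;
const CLONE_NEWUSER: i32 = 0x1000_0000;
const CLONE_NEWPID: i32 = 0x2000_0000;
const CLONE_NEWNET: i32 = 0x4000_0000;

// Flag bit paired with a short label, in the order the flags are listed above.
const NAMESPACE_FLAGS: [(i32, &'static str); 6] = [
    (CLONE_NEWNS, "mnt"),
    (CLONE_NEWPID, "pid"),
    (CLONE_NEWUSER, "user"),
    (CLONE_NEWUTS, "uts"),
    (CLONE_NEWIPC, "ipc"),
    (CLONE_NEWNET, "net"),
];

/// Returns the labels of the namespaces that a clone/unshare flag mask requests.
fn requested_namespaces(flags: i32) -> Vec<&'static str> {
    NAMESPACE_FLAGS
        .iter()
        .filter(|entry| (flags & entry.0) != 0)
        .map(|entry| entry.1)
        .collect()
}

fn main() {
    let flags = CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC;
    println!("{:#x} requests {:?}", flags, requested_namespaces(flags));
}
```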

As an alternative to clone, a process can call the unshare system call to create new namespaces, or it can join an existing namespace via the setns syscall. The latter needs a file descriptor that identifies the namespace the process would like to join (the process can obtain the file descriptor by opening one of the files under /proc/[pid]/ns).
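The entries under /proc/[pid]/ns are symbolic links whose targets look like mnt:[4026531840], that is, the namespace type followed by an inode number. Two processes are in the same namespace when their links resolve to the same inode. The sketch below reads those links using only the standard library; parse_ns_link and namespace_inode are hypothetical helper names, and the inode in the comment is just a sample value.

```rust
use std::fs;

/// Parses a namespace link target such as "mnt:[4026531840]" into the
/// namespace type and its inode number.
fn parse_ns_link(target: &str) -> Option<(String, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode: u64 = rest.strip_suffix(']')?.parse().ok()?;
    Some((kind.to_string(), inode))
}

/// Reads /proc/<pid>/ns/<kind> and returns the namespace inode, if readable.
/// `pid` may be a number or "self".
fn namespace_inode(pid: &str, kind: &str) -> Option<u64> {
    let target = fs::read_link(format!("/proc/{}/ns/{}", pid, kind)).ok()?;
    let (_, inode) = parse_ns_link(target.to_str()?)?;
    Some(inode)
}

fn main() {
    for &kind in ["mnt", "pid", "ipc", "uts", "net", "user"].iter() {
        match namespace_inode("self", kind) {
            Some(inode) => println!("{}: {}", kind, inode),
            None => println!("{}: unavailable", kind),
        }
    }
}
```

Comparing namespace_inode("self", "mnt") with the value for a containerized PID is a quick way to tell whether that process lives in a different mount namespace. Reading another user's /proc/[pid]/ns entries may require elevated privileges.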

File system isolation

A mount namespace isolates the set of mount points seen by a group of processes, so those processes have a completely independent view of the file system hierarchy. Mount points are visible only to processes of the same mount namespace and, unless they are marked as shared, they don’t propagate to other mount namespaces, which lets a process have its own rootfs. To create a new mount namespace, we pass the CLONE_NEWNS flag to the clone syscall. In the example below, we allocate some memory for the child’s stack and pass the callback function that gets executed in the context of the child process. If the call to clone is successful, we’ll have a child process attached to a brand new mount namespace.

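fn main() {
    // the child runs on its own stack; 1 MiB is plenty for this example
    let mut stack = vec![0u8; 1024 * 1024];
    // the stack grows downward, so clone expects the address just past its end
    let stack_top = unsafe { stack.as_mut_ptr().add(stack.len()) };
    match unsafe {
        clone(child_cb,
              stack_top as *mut c_void,
              // SIGCHLD lets the parent wait for the child with waitpid
              CLONE_NEWNS | SIGCHLD,
              ptr::null_mut()
        )
    } {
        -1 => panic!("unable to create child process"),
        // don't exit before the child has finished
        pid => unsafe { waitpid(pid, ptr::null_mut(), 0); }
    }
}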

However, because the child process inherits a copy of the parent’s mount namespace, a call to pivot_root is required to change the root file system of the process. The pivot_root syscall requires the new root to be a mount point, and the new root and the directory that receives the old root must not be on the same mount as the current root. Bind mounting the rootfs directory onto itself, by calling the mount function with the MS_BIND flag, gets us out of that.

NOTE: pivot_root may also fail if the root file system is mounted as shared. To work around that, run this command: mount --make-rprivate /

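fn pivot_root(rootfs: String) -> Result<(), &'static str> {
    // C expects NUL-terminated strings, which Rust strings are not
    let root = std::ffi::CString::new(rootfs.clone())
        .map_err(|_| "rootfs path contains a NUL byte")?;
    // bind mount rootfs onto itself so that it becomes a mount point
    let rc = unsafe {
        mount(root.as_ptr(), root.as_ptr(), ptr::null(), MS_BIND, ptr::null())
    };
    if rc != 0 {
        return Err("unable to mount rootfs");
    }
    let oldrootfs = format!("{}/.oldrootfs", rootfs);
    if !Path::new(&oldrootfs).exists() {
        create_dir(&oldrootfs).map_err(|_| "unable to create .oldrootfs")?;
    }
    if sys_pivot_root(rootfs, oldrootfs) != 0 {
        return Err("unable to change rootfs");
    }
    // change to the new root directory (chdir is not an external command)
    std::env::set_current_dir("/").map_err(|_| "unable to change directory to /")
}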
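Whether the root file system is mounted as shared, the situation the NOTE above warns about, can be checked before calling pivot_root. The following is a rough sketch, assuming the /proc/[pid]/mountinfo layout documented in proc(5): six fixed fields, then zero or more optional fields such as shared:N, then a lone "-" separator. The mount_propagation helper is a made-up name, and the sample lines in the comments are illustrative.

```rust
use std::fs;

/// Returns (mount point, is_shared) for one line of /proc/[pid]/mountinfo,
/// e.g. "25 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw".
fn mount_propagation(line: &str) -> Option<(String, bool)> {
    let fields: Vec<&str> = line.split_whitespace().collect();
    // The first six fields are fixed; optional fields follow until a lone "-".
    let separator = 6 + fields.iter().skip(6).position(|f| *f == "-")?;
    let mount_point = fields[4].to_string();
    let shared = fields[6..separator]
        .iter()
        .any(|f| f.starts_with("shared:"));
    Some((mount_point, shared))
}

fn main() {
    match fs::read_to_string("/proc/self/mountinfo") {
        Ok(text) => {
            for (mount_point, shared) in text.lines().filter_map(mount_propagation) {
                if mount_point == "/" {
                    println!("/ is {}", if shared { "shared" } else { "not shared" });
                }
            }
        }
        Err(e) => eprintln!("cannot read /proc/self/mountinfo: {}", e),
    }
}
```

If the root turns out to be shared, running the mount --make-rprivate / command from the NOTE first should let pivot_root succeed.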

PID and IPC namespaces

Recall that the containerized nginx instance exposes its external PIDs on the host system. Besides that, every process inside a container has an internal PID. The first process inside the container gets PID 1 and acts as the init process (it waits for and reaps orphaned child processes). Because the PID number space is isolated, processes in different PID namespaces can have the same PID. A PID namespace is created by passing the CLONE_NEWPID flag to the clone or unshare system calls. An IPC namespace (as its name implies) isolates interprocess communication mechanisms such as POSIX message queues and System V IPC objects. Passing the CLONE_NEWIPC flag to the clone system call creates an isolated IPC namespace.
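On kernels 4.1 and later, the mapping between external and internal PIDs can be read from the NSpid line of /proc/[pid]/status, which lists the PID of the process in each nested PID namespace, outermost first. The sketch below parses that line; parse_nspid is a hypothetical helper, and the sample value in the comment simply pairs the host PID from the ps output with an assumed internal PID of 1.

```rust
use std::fs;

/// Extracts the PIDs from the "NSpid:" line of /proc/[pid]/status.
/// For a containerized process the line may look like "NSpid:\t4971\t1":
/// first the PID as seen from the outer namespace, then the PID inside.
fn parse_nspid(status: &str) -> Option<Vec<u32>> {
    let line = status.lines().find(|l| l.starts_with("NSpid:"))?;
    line["NSpid:".len()..]
        .split_whitespace()
        .map(|p| p.parse::<u32>().ok())
        .collect()
}

fn main() {
    match fs::read_to_string("/proc/self/status") {
        Ok(status) => match parse_nspid(&status) {
            Some(pids) => println!("PIDs from outermost to innermost namespace: {:?}", pids),
            None => println!("no NSpid line (kernel older than 4.1?)"),
        },
        Err(e) => eprintln!("cannot read /proc/self/status: {}", e),
    }
}
```

A process that is not inside a nested PID namespace reports a single value here.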

Here is the full source code that illustrates the creation of new mount, PID and IPC namespaces. Note that the code has been reduced to a minimum for the sake of simplicity. It uses the libc crate to invoke the system calls through the standard C library. If you are looking for a higher-level abstraction, check out nix.

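extern crate libc;

use libc::{c_void,
           c_int,
           c_long,
           clone,
           mount,
           syscall,
           waitpid,
           MS_BIND,
           SIGCHLD,
           SYS_pivot_root,
           CLONE_NEWNS,
           CLONE_NEWPID,
           CLONE_NEWIPC};

use std::env::set_current_dir;
use std::ffi::CString;
use std::ptr;
use std::path::Path;
use std::fs::create_dir;
use std::process::Command;

// placeholder: replace with the path to an extracted root file system
static ROOTFS: &'static str = "/path/to/rootfs";

extern "C" fn child_cb(_args: *mut c_void) -> c_int {
    match pivot_root(String::from(ROOTFS)) {
        Ok(()) => {
            // we are now inside container
            // execute any command of your choice
            println!("{:?}",
                     Command::new("cat").arg("/etc/issue").output());
        },
        Err(e) => {
            println!("error: {}", e);
        }
    };
    0
}

fn pivot_root(rootfs: String) -> Result<(), &'static str> {
    // C expects NUL-terminated strings, which Rust strings are not
    let root = CString::new(rootfs.clone())
        .map_err(|_| "rootfs path contains a NUL byte")?;
    // bind mount rootfs onto itself so that it becomes a mount point
    let rc = unsafe {
        mount(root.as_ptr(), root.as_ptr(), ptr::null(), MS_BIND, ptr::null())
    };
    if rc != 0 {
        return Err("unable to mount rootfs");
    }
    let oldrootfs = format!("{}/.oldrootfs", rootfs);
    if !Path::new(&oldrootfs).exists() {
        create_dir(&oldrootfs).map_err(|_| "unable to create .oldrootfs")?;
    }
    if sys_pivot_root(rootfs, oldrootfs) != 0 {
        return Err("unable to change rootfs");
    }
    // change to the new root directory (chdir is not an external command)
    set_current_dir("/").map_err(|_| "unable to change directory to /")
}

fn sys_pivot_root(root: String, oldroot: String) -> c_long {
    // convert to NUL-terminated strings before crossing into C
    let (root, oldroot) = match (CString::new(root), CString::new(oldroot)) {
        (Ok(r), Ok(o)) => (r, o),
        _ => return -1,
    };
    unsafe {
        // SYS_pivot_root is the syscall number for the target architecture
        syscall(SYS_pivot_root, root.as_ptr(), oldroot.as_ptr())
    }
}

fn main() {
    // the child runs on its own stack; 1 MiB is plenty for this example
    let mut stack = vec![0u8; 1024 * 1024];
    // the stack grows downward, so clone expects the address just past its end
    let stack_top = unsafe { stack.as_mut_ptr().add(stack.len()) };
    match unsafe {
        clone(child_cb,
              stack_top as *mut c_void,
              // SIGCHLD lets the parent wait for the child with waitpid
              CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | SIGCHLD,
              ptr::null_mut())
    } {
        -1 => panic!("unable to create child process"),
        // don't exit before the child has finished
        pid => unsafe { waitpid(pid, ptr::null_mut(), 0); }
    }
}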
