Deep-dive into Containerization: Creating containers from scratch
Published on Mar 6, 2022 | Updated on Jun 19, 2022

Estimated reading time : 14 minutes

We will go through how containers work and create a containerized process using the unshare linux command.

Tags: container, linux, os

Containers have changed how we design, develop, and deploy our applications. Today, cloud native technologies are transforming IT ecosystems, largely thanks to containerization. In this article, I’ll be building a bare-bones container runtime using shell commands. Implementing your own container runtime is not recommended for real workloads; the goal here is to get a better understanding of the lower-level Linux functionality, which helps with:
  1. Designing more secure images.
  2. Using images more efficiently.
  3. Debugging while using higher-level tools.

Linux Fundamentals

Before I start the deep dive, it is important to be familiar with a few Linux concepts.

Cgroups

Cgroups limit the resources, such as memory, CPU, and network input/output, that a group of processes can use. There is a hierarchy of control groups for each type of resource being managed, and each hierarchy is managed by a cgroup controller. Any Linux process is a member of one cgroup of each type, and when it is first created, a process inherits the cgroups of its parent.
The Linux kernel communicates information about cgroups through a set of pseudo-filesystems that typically reside at /sys/fs/cgroup. You can see the different types of cgroups on your system by listing the contents of that directory.
user@myPChostname:~$ ls /sys/fs/cgroup/
blkio  cpuacct      cpuset   freezer  memory  net_cls           net_prio    pids  systemd
cpu    cpu,cpuacct  devices  hugetlb  misc    net_cls,net_prio  perf_event  rdma  unified
If you have docker installed on the system and look inside the /sys/fs/cgroup/memory directory, you’ll find a directory for docker. All the files in this directory define different kinds of memory limits on your docker containers. You’ll find a similar directory in /sys/fs/cgroup/cpu where the cpu limits for your docker containers are defined.
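To see which cgroups a particular process belongs to, you can read /proc/<pid>/cgroup. The snippet below is a minimal sketch; show_cgroups is a helper name I made up for illustration, and the output depends on whether the host uses cgroup v1 (one line per hierarchy) or cgroup v2 (a single line starting with 0::).

```shell
#!/bin/sh
# show_cgroups is a hypothetical helper name, not a standard tool.
# It prints the cgroup membership of a process (default: this shell).
# Each line has the form  hierarchy-ID:controller-list:cgroup-path
show_cgroups() {
    pid="${1:-$$}"
    cat "/proc/${pid}/cgroup"
}

show_cgroups
```

Passing the PID of a process running inside a Docker container should show a path that contains the container's cgroup rather than the path of your login shell.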

Namespaces

By putting a process in a namespace, you can restrict the resources that are visible to that process.
As of kernel 5.6, Linux provides 8 types of namespaces:
  • pid : provides a process with its own set of process IDs
  • net : allows processes to have their own network stack
  • mnt : gives processes their own view of the filesystem mount points
  • ipc : isolates inter-process communication resources such as System V shared memory segments and POSIX message queues
  • user : provides processes with their own set of user IDs and group IDs
  • uts : allows processes to have their own hostname and domain name
  • cgroup : allows a process to have its own cgroup root directory
  • time : virtualizes the system's monotonic and boot-time clocks
A process is always in exactly one namespace of each type. When you start a Linux system it has a single namespace of each type. You can easily see the namespaces on your machine using the lsns command.
The unshare command runs a program in new namespaces that are not shared with its parent process. The nsenter command does the opposite: it runs a program inside the existing namespaces of another process. In this article, I’ll stick to using unshare.
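The namespaces of a process are also exposed as symbolic links under /proc/<pid>/ns, which is handy when lsns is not available. Here is a small sketch that prints them; list_namespaces is a hypothetical helper name, not a standard command.

```shell
#!/bin/sh
# list_namespaces is a hypothetical helper name.
# For a process (default: this shell), print each namespace type and the
# identifier of the namespace the process currently belongs to.
list_namespaces() {
    pid="${1:-$$}"
    for ns in /proc/"${pid}"/ns/*; do
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

list_namespaces
```

Two processes are in the same namespace of a given type when the identifiers in square brackets match. Reading the links of a process owned by another user generally requires root.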

Creating a Containerized Process

Containers seem to be very similar to virtual machines, but it’s crucial to understand that they are very different. While virtual machines emulate a complete machine, including the operating system and a kernel, containers share the kernel of the host machine and, as explained, are only isolated processes.

Hostname

Let’s start by isolating the hostname. If you run the hostname command from within a docker container, you can see that it’s a different hostname than your host.
user@myPChostname:~$ hostname
myPChostname
user@myPChostname:~$ docker run --rm -it --name hello centos bash
[root@f1e54241a12b /]$ hostname
f1e54241a12b
To achieve a similar isolation, we need to give our process its own UTS namespace using the unshare command.
💡 I am running these bash commands on an Ubuntu VM created using multipass.
To create a new UTS namespace, we can use the --uts flag with unshare.
ubuntu@host:/$ hostname
host
ubuntu@host:/$ sudo unshare --uts bash
root@host:/$ hostname child
root@host:/$ hostname
child
If you open another terminal window on the same host before exiting this shell, you can confirm that the hostname hasn’t changed for the whole (virtual) machine.
ubuntu@host:/$ hostname
host
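Another way to confirm the isolation is to compare namespace identifiers instead of hostnames. The following is a rough sketch under the assumption that /proc is mounted; same_namespace is a helper name I introduced for illustration.

```shell
#!/bin/sh
# same_namespace TYPE PID_A PID_B
# Succeeds (exit status 0) when both processes are in the same namespace
# of the given type, e.g. "uts". The function name is hypothetical.
same_namespace() {
    a="$(readlink "/proc/$2/ns/$1")" || return 2
    b="$(readlink "/proc/$3/ns/$1")" || return 2
    [ "$a" = "$b" ]
}

# A process trivially shares a UTS namespace with itself.
if same_namespace uts "$$" "$$"; then
    echo "same UTS namespace"
else
    echo "different UTS namespace"
fi
```

Called as root with the PID of the shell started by unshare and the PID of your login shell, the function should fail, because the two shells no longer share a UTS namespace.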

Filesystem

Next, we need to give our containerized process its own root filesystem so that it cannot access the host's root. We’ll use the --root option, which sets a directory as the new root. Before we do that, the directory needs to contain a root filesystem, with directories such as /bin and /proc. So I am going to download the Alpine minirootfs to quickly create a minimal root filesystem in my new directory. You can also export root filesystems from existing Docker containers if you want.
ubuntu@host:~$ mkdir container_root
ubuntu@host:~$ cd container_root/
ubuntu@host:~/container_root$ curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/latest-stable/releases/x86_64/alpine-minirootfs-3.15.0-x86_64.tar.gz
ubuntu@host:~/container_root$ tar xvf alpine.tar.gz
ubuntu@host:~/container_root$ rm alpine.tar.gz
ubuntu@host:~/container_root$ ls
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr
Now we can use the --root option with the unshare command:
ubuntu@host:/$ sudo unshare --uts \
    --root=/home/ubuntu/container_root \
    sh
/$ pwd
/
/$ ls /
bin  etc   lib    mnt  proc  run   srv  tmp  var
dev  home  media  opt  root  sbin  sys  usr
The root directory of the containerized process is no longer the root directory of our host system. This also means we can’t run commands from the host's /bin that are not present in the new root filesystem.
ubuntu@host:/$ sudo unshare --uts \
    --root=/home/ubuntu/container_root \
    bash
chroot: failed to run command ‘bash’: No such file or directory
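To avoid this kind of surprise, you can check whether a command exists in the new root filesystem before launching it. This is only a sketch: rootfs_has_command is a hypothetical helper, and the list of directories it searches is my assumption about a typical root filesystem layout.

```shell
#!/bin/sh
# rootfs_has_command ROOT CMD
# Succeeds when CMD exists in one of the usual binary directories under
# ROOT. The helper name and the directory list are assumptions made for
# this sketch.
rootfs_has_command() {
    root="$1"
    cmd="$2"
    for dir in bin sbin usr/bin usr/sbin; do
        path="${root}/${dir}/${cmd}"
        # Symlinks are accepted without resolving them: seen from outside,
        # an absolute link target would resolve against the host's root.
        if [ -x "$path" ] || [ -L "$path" ]; then
            return 0
        fi
    done
    return 1
}

# Example: check the host's own root filesystem for sh.
if rootfs_has_command / sh; then
    echo "sh found"
else
    echo "sh missing"
fi
```

With the directory from this article, rootfs_has_command /home/ubuntu/container_root bash would fail while the same check for sh would succeed.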

Processes

Now if we run the ps command inside this process, we can’t see any processes at all. That’s because ps works by reading the /proc pseudo-filesystem, which is not yet mounted in the new root. Every process has its own directory within /proc. You can run ls /proc on your Linux system to see what it looks like. You can read more about the /proc filesystem here.
You can mount /proc using the mount command or by using the --mount-proc flag.
ubuntu@host:/$ sudo unshare --uts \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:01 {systemd} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [rcu_gp]
    4 root      0:00 [rcu_par_gp]
    6 root      0:00 [kworker/0:0H-kb]
... <truncated>
Now we can see all the processes running on the host, which is not right. Containers should not be able to see the processes of the host machine. To hide the host processes, we use the --pid flag with unshare to get a new PID namespace. Along with that, we also need the --fork flag: unshare itself stays in the original PID namespace and only its children are placed in the new one, so --fork makes unshare run the specified program as a child process (which becomes PID 1) rather than running it directly.
ubuntu@host:/$ sudo unshare --uts \
    --pid --fork \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    2 root      0:00 ps
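Since ps builds its list from /proc, a very rough equivalent can be written in a few lines of shell. This is a simplified sketch and not how ps is actually implemented; list_procs is a name I made up.

```shell
#!/bin/sh
# list_procs is a hypothetical, much simplified stand-in for ps.
# Every numeric directory under /proc is a process; its "comm" file
# holds the command name.
list_procs() {
    for dir in /proc/[0-9]*; do
        pid="${dir##*/}"
        # A process may exit while we iterate, so skip unreadable entries.
        name="$(cat "${dir}/comm" 2>/dev/null)" || continue
        printf '%5s %s\n' "$pid" "$name"
    done
}

list_procs
```

When /proc is not mounted, the loop finds no numeric directories and prints nothing, which matches the empty ps output seen earlier.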

Mounts

Now we have our processes isolated. The next namespace to look at is the mount namespace, which we request with the --mount flag. (The --mount-proc flag used above already implies a new mount namespace; passing --mount makes it explicit.) This isolation ensures that mounts made for one container, such as host directories mounted into it, are not visible from other containers.
ubuntu@host:~$ sudo unshare --uts \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh

Networking interfaces

Containers have their own network interfaces and routing tables. This requires the process to have a separate network namespace, which can be created using the --net flag.
ubuntu@host:~$ sudo unshare --uts \
    --net \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
When we create a network namespace, it contains only a loopback interface, and with that alone the container cannot communicate with anything outside itself. We need to establish a virtual Ethernet (veth) pair connecting the container's network namespace to the default one.
Keep the containerized process running in one terminal and open another with root privileges. There, create the virtual Ethernet pair on your host machine. You’ll need a PID from the container's network namespace for that, which the lsns command can show.
ubuntu@host:/$ sudo lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      93     1 root unassigned      /sbin/init
4026532193 net       2  2241 root unassigned      unshare --uts --net --pid --fork --mount --mo
ubuntu@host:/$ sudo ip link add ve1 netns 2241 type veth peer name ve2 netns 1
Then we need to bring both ends of the connection up. On the host machine:
ubuntu@host:/$ sudo ip link set ve2 up
In the container process (we are already root there, and the Alpine root filesystem has no sudo):
/$ ip link set ve1 up
Now that the link is up, we assign IP addresses. On the host machine, run:
ubuntu@host:/$ sudo ip addr add 192.168.1.200/24 dev ve2
In the container process, run:
/$ ip addr add 192.168.1.100/24 dev ve1
Now you should be able to ping the host from the container and vice versa, which lets your container communicate with processes outside its network namespace.
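The same wiring can also be done entirely from the host terminal by using nsenter to run ip inside the container's network namespace. The script below is a sketch of that idea, not a tested tool: connect_container, run, and the DRY_RUN switch are names I introduced, and by default the commands are only printed so you can review them first.

```shell
#!/bin/sh
# Sketch: wire up the veth pair from the host side only.
# connect_container and run are hypothetical helper names. With DRY_RUN=1
# (the default here) commands are printed instead of executed; run as root
# with DRY_RUN=0 to apply them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# connect_container PID
# PID is a process inside the container's network namespace.
connect_container() {
    pid="$1"
    run ip link add ve1 netns "$pid" type veth peer name ve2 netns 1
    run ip link set ve2 up
    run ip addr add 192.168.1.200/24 dev ve2
    # nsenter -n runs the host's ip binary inside the target network namespace.
    run nsenter -t "$pid" -n ip link set ve1 up
    run nsenter -t "$pid" -n ip addr add 192.168.1.100/24 dev ve1
}

# 2241 is the example PID from the lsns output above; substitute your own.
connect_container 2241
```

One advantage of this approach is that it does not depend on the ip command being available inside the container's root filesystem.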

Inter process communication

Processes can communicate with each other through shared memory and other IPC mechanisms. For that, they need to be part of the same IPC namespace. We generally don’t want our containers to be able to access one another’s shared memory, so we give the container its own IPC namespace with the --ipc flag.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh

Cgroups

You can use the --cgroup flag to create a new cgroup namespace, which makes sure that your container process cannot see the cgroup hierarchy above its own cgroup.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc --cgroup \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh

Users

Currently, the user in the containerized process is the root user because we use sudo.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/ $ id
uid=0(root) gid=0(root) groups=0(root)
To prevent this, we create a separate user namespace for the container process with the --user flag.
ubuntu@host:~$ sudo unshare --user \
    --uts --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
~ $ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
Now the user assigned to the container process is “nobody”, because no user ID mapping has been set up yet. We can map user IDs in the container to a non-root user on the host by writing to /proc/<pid>/uid_map, where <pid> is your container process PID. When you run unshare with the --user flag, the user namespace is created first, and the process gets full capabilities inside that user namespace. This means the remaining namespaces can be created without sudo, allowing us to run containers without any root privileges (rootless containers).
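Each line of a uid_map file has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below turns such a file into a readable sentence; describe_id_map is a hypothetical helper, and it only reads the file, it does not change any mapping.

```shell
#!/bin/sh
# describe_id_map FILE
# Explain each line of a uid_map or gid_map file. The three fields are:
# first ID inside the namespace, first ID outside it, and range length.
# The helper name is hypothetical.
describe_id_map() {
    awk '{ last_in = $1 + $3 - 1; last_out = $2 + $3 - 1
           printf "IDs %.0f-%.0f inside map to %.0f-%.0f outside\n", $1, last_in, $2, last_out }' "$1"
}

# Show the mapping of the current process when the file is available.
if [ -r /proc/self/uid_map ]; then
    describe_id_map /proc/self/uid_map
fi
```

For example, a uid_map containing the line 0 1000 1 means that UID 0 inside the container corresponds to UID 1000 on the host. Writing such a line requires suitable privileges over the target process, and the map of a user namespace can be written only once.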
ubuntu@host:~$ unshare --uts \
    --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
unshare: unshare failed: Operation not permitted
ubuntu@host:~$ unshare --user \
    --uts --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
~ $
And here we have our container ready!!
Container runtime tools are wrappers around these built-in kernel features, and they make them easier and more flexible to configure.

More resources