Deep-dive into Containerization: Creating containers from scratch
Published on Mar 6, 2022. Updated on Oct 20, 2022.

Estimated reading time : 14 minutes

We will review how containers work and create a containerized process using the unshare Linux command.


Containers have changed how we design, develop, and deploy our applications. Today, cloud-native technologies are transforming IT ecosystems, largely thanks to containerization. In this article, I’ll create a container runtime using shell commands. Implementing your own container runtime is not recommended for real workloads; the goal here is a better understanding of lower-level Linux functionality, which will help in:
  1. Designing more secure images.
  2. Using images more efficiently.
  3. Debugging while using higher-level tools.

Linux Fundamentals

Before I start the deep dive, you must be familiar with certain Linux concepts.


Cgroups limit the resources that a group of processes can use, such as memory, CPU, and network input/output. There is a hierarchy of control groups for each resource type, and each hierarchy is managed by a cgroup controller. Any Linux process is a member of one cgroup of each type, and when it is first created, a process inherits the cgroups of its parent.
The Linux kernel communicates information about cgroups through pseudo-filesystems that typically reside at /sys/fs/cgroup. You can see the different types of cgroups on your system by listing the contents of that directory.
user@myPChostname:~$ ls /sys/fs/cgroup/
blkio cpuacct cpuset freezer memory net_cls net_prio pids systemd
cpu cpu,cpuacct devices hugetlb misc net_cls,net_prio perf_event rdma unified
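If you have Docker installed on the system and look inside the /sys/fs/cgroup/memory directory, you’ll find a directory for Docker. The files in this directory define the memory limits for your Docker containers. You’ll find a similar directory in /sys/fs/cgroup/cpu where the CPU limits for your Docker containers are defined. These per-controller paths belong to the cgroup v1 layout shown in the listing above; on a system using the unified cgroup v2 hierarchy, all controllers share a single tree under /sys/fs/cgroup.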
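To check which layout a machine uses and which cgroups the current shell belongs to, a small script along these lines can help. It is only a sketch: cgroup_layout is a helper name made up for this example, and it tells the layouts apart simply by looking for well-known files.

```shell
#!/bin/sh
# Report which cgroup hierarchy layout this machine appears to use.
# cgroup_layout is a hypothetical helper, not a standard tool.
cgroup_layout() {
    if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
        echo "v2"        # unified hierarchy: one tree for all controllers
    elif [ -d /sys/fs/cgroup/memory ]; then
        echo "v1"        # one directory per controller (or a hybrid setup)
    else
        echo "unknown"   # cgroup filesystem not mounted or not visible here
    fi
}

echo "cgroup layout: $(cgroup_layout)"

# Each line of /proc/self/cgroup reads hierarchy-ID:controller-list:cgroup-path.
if [ -r /proc/self/cgroup ]; then
    cat /proc/self/cgroup
fi
```

Because a child inherits the cgroups of its parent, the paths printed by cat are also the cgroups of the shell that launched it.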


By putting a process in a namespace, you can restrict the resources visible to that process.
As of Linux kernel 5.6, there are 8 types of namespaces:
  • pid : provides a process with its own set of process IDs
  • net : allows processes to have their own network stack
  • mnt : abstracts filesystem view and manages mount points
  • ipc : provides separation of named shared memory segments
  • user : provides processes with their own set of user IDs and group IDs
  • uts : allows processes to have their own hostname and domain name
  • cgroup : allows a process to have its own set of cgroup root directories
  • time : virtualizes the monotonic and boot-time clocks of the system
A process is always in exactly one namespace of each type. When you start a Linux system, it has a single namespace of each type. You can easily see the namespaces on your machine using the lsns command.
The unshare command lets us run a program in new namespaces that it does not share with its parent process. The related nsenter command runs a program inside the existing namespaces of another process. In this article, I’ll stick to using unshare.
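Besides lsns, the namespaces of a process can be inspected directly under /proc/<pid>/ns, where each entry is a symlink naming the namespace. The sketch below uses a helper called ns_id, which is a name invented for this example, to show that a normally started child ends up in the same UTS namespace as its parent.

```shell
#!/bin/sh
# Print the identifier of one namespace of one process, e.g. "uts:[4026531838]".
# $1 = namespace type (uts, pid, net, ...), $2 = PID (defaults to this shell).
ns_id() {
    readlink "/proc/${2:-$$}/ns/$1" 2>/dev/null || echo "unavailable"
}

parent_uts=$(ns_id uts)

# A child started the usual way inherits its parent's namespaces,
# so it should report the same identifier.
child_uts=$(sh -c 'readlink /proc/$$/ns/uts 2>/dev/null || echo unavailable')

echo "parent: $parent_uts"
echo "child:  $child_uts"
```

Two processes share a namespace exactly when these identifiers match, which is a quick way to confirm later that unshare really created a new one.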

Creating a Containerized Process

Containers seem very similar to virtual machines, but it’s crucial to understand that they are very different. While virtual machines emulate a complete machine, including the operating system and a kernel, containers share the kernel of the host machine and, as explained, are only isolated processes.


Let’s start by isolating the hostname. If you run the hostname command from within a docker container, you can see that it’s a different hostname than your host.
user@myPChostname:~$ hostname
myPChostname
user@myPChostname:~$ docker run --rm -it --name hello centos bash
[root@f1e54241a12b /]$ hostname
f1e54241a12b
To achieve similar isolation, we need to give our process its own UTS namespace using the unshare command.
I am running these bash commands on an Ubuntu VM created using Multipass.
To create a new UTS namespace, we can use the --uts flag with unshare.
ubuntu@host:/$ hostname
host
ubuntu@host:/$ sudo unshare --uts bash
root@host:/$ hostname child
root@host:/$ hostname
child
If you open another terminal window to the same host before exiting the unshared shell, you can confirm that the hostname hasn’t changed for the whole (virtual) machine.
ubuntu@host:/$ hostname
host
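The same check can be scripted without a second terminal. The following is a sketch that assumes the unshare and hostname tools are installed; it needs root (or CAP_SYS_ADMIN) to create the namespace and only reports a message when that is not available.

```shell
#!/bin/sh
# Hostname of the machine before we touch anything.
before=$(uname -n)

# Set a new hostname inside a throwaway UTS namespace and read it back there.
# The namespace disappears as soon as the inner shell exits.
if inside=$(unshare --uts sh -c 'hostname child && uname -n' 2>/dev/null); then
    echo "inside namespace: $inside"
else
    inside="unavailable"
    echo "could not create a UTS namespace here (try running as root)"
fi

# The host's own hostname should be untouched either way.
after=$(uname -n)
echo "host before: $before, host after: $after"
```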


Next, we need to give our containerized process its own root filesystem so it cannot access the host root. We’ll use the --root option to assign a directory as the new root. Before we do that, note that a directory can only serve as a root directory if it contains a root filesystem, with directories such as /bin and /proc. So I am going to download the Alpine minirootfs to quickly create a minimal root filesystem in my new directory. You can also export root filesystems from existing Docker containers if you want.
ubuntu@host:~$ mkdir container_root
ubuntu@host:~$ cd container_root/
ubuntu@host:~/container_root$ curl -o alpine.tar.gz
ubuntu@host:~/container_root$ tar xvf alpine.tar.gz
ubuntu@host:~/container_root$ rm alpine.tar.gz
ubuntu@host:~/container_root$ ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
Now we can use the --root option with the unshare command:
ubuntu@host:/$ sudo unshare --uts \
    --root=/home/ubuntu/container_root \
    sh
/$ pwd
/
/$ ls /
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
The root directory of the containerized process is no longer the root directory of our host system. This also means we can’t use the commands from the host machine /bin which are not present in the new root filesystem.
ubuntu@host:/$ sudo unshare --uts \
    --root=/home/ubuntu/container_root \
    bash
chroot: failed to run command ‘bash’: No such file or directory
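Before handing a directory to --root, it can be useful to check from the host whether the command you plan to run exists inside it. The sketch below is one way to do that; rootfs_has is a helper name invented here, and the demo runs against a temporary directory standing in for container_root.

```shell
#!/bin/sh
# Return 0 if a command named $2 exists under the root filesystem at $1.
# rootfs_has is a hypothetical helper written for this example.
rootfs_has() {
    for dir in bin sbin usr/bin usr/sbin; do
        # Accept symlinks too: in a busybox-based rootfs, bin/sh is a symlink
        # whose target may not resolve when viewed from the host.
        if [ -e "$1/$dir/$2" ] || [ -L "$1/$dir/$2" ]; then
            return 0
        fi
    done
    return 1
}

# Demo on a throwaway directory that contains only bin/sh.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bin"
: > "$demo_root/bin/sh"

if rootfs_has "$demo_root" sh; then sh_found=yes; else sh_found=no; fi
if rootfs_has "$demo_root" bash; then bash_found=yes; else bash_found=no; fi
echo "sh: $sh_found, bash: $bash_found"

rm -rf "$demo_root"
```

Against the real directory, a call such as rootfs_has /home/ubuntu/container_root bash would be expected to fail in the same way the chroot error above does.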


Now, if we run the ps command inside the container, we can’t see any processes at all. That’s because ps works by reading the /proc pseudo-filesystem, which is not mounted in our new root yet. Every process has its own directory within /proc. You can run ls /proc on your Linux system to see what it looks like. The proc(5) manual page describes the /proc filesystem in detail.
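To make the role of /proc concrete, here is a rough imitation of what ps does, meant to be run on the host where /proc is mounted. It is a simplified sketch, not how ps is actually implemented; list_pids is a name made up for this example.

```shell
#!/bin/sh
# List the numeric entries of /proc: each one is the PID of a visible process.
list_pids() {
    for entry in /proc/[0-9]*; do
        [ -d "$entry" ] && echo "${entry#/proc/}"
    done
}

for pid in $(list_pids); do
    # /proc/<pid>/comm holds the command name; skip processes that have
    # already exited by the time we try to read it.
    name=$(cat "/proc/$pid/comm" 2>/dev/null) || continue
    echo "$pid $name"
done
```

With an empty or unmounted /proc, the loop finds no numeric directories and prints nothing, which is the situation ps is in inside our new root.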
You can mount /proc using the mount command or the --mount-proc flag.
ubuntu@host:/$ sudo unshare --uts \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:01 {systemd} /sbin/init
    2 root      0:00 [kthreadd]
    3 root      0:00 [rcu_gp]
    4 root      0:00 [rcu_par_gp]
    6 root      0:00 [kworker/0:0H-kb]
... <truncated>
Now we can see all the processes running on the host, which is not right. Containers should not be able to access the processes of the host machine. To isolate the host processes, we use the --pid flag with unshare to get a new PID namespace. Along with that, we also need the --fork flag, which runs the specified program as a child process of unshare rather than running it directly. This matters because a process that unshares its PID namespace does not move into the new namespace itself; only its children are created there.
ubuntu@host:/$ sudo unshare --uts \
    --pid --fork \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root sh
/$ ps
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    2 root      0:00 ps
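A quick way to confirm that the forked child is the first process of the new PID namespace is to ask it for its own PID. This is a sketch that needs root to create the namespace and falls back to a placeholder value otherwise.

```shell
#!/bin/sh
# $$ is expanded by the inner shell, so it reports the PID as seen from
# inside the new PID namespace, where the first process gets PID 1.
pid_inside=$(unshare --pid --fork sh -c 'echo $$' 2>/dev/null) || pid_inside="unavailable"

echo "PID seen inside the new namespace: $pid_inside"
echo "PID of this shell on the host:     $$"
```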


Now we have our processes isolated. The next namespace we need to look into is the mount namespace. We can do that using the --mount flag in the unshare command. This isolation is handy in ensuring host directories mounted into containers are not visible from other containers.
ubuntu@host:~$ sudo unshare --uts \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh

Networking interfaces

Containers have their own networking interface and routing tables. This requires the process to have a separate network namespace which can be set using the --net flag.
ubuntu@host:~$ sudo unshare --uts \
    --net \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/$ ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
When we create our network namespace, we only have a loopback interface. The container cannot communicate if it only has a loopback interface. We need to establish a virtual Ethernet interface connecting the container network namespace to the default one.
While keeping the containerized process running in one terminal, open another with root privileges and create a virtual Ethernet interface on your host machine. You’ll need to know your container’s PID for that, which we can find with the lsns command.
ubuntu@host:/$ sudo lsns -t net
        NS TYPE NPROCS   PID USER    NETNSID NSFS COMMAND
4026531992 net      93     1 root unassigned      /sbin/init
4026532193 net       2  2241 root unassigned      unshare --uts --net --pid --fork --mount --mo
ubuntu@host:/$ sudo ip link add ve1 netns 2241 type veth peer name ve2 netns 1
Then we need to bring the link up. On the host machine:
ubuntu@host:/$ sudo ip link set ve2 up
In the container process (we are already root there, so sudo is not needed):
/$ ip link set ve1 up
Now that the link is up, we assign IP addresses. On the host machine, run:
ubuntu@host:/$ sudo ip addr add dev ve2
In the container process, run:
/$ ip addr add dev ve1
Now you should be able to ping the host from the container and vice versa, allowing your container to communicate with other processes.
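The whole sequence can also be driven from the host side in one script, using nsenter to run the container-side commands inside the container’s network namespace. The sketch below is illustrative only: the PID is the one from the lsns output above, the two addresses are placeholder values chosen for this example and do not come from the article, and the script only prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# Hypothetical values for illustration; adjust them to your own setup.
CONTAINER_PID=${CONTAINER_PID:-2241}            # PID reported by lsns
HOST_ADDR=${HOST_ADDR:-192.168.100.1/24}        # placeholder address (assumption)
CONTAINER_ADDR=${CONTAINER_ADDR:-192.168.100.2/24}  # placeholder address (assumption)
DRY_RUN=${DRY_RUN:-1}                           # 1 = print only, 0 = execute (needs root)

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the veth pair: ve1 lands in the container, ve2 stays on the host.
run ip link add ve1 netns "$CONTAINER_PID" type veth peer name ve2 netns 1

# Host side of the link.
run ip link set ve2 up
run ip addr add "$HOST_ADDR" dev ve2

# Container side, entered through the network namespace of the container PID.
run nsenter --target "$CONTAINER_PID" --net ip link set ve1 up
run nsenter --target "$CONTAINER_PID" --net ip addr add "$CONTAINER_ADDR" dev ve1
```

Running the container-side commands through nsenter uses the host’s ip binary, so this works even if the container’s root filesystem does not ship one.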

Interprocess communication

Processes can communicate with each other through shared memory segments. For that, they need to be part of the same IPC namespace. We generally wouldn’t want our containers to be able to access one another’s shared memory, so we give the container its own IPC namespace with the --ipc flag.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
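One way to convince yourself that --ipc really separates the two sides is to compare the IPC namespace identifier on the host with the one seen by a process started under unshare --ipc. This is a small sketch that needs root for the second lookup and substitutes a placeholder when it is not permitted.

```shell
#!/bin/sh
# Identifier of the IPC namespace this shell lives in.
host_ipc=$(readlink /proc/self/ns/ipc 2>/dev/null) || host_ipc="unavailable"

# Identifier seen by a process started in a fresh IPC namespace.
new_ipc=$(unshare --ipc readlink /proc/self/ns/ipc 2>/dev/null) || new_ipc="unavailable"

echo "host IPC namespace: $host_ipc"
echo "new IPC namespace:  $new_ipc"
```

When both lookups succeed, the two identifiers differ, so shared memory segments created on one side cannot be reached from the other.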


You can use the --cgroup flag to create a new cgroup namespace, ensuring that your container process cannot see any higher cgroup configuration.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc --cgroup \
    --pid --fork \
    --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
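The effect of a cgroup namespace shows up in /proc/self/cgroup: inside the new namespace, paths are reported relative to the cgroup the process was in when the namespace was created. The sketch below prints both views side by side; it needs root for the second one and uses a placeholder when that is not available.

```shell
#!/bin/sh
# cgroup paths as seen from the host's cgroup namespace.
host_view=$(cat /proc/self/cgroup 2>/dev/null)

# The same file read from inside a new cgroup namespace.
ns_view=$(unshare --cgroup cat /proc/self/cgroup 2>/dev/null) || ns_view=""
[ -n "$ns_view" ] || ns_view="unavailable"

echo "host view:"
echo "$host_view"
echo "view inside the new cgroup namespace:"
echo "$ns_view"
```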


Currently, the user in the containerized process is the root user because we use sudo.
ubuntu@host:~$ sudo unshare --uts \
    --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
/ $ id
uid=0(root) gid=0(root) groups=0(root)
To prevent this, we create a separate user namespace for the container process with the help of the --user flag.
ubuntu@host:~$ sudo unshare --user \
    --uts --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
~ $ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
Now the user assigned to the container process is “nobody”. We can map this UID to a non-root user on the host machine by writing to /proc/<pid>/uid_map, where <pid> is your container process PID. When you run unshare with the --user flag, the user namespace is created first, and the process automatically holds a full set of capabilities inside it. This means unshare can create the other namespaces without sudo, allowing us to run containers without any root privileges (rootless containers).
ubuntu@host:~$ unshare --uts \
    --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
unshare: unshare failed: Operation not permitted
ubuntu@host:~$ unshare --user \
    --uts --net --ipc \
    --pid --fork \
    --cgroup --mount \
    --mount-proc=proc \
    --root=/home/ubuntu/container_root \
    sh
~ $
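To see what a UID mapping looks like, you can read the uid_map file of any process, and unshare can also write a simple mapping for you with its --map-root-user flag. The sketch below shows both; it assumes unprivileged user namespaces are enabled on the machine and falls back to a placeholder value if they are not.

```shell
#!/bin/sh
# Each line of uid_map reads: first-ID-inside  first-ID-outside  range-length.
if [ -r /proc/self/uid_map ]; then
    echo "uid_map of this shell:"
    cat /proc/self/uid_map
fi

# --map-root-user maps the calling user to UID 0 inside the new user
# namespace, so no manual write to uid_map is needed for this simple case.
mapped_uid=$(unshare --user --map-root-user id -u 2>/dev/null) || mapped_uid="unavailable"
echo "UID inside the new user namespace: $mapped_uid"
```

With the mapping in place, id inside the namespace reports UID 0 instead of nobody, while the process still runs with the unprivileged user’s identity on the host.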
And here we have our container ready!!
All your container runtime tools are wrappers around these built-in kernel features, providing easier and more flexible configuration.

More resources