A dive into the basics and inner workings of Kubernetes

This is the first of a three-article series covering Kubernetes architecture basics, monitoring fundamentals (focused on metrics and resource usage), and a comparison of open-source monitoring solutions along with the requirements of production environments.
Each article is based on a part of my master's thesis (Comparison of open-source monitoring solutions for a Kubernetes cluster in a production environment). I will try to update references to particular chapters with hyperlinks to ease navigation, and I will put the bibliography in a separate article for clarity.
“Understanding Kubernetes” gives you a basic idea of what Kubernetes is and how it works. Besides serving as the theoretical background of the thesis, it is aimed at developers who are going to interact with containers and Kubernetes and wish to learn how to use them in their everyday work.
In mid-2014, Google announced the development of Kubernetes for the first time [2]. It was heavily influenced by Google's Borg system, a cluster management system that "admits, schedules, starts, restarts, and monitors the full range of applications that Google runs" [3]. Initially, it was aimed at orchestrating containers and working with Docker.
On 10 July 2015, version 1.0 was released, and along with that release a partnership with the Linux Foundation was established, forming today's well-known Cloud Native Computing Foundation (CNCF) [4]. Since then, Kubernetes has evolved and gained a lot of new features and extensions. The current version as of 15.03.2021 is v1.20, in which one of the major changes is the deprecation of Docker as a container runtime in favour of plain containerd, which uses the Container Runtime Interface (CRI) created for Kubernetes [5]. But before explaining the Kubernetes ecosystem itself, basics like containers, container engines and container orchestrators should be covered.
Containers as we see them today are, in short, a concept of packaging a target application along with its dependencies and running it somewhere out of the box. The said "package" is in fact an image of an operating system (except for scratch containers), and "somewhere" is a container engine. So, if one says "container", we should think of an instance of an image running on a container engine.
Container image
On 22 June 2015, the Open Container Initiative (OCI) was launched by Docker, CoreOS and other leaders in the container industry [6]. OCI is a lightweight, open governance structure (project), formed under the Linux Foundation, to create open industry standards around container formats and runtimes. Its specification defines how an OCI image should be created. One of the core contributors to the container format and runtime is Docker, so the build process will be explained in accordance with Docker's build engine mechanics and capabilities.
A container image consists of layers which are applied one on top of another. Figures 2.1 and 2.2 show example layers used to create an image. Image build instructions are stored in a Dockerfile, with each instruction representing a layer. Layers are cached during the build, and a SHA checksum is calculated for each of them. Thanks to caching, rebuild times are reduced dramatically, as only the changed layers are rebuilt, together with every layer that follows them until the end of the Dockerfile.
Build instructions include setting the base image, adding data (local and remote), running commands for data configuration and setting container-specific metadata (e.g., exposed ports, the entrypoint, etc.). With that in mind, a question about the idempotency of build instructions comes up. The problem is: if a remote package is downloaded from the network as a layer, would it be redownloaded during a rebuild, or would a cached layer be used? The answer is that it will not be rebuilt unless explicitly forced with the proper flag to the docker build command. That forces developers to pin versions of 3rd party packages where possible, to maintain idempotency in case a full rebuild of the image is needed.
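To make this concrete, here is a minimal sketch of such a build, assuming Docker as the build engine; the package version, paths and image names are illustrative, not taken from the thesis:

```
# Write a minimal Dockerfile; every instruction becomes a cached layer
cat > Dockerfile <<'EOF'
# Base image layer
FROM ubuntu:20.04
# Remote 3rd party package with a pinned version (version string is illustrative)
RUN apt-get update && apt-get install -y curl=7.68.0-1ubuntu2 && rm -rf /var/lib/apt/lists/*
# Local data layer
COPY ./app /opt/app
# Container-specific metadata
EXPOSE 8080
ENTRYPOINT ["/opt/app/start.sh"]
EOF

# A normal build reuses cached layers where possible
docker build -t my-app:1.0 .
# --no-cache forces every layer to be rebuilt, e.g. to redownload remote packages
docker build --no-cache -t my-app:1.0 .
```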
Container
After a container image is built and ready to run, it is deployed to a container engine and spins up on top of the existing host OS without hardware virtualization. In fact, it calls the host OS kernel to mimic the underlying OS within the container. It is important to note that containers are not virtual machines; a comparison of both can be found in chapter 2.6.
Container engines use certain Linux kernel features to provide a container runtime, which includes the use of cgroups (control groups) and namespaces (see Figure 2.4). Control groups allow specifying limits on, and measuring, the total resources used by a group of processes. These can be, for example, CPU, memory, network or IO quotas. Namespaces, on the other hand, are used to narrow the visibility that a group of processes has of the process tree, network interfaces, user IDs or filesystem mounts [7].
The combination of cgroups and namespaces for a given container produces a fully isolated environment running within a host Linux system. The existence and use of these kernel features largely explains why containers are so good at both scaling and security. They have access to the same pool of resources as the other containers running on the system, but they have programmatically set resource quotas and visibility limitations they cannot cross.
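As a rough sketch of how an engine exposes these kernel features, assuming Docker as the engine (the container name is hypothetical), the flags below translate into cgroup limits that can be read back from the engine:

```
# Start a container with cgroup-backed quotas: half a CPU and 256 MiB of memory
docker run -d --name limited --cpus 0.5 --memory 256m nginx:1.19

# The engine created a cgroup for the container; namespaces isolate its view
# of processes, network interfaces and mounts
docker inspect --format '{{.HostConfig.NanoCpus}} {{.HostConfig.Memory}}' limited
```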
The lifecycle of a container is controlled by a container engine (container runtime) that will be explained in the next section.
A container runtime (a.k.a. container engine) is where containers are deployed to run (see Figure 2.5). Several common container runtimes are:
- containerd [8],
- CRI-O [9],
- Docker [10].
Their job is to control the lifecycle of containers at the user's will, so that they can be run, stopped, resumed or destroyed. They also provide an interface to interact with containers in several ways (see the example commands after this list):
- execute a command in the container,
- check the stdout of the container,
- get metrics (e.g. CPU usage, memory usage & limit),
- check container configuration details (e.g. assigned IP address).
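Here is a minimal sketch of those interactions, assuming Docker as the runtime CLI and a hypothetical container named my-app:

```
# Execute a command in the container
docker exec -it my-app sh
# Check the stdout (and stderr) of the container
docker logs my-app
# Get metrics: CPU usage, memory usage and limit
docker stats --no-stream my-app
# Check configuration details, e.g. the assigned IP address
docker inspect --format '{{.NetworkSettings.IPAddress}}' my-app
```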
As seen in Figure 2.6, the container runtime, besides controlling the container lifecycle, serves as a proxy between the operating systems inside the containers and the host operating system.
Further reading about containers and container engines can be found on the Red Hat blog [11].
Container orchestrators sit one layer higher than container engines in managing resources and lifecycles (see Figure 2.7). Whereas a container engine enables running containers on a single node, an orchestrator allows synchronized scheduling of containers on multiple nodes, regardless of their location and node host OS configuration.
Currently available container orchestrators enable fully multi-cloud, multi-region and even multi-architecture setups (e.g. a few local arm64 nodes on Raspberry Pis and the rest on amd64 virtualized EC2 nodes in the AWS cloud, which makes a multi-region, hybrid, multi-arch setup). In short, orchestrators manage container runtimes and provide extensions in many fields, like networking enhancements using the CNI (Container Network Interface).
Multi-cloud design (see an example in Figure 2.8) is an architecture where multiple cloud providers are involved, for example a cluster with a mix of Amazon Web Services (AWS) EC2 instances and Google Cloud Platform (GCP) Google Compute Engine (GCE) instances. This might be used for various reasons, like price/value differences, or legacy systems deployed on one of the clouds while new products are deployed on the other.
A multi-region setup is when multiple k8s clusters work together to achieve a single goal in multiple regions. For example (see Figure 2.9), one cluster is deployed in Frankfurt and a second one in Ireland; both run the same application, which connects to the same database (DB) in Frankfurt. This way, clients in both regions get reasonable latencies and therefore a better user experience.
A multi-architecture setup is when a single Kubernetes cluster is assembled from nodes of a few different hardware architectures. One of the use cases is hardware efficiency while still being able to deploy architecture-bound software, e.g. a cluster made of standard x86 amd64 nodes that run only amd64-compatible apps, with the rest of the applications deployed on highly energy-efficient ARM nodes.
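A hedged sketch of how such architecture pinning is typically expressed: every node carries the standard kubernetes.io/arch label, so an architecture-bound workload can be constrained with a nodeSelector (the pod and image names are illustrative):

```
# Schedule an amd64-only application strictly on amd64 nodes
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: amd64-only-app
spec:
  nodeSelector:
    kubernetes.io/arch: amd64
  containers:
    - name: app
      image: registry.example.com/amd64-only-app:1.0
EOF
```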
Resource decomposition
One of the most revolutionary factors behind the success of container orchestrators in the industry is the way they manage resources. When a cluster of nodes to be orchestrated is assembled, its resources are virtually decomposed (see Figure 2.11), which means the nodes are no longer seen as separate spaces where workloads can be deployed, but rather as a single pool of resources available for use.
Therefore, if we assemble a cluster from 3 machines, each having 4 vCPUs and 16 GB of RAM, we will see a pool of resources containing 12 vCPUs and 48 GB of RAM. However, there are still physical boundaries that have to be taken into account. Container orchestrators were created to support microservice architectures, so a single microservice is relatively small compared to a monolithic system. But if, in the above example, we were to deploy a workload that needs 6 vCPUs and 8 GB of memory allocated, the container orchestrator would not be able to schedule it. That is because physical resource boundaries cannot be crossed, multi-node CPU computing would be inefficient for systems not designed for it, and, since containerization is used, the workload should stay universal. So orchestrators take the responsibility for efficient workload allocation away from cluster administrators, but workload sizing should still be taken into account when designing the cluster configuration.
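To illustrate the scheduling limit from the example above, here is a minimal sketch (the pod and image names are hypothetical) of a workload requesting 6 vCPUs and 8 GB of memory; the scheduler would leave it Pending, because no single 4-vCPU node can satisfy the request, even though the cluster-wide pool holds 12 vCPUs:

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oversized-workload
spec:
  containers:
    - name: app
      image: registry.example.com/big-app:1.0
      resources:
        requests:
          cpu: "6"
          memory: 8Gi
EOF

# The scheduling failure shows up in the pod's events
kubectl describe pod oversized-workload
```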
The container orchestrator attracting the most attention nowadays is Kubernetes (used by 78% of companies that took part in the CNCF survey) [13]. It is a sophisticated, still continuously developed, tool and engine for orchestrating both homogeneous and heterogeneous environments consisting of containers and 3rd party custom operators. A Kubernetes cluster consists of a control-plane and a data-plane (see Figure 2.12). The core functionalities are provided by the components of the control-plane (master nodes):
- etcd database,
- scheduler,
- API server,
- controller manager.
Every node also runs a:
- kubelet,
- kube-proxy,
- container-runtime.
Nodes that have only these components form the data-plane and are called worker nodes (see Figure 2.12). All of the components above will be explained in the next sections.
Kubernetes state is stored as objects in the database. Objects can easily be written as manifests in YAML. They can also be viewed in various other formats, e.g. JSON, using the k8s CLI called kubectl. The most important object to explain (for the needs of this thesis) is the pod. A pod is an object that configures a container (or a few of them) to run in the cluster with the proper labels, security and network configuration.
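For reference, a minimal pod manifest could look like the sketch below (the names, labels and image are illustrative); kubectl can then show the stored object back in YAML or JSON:

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: hello
  labels:
    app: hello
spec:
  containers:
    - name: web
      image: nginx:1.19
      ports:
        - containerPort: 80
EOF

# View the same object in different formats through the kubectl CLI
kubectl get pod hello -o yaml
kubectl get pod hello -o json
```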
Kubernetes is currently used in two schemes: self-managed (called standard) and provider-managed (called managed). They differ in how the control-plane is provided to the end-user and will be explained in the following sections.
The control-plane
The control-plane’s responsibility is to manage the whole cluster with the use of the internal components.
The etcd database is where the cluster state is stored. It can be deployed in both high-availability (HA) and single-node fashion. Every change to the cluster has a reference in the etcd database, so that in case of a failure of any node, the state can be reconciled directly from the stored data.
The scheduler's responsibility is to schedule Pods in the cluster. It watches for newly created Pods and, based on many factors, decides on which node each of them should be scheduled. These factors include: "individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines" [15].
The controller manager is in fact a bundle of multiple controllers that watch k8s objects and maintain the cluster state. Some of these controllers are the node controller (manages node connectivity), the job controller (one-off tasks run as Pods) and the endpoints controller (joins Pods and Service objects).
The API server serves as a proxy between control-plane components, but also provides an interface for the user of the cluster. All manifests deployed to the cluster go through the API server, where they are parsed and checked for errors before being inserted into the etcd database. Observability of the objects running in the cluster is also provided by the API server.
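A small, hedged illustration of that role (assuming a local manifest file named pod.yaml): a server-side dry run sends the manifest to the API server, which parses and validates it without persisting anything to etcd, and the same server answers health and discovery queries:

```
# Parsed and validated by the API server, but never written to etcd
kubectl apply --dry-run=server -f pod.yaml

# Health and discovery endpoints served by the API server
kubectl get --raw /healthz
kubectl api-resources
```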
The components described above form the control-plane, but they are not the only components running on these nodes. There are also the kubelet, kube-proxy and container runtime, which run on every node in the cluster and will be explained in the next section.
The data-plane
The data-plane's responsibility is to maintain the connection with the control-plane, provide low-level container management capabilities and offer observability for the currently running workloads.
The kubelet is an agent that runs on each node in the cluster. "It makes sure that containers are running in a Pod" [15]. However, it does not manage containers that were not created by Kubernetes.
Kube-proxy is a networking component that maintains network rules on nodes and allows for network sessions inside and outside of the cluster.
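For context, a sketch of the kind of object kube-proxy acts on: a ClusterIP Service, for which kube-proxy maintains the rules that forward traffic to the backing pods (the name and selector are illustrative and match the earlier pod sketch):

```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  selector:
    app: hello
  ports:
    - port: 80
      targetPort: 80
EOF
```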
The container-runtime is also a core component of the data plane that is available on every node.
Managed Kubernetes
"The big three" cloud providers, AWS (Amazon Web Services), GCP (Google Cloud Platform) and Azure (Microsoft), provide a service called Managed Kubernetes. Each has its own name: EKS on AWS, GKE on GCP and AKS on Azure:
- Amazon’s Elastic Kubernetes Service (EKS)
- Microsoft’s Azure Kubernetes Service (AKS)
- Google’s Kubernetes Engine (GKE).
The difference from the standard way of deployment and usage is that the control-plane is in this case managed by the cloud provider. The user can still configure the cluster using manifests, but all of the responsibility for control-plane health and HA lies with the service provider.
At the same time, worker node management is still the cluster user's responsibility. Some cloud providers also offer automated control-plane updates with a few manual configuration steps (EKS), and others offer an even fully automated cluster upgrade process that includes the worker nodes (GKE) [16].
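To give a feel for the workflow, a hedged sketch of creating managed clusters from the CLI (cluster names and regions are illustrative, and exact flags may differ between tool versions):

```
# EKS via eksctl: AWS provisions and manages the control-plane
eksctl create cluster --name demo --region eu-central-1 --nodes 3

# GKE via gcloud: Google manages the control-plane and can auto-upgrade nodes
gcloud container clusters create demo --region europe-west1 --num-nodes 3
```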
Considering the features of the container orchestrator and the enormous amount of data flowing through a Kubernetes cluster, the number of areas that could be monitored is huge and includes:
- node resource utilization,
- network policies (e.g. deny hits),
- network traffic utilization,
- network traffic types,
- application metrics,
- and many more…
For the sake of this thesis, only node resource utilization (CPU and RAM) will be considered in depth. There is a wide range of monitoring solutions available on the market. Many of them offer a complete solution for Kubernetes cluster monitoring, which includes not only metrics gathering and analysis but also logging, parsing and alerting. An example of such a complete solution is the ELK Stack with Metricbeat by Elastic [17] [18].
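As a quick, hedged example of the built-in view of exactly these metrics (it assumes the metrics-server add-on is installed in the cluster):

```
# CPU and memory usage per node, as reported by the metrics pipeline
kubectl top nodes

# The same at pod granularity, across all namespaces
kubectl top pods --all-namespaces
```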
It is important to note the differences between Linux containers and virtual machines (VMs). Both are packaged computing environments that isolate various applications and their dependencies from the rest of the system. Core differences are in terms of scale and portability.
Virtual machines are designed to be big and to contain whole monolithic systems. Their resources are specified upfront, and they are harder to move around, mostly because of hardware virtualization, which implies, for example, a full system boot process.
Containers are designed to be small and to contain a single app and its dependencies. It is also easy to move them between multiple environments because of their lightweight design and the shared operating system.
In contrast to the cgroups and namespaces separation design of containers, VMs use software called a hypervisor to carve resources out of the physical machine and assign them to a particular virtual machine (see Figure 2.13). This is the main pain point, because it enforces hard resource pinning and favours tight coupling, whereas containers support a more loosely-coupled and highly-cohesive architecture, as seen in the example in Figure 2.14.
Therefore, virtualization suits older monolithic or, at most, service-oriented approaches, while containers are a better fit for microservice architectures.
All links and sources in the article