Ambassador post by Zou Nengren
Two months ago, we were thrilled to share insights in the article “Best Practices for Migrating VM Clusters to KubeVirt 1.0.” As previously mentioned, we selected AlmaLinux and Kubernetes 1.28 as the foundation for virtualization, with cgroup v2 for resource isolation. Before moving to production, we encountered additional challenges related to Kubernetes, containerd, and KubeVirt itself. In this second article, we share the practical experience gained while preparing KubeVirt for a production deployment.
Latest Developments
KubeVirt containerizes the trusted virtualization layer of QEMU and libvirt, enabling the management of VMs as standard Kubernetes resources. This approach offers users a more flexible, scalable, and contemporary solution for virtual machine management. As the project progresses, we’ve identified specific misconceptions, configuration errors, and opportunities to enhance KubeVirt functionality, especially in the context of utilizing Kubernetes 1.28 and containerd. The details are outlined below:
Kubernetes
kubelet read-only port
To address security concerns, we have taken measures to mitigate potential malicious attacks on pods and containers. Specifically, in clusters running Kubernetes 1.26 or later we no longer open the kubelet's insecure read-only port 10255 by default; instead, only the authenticated port 10250 is opened and used by the kubelet.
service account token expiration
To enhance data security, Kubernetes 1.21 enables the BoundServiceAccountTokenVolume feature by default. This feature gives service account tokens a bounded validity period, rotates them before expiration, and invalidates them once the associated pod is deleted. The kubelet refreshes the projected token on disk, and applications built with client-go version 11.0.0 or later, or 0.15.0 or later, automatically reload the refreshed token.
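As a minimal sketch of how a bound token reaches a workload, the projected serviceAccountToken volume below requests a token with an explicit lifetime; the pod name, audience, and image are placeholders chosen for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: token-demo            # placeholder name
spec:
  serviceAccountName: default
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
    volumeMounts:
    - mountPath: /var/run/secrets/tokens
      name: app-token
  volumes:
  - name: app-token
    projected:
      sources:
      - serviceAccountToken:
          path: app-token
          expirationSeconds: 3600   # kubelet rotates the token before it expires
          audience: api             # hypothetical audience
```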
securing controller-manager and scheduler metrics
Secure serving on port 10257 is now enabled for the kube-controller-manager (configurable via --secure-port). Delegated authentication and authorization are configured with the same flags as for aggregated API servers; without such configuration, the secure port only allows access to /healthz (#64149).
Similarly, the kube-scheduler gained the secure port 10259 (enabled by default), and the old insecure port 10251 is deprecated. Without further flags, self-signed certificates are created in memory on startup (#69663).
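The snippet below is an illustrative way to verify that the secure ports answer with delegated authentication; the metrics-reader service account and its RBAC binding are assumptions, and -k is needed because the certificates are self-signed:

```bash
# run on a control-plane node; assumes a "metrics-reader" service account bound
# to a role that allows GET on the /metrics non-resource URL
TOKEN=$(kubectl -n kube-system create token metrics-reader)
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10257/metrics | head   # kube-controller-manager
curl -sk -H "Authorization: Bearer ${TOKEN}" https://127.0.0.1:10259/metrics | head   # kube-scheduler
```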
containerd
private registry
Modify your config.toml file (usually located at /etc/containerd/config.toml) as shown below:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"
```
In the containerd registry configuration, a registry host namespace is the path of the hosts.toml file, named after the registry host name or IP address plus an optional port. When pulling an image, the reference typically has the following format:

```
pull [registry_host_name|IP address][:port][/v2][/org_path]<image_name>[:tag|@DIGEST]
```

The registry host namespace part is [registry_host_name|IP address][:port]. For example, the directory structure for docker.io looks like this:
```
$ tree /etc/containerd/certs.d
/etc/containerd/certs.d
└── docker.io
    └── hosts.toml
```
Alternatively, you can use the _default registry host namespace as a fallback if no other namespace matches.
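A hosts.toml for docker.io might look like the sketch below; the internal mirror URL is an assumption, and only the pull and resolve capabilities are delegated to it:

```toml
# /etc/containerd/certs.d/docker.io/hosts.toml
server = "https://docker.io"

[host."https://registry.example.internal"]   # hypothetical private mirror
  capabilities = ["pull", "resolve"]
  skip_verify = false
  # ca = "/etc/containerd/certs.d/docker.io/ca.crt"   # uncomment for a private CA
```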
systemd cgroup
While containerd and Kubernetes default to using the legacy cgroupfs driver for managing cgroups, it is recommended to utilize the systemd driver on systemd-based hosts to adhere to the “single-writer” rule of cgroups.
To configure containerd to use the systemd driver, add the following option in /etc/containerd/config.toml:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```
In addition to configuring containerd, you also need to set the cgroup driver to "systemd" in the KubeletConfiguration, which is typically found at /var/lib/kubelet/config.yaml:

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: "systemd"
```
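After restarting containerd and the kubelet, a quick sanity check along these lines can confirm the host is on cgroup v2 and that the runtime picked up the systemd driver (assuming crictl is configured to talk to containerd):

```bash
stat -fc %T /sys/fs/cgroup            # prints "cgroup2fs" on a cgroup v2 host
crictl info | grep -i systemdCgroup   # should report true after the change
```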
[community issue] containerd startup hangs when /etc is read-only
We observed that, after updating containerd from v1.6.21 to v1.6.22, the systemd service failed to start. Debugging revealed that containerd did not fully initialize (the “containerd successfully booted in …” message was missing) and never sent the sd notification READY=1 event.
migrating from docker to containerd
You have to configure the kubelet to use the containerd endpoint. The KubeletConfiguration is typically located at /var/lib/kubelet/config.yaml:

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
containerRuntimeEndpoint: "unix:///run/containerd/containerd.sock"
```
Because /var/lib/docker was mounted on a separate disk in our environment, switching to containerd also means relocating containerd's data root to that disk.
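A sketch of that relocation is shown below; the mount point /data/containerd is an assumption standing in for the dedicated data disk:

```toml
# /etc/containerd/config.toml
version = 2
root = "/data/containerd"     # image and snapshot data, analogous to /var/lib/docker
state = "/run/containerd"     # runtime state (default location)
```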
KubeVirt
- containerDisk data persistence
- The containerDisk feature provides the ability to store and distribute VM disks in the container image registry. containerDisks can be assigned to VMs in the disks section of the VirtualMachineInstance spec. containerDisks are ephemeral storage devices that can be assigned to any number of active VirtualMachineInstances. We persist their data locally through incremental backups.
- hostDisk support for the qcow2 format
- hostDisk support for hostPath capacity expansion
Storage Solution
VM Image Storage Solution
In KubeVirt, the original virtual machine image file is placed under the /disk path of a container base image and then pushed to the image registry for use in virtual machine creation.
Example: inject a local VirtualMachineInstance disk into a container image:

```bash
cat << END > Dockerfile
FROM scratch
ADD --chown=107:107 almalinux.qcow2 /disk/
END

docker build -t kubevirt/almalinux:latest .
```
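A VirtualMachineInstance can then boot from that image by referencing it as a containerDisk; the sketch below uses minimal, assumed resource values:

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: almalinux-vmi
spec:
  domain:
    devices:
      disks:
      - name: containerdisk
        disk:
          bus: virtio
    resources:
      requests:
        memory: 1Gi          # assumed sizing
  volumes:
  - name: containerdisk
    containerDisk:
      image: kubevirt/almalinux:latest
```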
When initiating a virtual machine, a VirtualMachineInstance (VMI) custom resource is created, recording the name of the specified virtual machine image. After the VMI is created, the virt-controller generates a corresponding virt-launcher pod for it. This pod comprises three containers: a compute container hosting the virt-launcher compute process, a container-disk container responsible for managing the storage of the virtual machine image, and a guest-console-log container. The imageName of the container-disk container corresponds to the virtual machine image name recorded in the VMI. Once the virt-launcher pod is created, the kubelet pulls the container-disk image and starts the container-disk container. During startup, container-disk continuously monitors the disk_0.sock file under its -copy-path, and the sock file is mapped to the path /var/run/kubevirt/container-disk/{vmi-uuid}/ on the host machine through a hostPath mount.
To retrieve the necessary information during virtual machine creation, the virt-handler pod uses HostPid, which makes the host machine's pids and mount details visible inside the virt-handler container. During virtual machine creation, virt-handler identifies the pid of the container-disk process by referencing the VMI's disk_0.sock file. It then determines the disk (device) number of the container-disk container's root filesystem from /proc/{pid}/mountinfo. By cross-referencing that number with the host machine's mount information, it pinpoints the physical location of the container-disk root filesystem. Finally, it constructs the path of the virtual machine image file (/disk/disk.qcow2), resolves the actual storage location (sourceFile) of the original virtual machine image on the host, and mounts the sourceFile to the targetFile for subsequent use as a backingFile during virtual machine creation.
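The commands below are an illustrative (not authoritative) way to follow that chain by hand on a node; the VMI uuid and pid are placeholders:

```bash
# the socket exposed to the host through hostPath
ls -l /var/run/kubevirt/container-disk/<vmi-uuid>/disk_0.sock
# find the container-disk process that serves the socket (the pid appears in the output)
ss -xlp | grep disk_0.sock
# the root mount of that process' mount namespace holds /disk/disk.qcow2
awk '$5 == "/"' /proc/<pid>/mountinfo
```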
Host Disk Storage
A hostDisk volume type provides the ability to create or use a disk image located somewhere on a node. It works similarly to a hostPath in Kubernetes and provides two usage types:
- DiskOrCreate: if a disk image does not exist at the given location, one is created
- Disk: a disk image must already exist at the given location
To use it, you need to enable the HostDisk feature gate.
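As a sketch, a hostDisk-backed volume in a VMI looks like the fragment below; the path and capacity are assumptions, and the HostDisk feature gate must already be enabled in the KubeVirt CR:

```yaml
spec:
  domain:
    devices:
      disks:
      - name: host-disk
        disk:
          bus: virtio
  volumes:
  - name: host-disk
    hostDisk:
      path: /data/disks/vm1-disk.img   # assumed location on the node
      type: DiskOrCreate               # create the image if it does not exist
      capacity: 10Gi
```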
Currently, the hostDisk feature has some limitations: expansion is only supported through Persistent Volume Claims (PVCs), and the disk format is limited to raw files.
Details regarding the above will be elaborated in the Feature Expansion section.
Feature Expansion
Support for static VM expansion
A synchronous interface is provided for expanding CPU, memory, and disks while the VM is stopped. The CPU hotplug feature was introduced in KubeVirt v1.0, making it possible to configure the VM workload to allow adding or removing virtual CPUs while the VM is running. Although the current version supports online expansion, we still opt for static expansion, primarily due to the temporary nature of our VMs. The challenge here is that when resources are insufficient, the VM will not start.
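As an illustration of static expansion, the fragment below shows the fields edited in the VM template while the VM is stopped; the values are examples and take effect on the next start:

```yaml
spec:
  template:
    spec:
      domain:
        cpu:
          cores: 4          # raised from 2 while the VM is stopped
        resources:
          requests:
            memory: 8Gi     # raised from 4Gi
```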
hostDisk support for qcow2 and online expansion
As noted above, the upstream hostDisk only supports expansion through PVCs and only accepts the raw disk format. To implement qcow2 support and online expansion, we made minor adjustments to all components.
cold migration
We refrain from employing live migration capabilities due to their complexity and several limitations in our specific scenario. Instead, with data locally persisted and VMs scheduled in a fixed manner, we utilize cold migration through the rsync command.
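A cold migration with rsync boils down to the sketch below; the VM name, hostDisk path, and target node are assumptions for illustration:

```bash
virtctl stop almalinux-vm                                            # shut the VM down cleanly
rsync -aHX --sparse /data/disks/almalinux-vm/ node02:/data/disks/almalinux-vm/
# re-pin the VM to the target node (for example via a nodeSelector), then start it
virtctl start almalinux-vm
```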
Others
In addition to the enhanced features mentioned earlier, we have integrated support for both static and dynamic addition or removal of host disks for virtual machines, password reset capabilities, pass-through of physical machine disks, and addressed various user requirements to deliver a more versatile and comprehensive usage experience.
Conclusion
KubeVirt simplifies running virtual machines on Kubernetes, making it as easy as managing containers. It provides a cloud-native approach to managing virtual machines. KubeVirt addresses the challenge of unifying the management of virtual machines and containers, effectively harnessing the strengths of both. However, there is still a long way to go in practice.
References
https://github.com/k8snetworkplumbingwg/multus-cni/issues/1132
https://segmentfault.com/a/1190000040926384/en
https://github.com/containerd/containerd/issues/9139
https://github.com/containerd/containerd/blob/main/docs/cri/config.md