
Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing #11845

Draft
wants to merge 15 commits into base: main

Conversation

alicefr
Member

@alicefr alicefr commented May 3, 2024

What this PR does

Before this PR:
Currently, containerdisk artifacts are located in a separate container from the compute container, and in order to make them available to virt-launcher, virt-handler needs to bind mount them into the compute container's filesystem.

This mechanism is highly complex, requires coordination between handler and launcher, and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.

After this PR:
This PR introduces a new mechanism for accessing containerdisks. Since the containerdisk container is deployed in the same pod as virt-launcher, it is possible to enable PID namespace sharing between the containers in the same pod. In this way, virt-launcher can access the containerdisk artifacts via the proc filesystem, e.g. /proc/<pid>/root/<path-to-artifacts>.

The coordination mechanism between launcher and handler is no longer needed. Therefore, we can further simplify the containerdisk binary. Additionally, we completely remove the C program and replace it with a simpler Go program which simply creates a pidfile.

In order to avoid any conflicts with upgrades and migrations, the new setup creates a symlink from the old artifact location to the location of the artifacts reachable via proc.

This mechanism was initially proposed by @rmohr in #10291.
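To illustrate the idea, here is a minimal, hedged Go sketch (not the PR's actual code; the PID discovery, paths, and function name are assumptions) of how virt-launcher could resolve an artifact through the shared PID namespace and link it to the legacy location:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkContainerDisk exposes a containerdisk artifact, reachable through the
// shared PID namespace, at the legacy path expected by the domain XML.
// pid is the containerdisk container's entrypoint PID as seen by
// virt-launcher; artifactPath and legacyPath are illustrative values.
func linkContainerDisk(pid int, artifactPath, legacyPath string) error {
	procPath := fmt.Sprintf("/proc/%d/root%s", pid, artifactPath)
	if _, err := os.Stat(procPath); err != nil {
		return fmt.Errorf("containerdisk not reachable via proc: %w", err)
	}
	if err := os.MkdirAll(filepath.Dir(legacyPath), 0o755); err != nil {
		return err
	}
	// The symlink keeps the old location valid, so upgrades and migrations
	// do not need a different disk path.
	return os.Symlink(procPath, legacyPath)
}

func main() {
	// Hypothetical values, for illustration only.
	if err := linkContainerDisk(42, "/disk/disk.img", "/var/run/kubevirt/container-disks/disk_0.img"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}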

Fixes #

Release note

Access container disks via the proc filesystem and PID namespace sharing

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 3, 2024
@alicefr alicefr marked this pull request as draft May 3, 2024 13:04
@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL labels May 3, 2024
@kubevirt-bot kubevirt-bot added kind/build-change Categorizes PRs as related to changing build files of virt-* components sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute labels May 3, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from alicefr. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alicefr
Member Author

alicefr commented May 3, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr alicefr changed the title Refacto containerdisks by accessing the artifact via proc and pid namespace sharing Refactor containerdisks by accessing the artifact via proc and pid namespace sharing May 3, 2024
@alicefr alicefr changed the title Refactor containerdisks by accessing the artifact via proc and pid namespace sharing Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing May 3, 2024
Member

@rmohr rmohr left a comment

Thank you Alice! This is exactly how I was hoping that it would play out!

This mechanism is highly complex, requires coordination between handler and launcher, and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.

Also note that it removes all ways to escape the container if we have a bug in the context of containerdisks (we had the one CVE which allowed that).

go_library(
name = "go_default_library",
srcs = ["container-disk.go"],
importpath = "kubevirt.io/kubevirt/cmd/container-disk-v3alpha",
Member

I think this is the final change which allows us to drop the alpha.

@fabiand
Member

fabiand commented May 5, 2024

@alicefr this is great work!

Member

@0xFelix 0xFelix left a comment

Great initiative!

hack/build-go.sh Outdated
Member

This file could really use a cleanup, in another PR though.

)

func main() {
noOp := flag.Bool("no-op", false, "program exits immediatly")
Member

Suggested change
noOp := flag.Bool("no-op", false, "program exits immediatly")
noOp := flag.Bool("no-op", false, "program exits immediately")

return "", fmt.Errorf("more than one file found in folder %s, only one disk is allowed", path)
}

return filepath.Join(fullPath, files[0].Name()), nil
Member

What if there are multiple files in a containerdisk? Is that a supportable use case? IIRC in CDI you can also choose which disk to import?

Member Author

Well, the old code only supported a single image per containerdisk. I haven't changed that.

Member

Keep these tests in another file? container-disk_test.go?

@@ -0,0 +1,27 @@
package main
Member

If I recall the other program was written in C to reduce memory consumption. What will memory consumption be now?

See #2844

Member

Good point. It may be better if we now import fewer packages, but it probably still requires a C application.

Member Author

Thanks for pointing this out. I didn't know that. I'll try to evaluate it

Member Author

We unfortunately still need a C binary; the size of the Go binary is still large, ~24M.

@alicefr
Member Author

alicefr commented May 7, 2024

As discussed with Roman offline, I'll probably also need to reintroduce the unmount logic. If we have old virt-launchers we still want to clean up their bind mounts. I'll do it in the next round.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2024
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 14, 2024
@alicefr
Member Author

alicefr commented Jun 14, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr
Member Author

alicefr commented Jun 19, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr
Member Author

alicefr commented Jun 20, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

Add unit tests for the new behavior and functions for containerdisks.

Signed-off-by: Alice Frosi <[email protected]>
In order to set the IOThreads affinity, the handler looks for the QEMU
process. However, the affinity is set when the VMI is not yet running,
and it may happen that QEMU hasn't started yet.

Signed-off-by: Alice Frosi <[email protected]>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2024
@alicefr
Member Author

alicefr commented Jun 26, 2024

/test pull-kubevirt-unit-test

@kubevirt-bot
Contributor

@alicefr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-unit-test-arm64 86778bb link false /test pull-kubevirt-unit-test-arm64
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.30-sig-compute 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute
pull-kubevirt-e2e-arm64 86778bb link false /test pull-kubevirt-e2e-arm64
pull-kubevirt-e2e-k8s-1.30-sig-storage 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-storage
pull-kubevirt-e2e-k8s-1.30-sig-compute-serial 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-serial
pull-kubevirt-unit-test f2078fa link true /test pull-kubevirt-unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@alicefr
Member Author

alicefr commented Jun 26, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

@alicefr
Member Author

alicefr commented Jun 26, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

Another alternative is searching for all the container disk processes by executable name. We are already using this method in virt-handler to find the QEMU process. This would probably require some refactoring there, but I think it is the cleanest solution.
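For illustration, a minimal Go sketch of finding the container-disk processes by executable name, walking /proc and resolving the exe symlink; the function name and the "container-disk" executable name are assumptions, not the PR's actual code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// findPidsByExe returns the PIDs of all processes whose executable matches
// exeName, by walking /proc and resolving the /proc/<pid>/exe symlink.
func findPidsByExe(exeName string) ([]int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		exe, err := os.Readlink(filepath.Join("/proc", e.Name(), "exe"))
		if err != nil {
			continue // process exited or not accessible
		}
		if filepath.Base(exe) == exeName {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := findPidsByExe("container-disk")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("container-disk processes:", pids)
}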

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2024
@kubevirt-bot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vladikr
Member

vladikr commented Aug 7, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

Previously we had the socket: we continuously monitored its existence and terminated when it was gone. We can now do the same with the pid file, and let virt-launcher's monitor remove it.
Would that be sufficient?
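To illustrate this suggestion, a minimal sketch (the pidfile path and polling interval are assumptions, not values from the PR) of a container-disk entrypoint that writes a pidfile and exits once virt-launcher removes it:

package main

import (
	"os"
	"strconv"
	"time"
)

// The container-disk process writes a pidfile and then simply waits for
// virt-launcher to remove it, exiting gracefully at that point so the pod
// can reach the Completed/Succeeded phase.
func main() {
	const pidFile = "/var/run/kubevirt-containerdisks/disk_0.pid"

	if err := os.WriteFile(pidFile, []byte(strconv.Itoa(os.Getpid())), 0o644); err != nil {
		os.Exit(1)
	}
	for {
		if _, err := os.Stat(pidFile); os.IsNotExist(err) {
			return // virt-launcher removed the pidfile: shut down gracefully
		}
		time.Sleep(time.Second)
	}
}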

@rmohr
Member

rmohr commented Aug 7, 2024

Previously we had the socket: we continuously monitored its existence and terminated when it was gone. We can now do the same with the pid file, and let virt-launcher's monitor remove it.
Would that be sufficient?

Sounds like this would work as well. No need for the socket trick anymore ...

@vladikr
Member

vladikr commented Aug 7, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Also, are we sure that sharing the pid namespace won't be an area of concern for some?
An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher.
Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

@alicefr
Member Author

alicefr commented Aug 8, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

I need to double check but I was using a relatively updated kubevirtci version.

Also, are we sure that sharing the pid namespace won't be an area of concern for some? An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher. Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

Not sure if sharing the same namespace inside the same pod is a real concern. Malicious software could now access the filesystem of all the containers in the same pod. However, those containers already share quite a lot of infrastructure.

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

@rmohr
Member

rmohr commented Aug 8, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Can there be a relation? I think here we mostly talk about the pid "namespace" and not a cgroup.

@vladikr
Member

vladikr commented Aug 12, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Can there be a relation? I think here we mostly talk about the pid "namespace" and not a cgroup.

I wasn't certain because there were issues when we were working on separating qemu CPUs into a separate container. However, I tested this solution and it worked as expected.

@vladikr
Member

vladikr commented Aug 12, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

I need to double check but I was using a relatively updated kubevirtci version.

Also, are we sure that sharing the pid namespace won't be an area of concern for some? An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher. Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

Not sure if sharing the same namespace inside the same pod is a real concern. Malicious software could now access the filesystem of all the containers in the same pod. However, those containers already share quite a lot of infrastructure.

You're right, it might not be an issue in our case. I just wanted to make sure that we know all the weak points before moving away from the default pod configuration.

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

@alicefr
Member Author

alicefr commented Aug 12, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

@vladikr
Member

vladikr commented Aug 12, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

I think the container disk process can open these images and communicate the FDs over to the compute container via the shared dir.

@alicefr
Member Author

alicefr commented Aug 13, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

I think the container disk process can open these images and communicate the FDs over to the compute container via the shared dir.

We could implement something like this, but you still require a socket to exchange the fd using ancillary data. This option doesn't require virt-handler or the shared proc namespace, but it sounds slightly more complicated.
Both options are doable, though.

Take into account that you still need to keep the old unmount logic because of upgrades, in order to clean up old virt-launchers.
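For illustration, a minimal sketch of the FD exchange over a Unix socket using SCM_RIGHTS ancillary data; the socket path, function name, and overall flow are assumptions, not code from this PR:

package fdpass

import (
	"net"
	"os"
	"syscall"
)

// SendDiskFD passes an already-open disk image file descriptor to another
// container over a Unix domain socket using SCM_RIGHTS ancillary data.
func SendDiskFD(socketPath, imagePath string) error {
	f, err := os.Open(imagePath) // read-only; containerdisk images stay immutable
	if err != nil {
		return err
	}
	defer f.Close()

	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	// The ancillary data carries the descriptor; the kernel duplicates it
	// into the receiving process. The payload just names the image.
	oob := syscall.UnixRights(int(f.Fd()))
	_, _, err = conn.(*net.UnixConn).WriteMsgUnix([]byte(imagePath), oob, nil)
	return err
}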

@alicefr
Member Author

alicefr commented Aug 13, 2024

@vladikr another disadvantage of using FDs over this option is migration. This solution keeps the same old path via the symlink, hence we don't need to adjust the migration XML. If you use the fd, you need to change the XML to use the fd.

UPDATE: probably a symlink to /proc/<pid>/fd/<fd> would also work for this option, avoiding the need to update the migration XML.

@xpivarc
Member

xpivarc commented Aug 13, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

@alicefr
Member Author

alicefr commented Aug 13, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

Do you know if containerdisks can be used "as they are" as OCI artifacts? Are OCI artifacts shareable on the same host like container images?

@akalenyu
Contributor

akalenyu commented Sep 2, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

Do you know if containerdisks can be used "as they are" as OCI artifacts? Are OCI artifacts shareable on the same host like container images?

Good question.. they are a v1.VolumeSource just like a PVC:
https://github.com/kubernetes/api/blob/release-1.31/core/v1/types.go#L200
but it looks like they are mounted ro,noexec:
https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/
https://kubernetes.io/docs/concepts/storage/volumes/#image

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  volumes:
  - name: oci-volume
    image:
      reference: "quay.io/containerdisks/fedora@sha256:21a6f3e628cde7ec7afc9aedf69423924322225d236e78a1f90cf6f4d980f9b3"
      pullPolicy: IfNotPresent
  containers:
  - name: my-container
    image: busybox
    volumeMounts:
    - mountPath: /data
      name: oci-volume

@@ -17,7 +17,7 @@ const (
 	QEMUIMGPath = "/usr/bin/qemu-img"
 )
 
-func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.DiskInfo, error) {
+func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.ImgInfo, error) {
Contributor

Think you eliminated the only usage of this func

@vladikr
Member

vladikr commented Sep 11, 2024

I think we should move forward with this PR - whenever you have time @alicefr
I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)

This could serve as a stepping stone toward using the OCI volumes.

@akalenyu
Contributor

I think we should move forward with this PR - whenever you have time @alicefr I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)

This could serve as a stepping stone toward using the OCI volumes.

I am a little worried about the readonly part in OCI volumes. Waiting for @alicefr's take
#11845 (comment)

@vladikr
Member

vladikr commented Sep 11, 2024

I think we should move forward with this PR - whenever you have time @alicefr I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)
This could serve as a stepping stone toward using the OCI volumes.

I am a little worried about the readonly part in OCI volumes. Waiting for @alicefr's take #11845 (comment)

Well, OCI Volumes are not part of this PR, so it's not a blocker.
In general, we don't open the disk.img for rw; it should always stay read-only anyway.

@rmohr
Member

rmohr commented Sep 11, 2024

In general, we don't open the disk.img for rw, it should always stay read only anyway.

Yeah just to provide a little bit more context: we mount it ro always, and then write via an overlay into a temporary directory (to avoid writing into container layers in general).
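For context, a hedged sketch of what creating such an ephemeral overlay with qemu-img could look like; the paths, helper name, and backing format are illustrative assumptions, not the actual KubeVirt implementation:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// createEphemeralOverlay creates a qcow2 overlay in a writable temporary
// directory on top of the read-only containerdisk image, so guest writes
// never touch the container layers.
func createEphemeralOverlay(backingImage, backingFormat, overlayPath string) error {
	cmd := exec.Command("qemu-img", "create",
		"-f", "qcow2",
		"-b", backingImage,
		"-F", backingFormat,
		overlayPath)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("qemu-img failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := createEphemeralOverlay(
		"/var/run/kubevirt/container-disks/disk_0.img", // read-only backing file (illustrative path)
		"raw",
		"/var/run/kubevirt-ephemeral-disks/disk_0.qcow2", // writable overlay (illustrative path)
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}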

Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/build-change Categorizes PRs as related to changing build files of virt-* components needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute size/XXL

9 participants