
Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing #11845

Draft
wants to merge 15 commits into base: main

Conversation

alicefr
Member

@alicefr alicefr commented May 3, 2024

What this PR does

Before this PR:
Currently, containerdisk artifacts are located in a separate container from the compute container, and in order to make them available to virt-launcher, virt-handler needs to bind mount them into the compute container's filesystem.

This mechanism is highly complex, requires coordination between handler and launcher, and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.

After this PR:
This PR introduces a new mechanism for accessing containerdisks. Since the containerdisk container is deployed in the same pod as virt-launcher, it is possible to enable PID namespace sharing between the containers in the same pod. In this way, virt-launcher can access the containerdisk artifacts via the proc filesystem, e.g. /proc/<pid>/root/<path-to-artifacts>.

The coordination mechanism between launcher and handler is no longer needed. Therefore, we can further simplify the containerdisk binary. Additionally, we completely remove the C program and replace it with a simpler Go program which simply creates a pidfile.

In order to avoid any conflicts with upgrades and migrations, the new setup creates a symlink from the old artifact location to the location of the artifacts reachable via proc.

This mechanism was initially proposed by @rmohr in #10291.
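To illustrate the idea, here is a minimal, hedged Go sketch (not the PR's actual code; the PID discovery, paths, and function name are assumptions) of how virt-launcher could resolve an artifact through the shared PID namespace and link it to the legacy location:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// linkContainerDisk exposes a containerdisk artifact, reachable through the
// shared PID namespace, at the legacy path expected by the domain XML.
// pid is the containerdisk container's entrypoint PID as seen by
// virt-launcher; artifactPath and legacyPath are illustrative values.
func linkContainerDisk(pid int, artifactPath, legacyPath string) error {
	procPath := fmt.Sprintf("/proc/%d/root%s", pid, artifactPath)
	if _, err := os.Stat(procPath); err != nil {
		return fmt.Errorf("containerdisk not reachable via proc: %w", err)
	}
	if err := os.MkdirAll(filepath.Dir(legacyPath), 0o755); err != nil {
		return err
	}
	// The symlink keeps the old location valid, so upgrades and migrations
	// do not need a different disk path.
	return os.Symlink(procPath, legacyPath)
}

func main() {
	// Hypothetical values, for illustration only.
	if err := linkContainerDisk(42, "/disk/disk.img", "/var/run/kubevirt/container-disks/disk_0.img"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}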

Fixes #

Release note

Access container disks via the proc filesystem and PID namespace sharing

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 3, 2024
@alicefr alicefr marked this pull request as draft May 3, 2024 13:04
@kubevirt-bot kubevirt-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL labels May 3, 2024
@kubevirt-bot kubevirt-bot added kind/build-change Categorizes PRs as related to changing build files of virt-* components sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute labels May 3, 2024
@kubevirt-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from alicefr. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alicefr
Member Author

alicefr commented May 3, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr alicefr changed the title Refacto containerdisks by accessing the artifact via proc and pid namespace sharing Refactor containerdisks by accessing the artifact via proc and pid namespace sharing May 3, 2024
@alicefr alicefr changed the title Refactor containerdisks by accessing the artifact via proc and pid namespace sharing Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing May 3, 2024
Member

@rmohr rmohr left a comment

Thank you Alice! This is exactly how I was hoping that it would play out!

This mechanism is highly complex, requires coordination between handler and launcher, and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.

Also note that it removes all ways to escape the container if we have a bug in the context of containerdisks (we had the one CVE which allowed that).

go_library(
name = "go_default_library",
srcs = ["container-disk.go"],
importpath = "kubevirt.io/kubevirt/cmd/container-disk-v3alpha",
Member

I think this is the final change which allows us to drop the alpha.

@fabiand
Member

fabiand commented May 5, 2024

@alicefr this is great work!

Member

@0xFelix 0xFelix left a comment

Great initiative!

hack/build-go.sh Outdated
Member

This file could really use a cleanup, in another PR though.

)

func main() {
noOp := flag.Bool("no-op", false, "program exits immediatly")
Member

Suggested change
noOp := flag.Bool("no-op", false, "program exits immediatly")
noOp := flag.Bool("no-op", false, "program exits immediately")

return "", fmt.Errorf("more than one file found in folder %s, only one disk is allowed", path)
}

return filepath.Join(fullPath, files[0].Name()), nil
Member

What if there are multiple files in a containerdisk? Is that a supportable use case? IIRC in CDI you can also choose which disk to import?

Member Author

Well, the old code only supported a single image per containerdisk. I haven't changed that.

Member

Keep these tests in another file? container-disk_test.go?

@@ -0,0 +1,27 @@
package main
Member

If I recall the other program was written in C to reduce memory consumption. What will memory consumption be now?

See #2844

Member

Good point. It may be better if we now import fewer packages, but it probably still requires a C application.

Member Author

Thanks for pointing this out. I didn't know that. I'll try to evaluate it

Member Author

We unfortunately still need a C binary; the size of the Go binary is still large, ~24M.

@alicefr
Member Author

alicefr commented May 7, 2024

As discussed with Roman offline, I'll probably also need to reintroduce the unmount logic. If we have old virt-launchers we still want to clean up their bind mounts. I'll do it in the next round.

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 21, 2024
@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. and removed dco-signoff: yes Indicates the PR's author has DCO signed all their commits. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 14, 2024
@alicefr
Member Author

alicefr commented Jun 14, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr
Member Author

alicefr commented Jun 19, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

@alicefr
Member Author

alicefr commented Jun 20, 2024

/test pull-kubevirt-e2e-k8s-1.30-sig-operator

Add unit tests for the new behavior and functions for containerdisks.

Signed-off-by: Alice Frosi <[email protected]>
In order to set the IOThreads affinity, the handler looks for the QEMU
process. However, the affinity is set when the VMI is not yet running,
and it may happen that QEMU hasn't started yet.

Signed-off-by: Alice Frosi <[email protected]>
@kubevirt-bot kubevirt-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 26, 2024
@alicefr
Member Author

alicefr commented Jun 26, 2024

/test pull-kubevirt-unit-test

@kubevirt-bot
Contributor

@alicefr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubevirt-unit-test-arm64 86778bb link false /test pull-kubevirt-unit-test-arm64
pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-migrations
pull-kubevirt-e2e-k8s-1.30-sig-compute 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute
pull-kubevirt-e2e-arm64 86778bb link false /test pull-kubevirt-e2e-arm64
pull-kubevirt-e2e-k8s-1.30-sig-storage 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-storage
pull-kubevirt-e2e-k8s-1.30-sig-compute-serial 86778bb link true /test pull-kubevirt-e2e-k8s-1.30-sig-compute-serial
pull-kubevirt-unit-test f2078fa link true /test pull-kubevirt-unit-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@alicefr
Member Author

alicefr commented Jun 26, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

@alicefr
Member Author

alicefr commented Jun 26, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

Another alternative is searching for all the container disk processes by executable name. We are already using this method in virt-handler to find the QEMU process. This would probably require some refactoring there, but I think it is the cleanest solution.
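For illustration, a minimal Go sketch of finding the container-disk processes by executable name, walking /proc and resolving the exe symlink; the function name and the "container-disk" executable name are assumptions, not the PR's actual code:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// findPidsByExe returns the PIDs of all processes whose executable matches
// exeName, by walking /proc and resolving the /proc/<pid>/exe symlink.
func findPidsByExe(exeName string) ([]int, error) {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		return nil, err
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		exe, err := os.Readlink(filepath.Join("/proc", e.Name(), "exe"))
		if err != nil {
			continue // process exited or not accessible
		}
		if filepath.Base(exe) == exeName {
			pids = append(pids, pid)
		}
	}
	return pids, nil
}

func main() {
	pids, err := findPidsByExe("container-disk")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("container-disk processes:", pids)
}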

@kubevirt-bot kubevirt-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 27, 2024
@kubevirt-bot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@vladikr
Member

vladikr commented Aug 7, 2024

One current problem with this change is that virt-launcher now has to signal the containerdisk process to terminate in order to achieve a graceful shutdown and have the pod set to the Completed/Succeeded state. I managed to call the new function StopContainerDiskContainers in the method which finalizes the virt-launcher pods. However, we don't have any such method for the target pod during a failed migration. Any idea how to solve this? The only solution I can think of is to add a new method to the launcher-handler protocol.

Previously we had the socket: we continuously monitored its existence and terminated when it was gone. We can now do the same with the pid file, and let virt-launcher's monitor remove it.
Would that be sufficient?
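To illustrate this suggestion, a minimal sketch (the pidfile path and polling interval are assumptions, not values from the PR) of a container-disk entrypoint that writes a pidfile and exits once virt-launcher removes it:

package main

import (
	"os"
	"strconv"
	"time"
)

// The container-disk process writes a pidfile and then simply waits for
// virt-launcher to remove it, exiting gracefully at that point so the pod
// can reach the Completed/Succeeded phase.
func main() {
	const pidFile = "/var/run/kubevirt-containerdisks/disk_0.pid"

	if err := os.WriteFile(pidFile, []byte(strconv.Itoa(os.Getpid())), 0o644); err != nil {
		os.Exit(1)
	}
	for {
		if _, err := os.Stat(pidFile); os.IsNotExist(err) {
			return // virt-launcher removed the pidfile: shut down gracefully
		}
		time.Sleep(time.Second)
	}
}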

@rmohr
Member

rmohr commented Aug 7, 2024

Previously we had the socket: we continuously monitored its existence and terminated when it was gone. We can now do the same with the pid file, and let virt-launcher's monitor remove it.
Would that be sufficient?

Sounds like this would work as well. No need for the socket trick anymore ...

@vladikr
Member

vladikr commented Aug 7, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Also, are we sure that sharing the pid namespace won't be an area of concern for some?
An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher.
Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

@alicefr
Member Author

alicefr commented Aug 8, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

I need to double check but I was using a relatively updated kubevirtci version.

Also, are we sure that sharing the pid namespace won't be an area of concern for some? An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher. Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

Not sure if sharing the same namespace inside the same pod is a real concern. Malicious software could now access the filesystem of all the containers in the same pod. However, those containers already share quite a lot of infrastructure.

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

@rmohr
Member

rmohr commented Aug 8, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Can there be a relation? I think here we mostly talk about the pid "namespace" and not a cgroup.

@vladikr
Member

vladikr commented Aug 12, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

Can there be a relation? I think here we mostly talk about the pid "namespace" and not a cgroup.

I wasn't certain because there were issues when we were working on separating qemu CPUs into a separate container. However, I tested this solution and it worked as expected.

@vladikr
Member

vladikr commented Aug 12, 2024

Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible.

I need to double check but I was using a relatively updated kubevirtci version.

Also, are we sure that sharing the pid namespace won't be an area of concern for some? An alternative could be to create the ephemeral file in the context of the container disk, pre-open both the backing and the leaf files and advertise the FDs to the virt-launcher. Then use the virDomainFDAssociate API and pass it to qemu via <source type=file fdgroup=[fds]>

Not sure if sharing the same namespace inside the same pod is a real concern. Malicious software could now access the filesystem of all the containers in the same pod. However, those containers already share quite a lot of infrastructure.

You're right, it might not be an issue in our case. I just wanted to make sure that we know all the weak points before moving away from the default pod configuration.

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

@alicefr
Member Author

alicefr commented Aug 12, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

@vladikr
Member

vladikr commented Aug 12, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

I think the container disk process can open these images and communicate the FDs over to the compute container via the shared dir.

@alicefr
Member Author

alicefr commented Aug 13, 2024

Passing an FD instead of a path is always possible; however, it requires some intervention from virt-handler, which isn't necessary with this version, where everything is handled by virt-launcher.

Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in the solution. Or do you think otherwise?

Who will be opening the disk image? I think it needs to be virt-handler and then we need to inject a copy of the fd in the virt-launcher process (I haven't tried this option yet) and then libvirt could possibly find it

I think the container disk process can open these images and communicate the FDs over to the compute container via the shared dir.

We could implement something like this, but you still require a socket to exchange the fd using ancillary data. This option doesn't require virt-handler or the shared proc namespace, but it sounds slightly more complicated.
Both options are doable, though.

Take into account that you still need to keep the old unmount logic because of upgrades, in order to clean up old virt-launchers.
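For illustration, a minimal sketch of the FD exchange over a Unix socket using SCM_RIGHTS ancillary data; the socket path, function name, and overall flow are assumptions, not code from this PR:

package fdpass

import (
	"net"
	"os"
	"syscall"
)

// SendDiskFD passes an already-open disk image file descriptor to another
// container over a Unix domain socket using SCM_RIGHTS ancillary data.
func SendDiskFD(socketPath, imagePath string) error {
	f, err := os.Open(imagePath) // read-only; containerdisk images stay immutable
	if err != nil {
		return err
	}
	defer f.Close()

	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	// The ancillary data carries the descriptor; the kernel duplicates it
	// into the receiving process. The payload just names the image.
	oob := syscall.UnixRights(int(f.Fd()))
	_, _, err = conn.(*net.UnixConn).WriteMsgUnix([]byte(imagePath), oob, nil)
	return err
}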

@alicefr
Member Author

alicefr commented Aug 13, 2024

@vladikr another disadvantage of using FDs over this option is migration. This solution keeps the same old path via the symlink, hence we don't need to adjust the migration XML. If you use the fd, you need to change the XML to use the fd.

UPDATE: probably a symlink to /proc/<pid>/fd/<fd> would also work for this option, avoiding the need to update the migration XML.

@xpivarc
Member

xpivarc commented Aug 13, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

@alicefr
Member Author

alicefr commented Aug 13, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

Do you know if containerdisks can be used "as they are" as OCI artifacts? Are OCI artifacts shareable on the same host like container images?

@akalenyu
Contributor

akalenyu commented Sep 2, 2024

Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify and just "use it".

Do you know if containerdisks can be used "as they are" as OCI artifacts? Are OCI artifacts shareable on the same host like container images?

Good question.. they are a v1.VolumeSource just like a PVC:
https://github.com/kubernetes/api/blob/release-1.31/core/v1/types.go#L200
but it looks like they are mounted ro,noexec:
https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/
https://kubernetes.io/docs/concepts/storage/volumes/#image

apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  volumes:
  - name: oci-volume
    image:
      reference: "quay.io/containerdisks/fedora@sha256:21a6f3e628cde7ec7afc9aedf69423924322225d236e78a1f90cf6f4d980f9b3"
      pullPolicy: IfNotPresent
  containers:
  - name: my-container
    image: busybox
    volumeMounts:
    - mountPath: /data
      name: oci-volume

@@ -17,7 +17,7 @@ const (
 	QEMUIMGPath = "/usr/bin/qemu-img"
 )
 
-func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.DiskInfo, error) {
+func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.ImgInfo, error) {
Contributor

Think you eliminated the only usage of this func

@vladikr
Member

vladikr commented Sep 11, 2024

I think we should move forward with this PR - whenever you have time @alicefr
I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)

This could serve as a stepping stone toward using the OCI volumes.

@akalenyu
Contributor

I think we should move forward with this PR - whenever you have time @alicefr I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)

This could serve as a stepping stone toward using the OCI volumes.

I am a little worried about the readonly part in OCI volumes. Waiting for @alicefr's take
#11845 (comment)

@vladikr
Member

vladikr commented Sep 11, 2024

I think we should move forward with this PR - whenever you have time @alicefr I tested this manually last week as part of the standalone virt-launcher effort and it works great. (some additional changes required)
This could serve as a stepping stone toward using the OCI volumes.

I am a little worried about the readonly part in OCI volumes. Waiting for @alicefr's take #11845 (comment)

Well, OCI Volumes are not part of this PR, so it's not a blocker.
In general, we don't open the disk.img for rw; it should always stay read-only anyway.

@rmohr
Member

rmohr commented Sep 11, 2024

In general, we don't open the disk.img for rw, it should always stay read only anyway.

Yeah just to provide a little bit more context: we mount it ro always, and then write via an overlay into a temporary directory (to avoid writing into container layers in general).
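For context, a hedged sketch of what creating such an ephemeral overlay with qemu-img could look like; the paths, helper name, and backing format are illustrative assumptions, not the actual KubeVirt implementation:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// createEphemeralOverlay creates a qcow2 overlay in a writable temporary
// directory on top of the read-only containerdisk image, so guest writes
// never touch the container layers.
func createEphemeralOverlay(backingImage, backingFormat, overlayPath string) error {
	cmd := exec.Command("qemu-img", "create",
		"-f", "qcow2",
		"-b", backingImage,
		"-F", backingFormat,
		overlayPath)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("qemu-img failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	if err := createEphemeralOverlay(
		"/var/run/kubevirt/container-disks/disk_0.img", // read-only backing file (illustrative path)
		"raw",
		"/var/run/kubevirt-ephemeral-disks/disk_0.qcow2", // writable overlay (illustrative path)
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}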

Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/build-change Categorizes PRs as related to changing build files of virt-* components needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/buildsystem Denotes an issue or PR that relates to changes in the build system. sig/compute size/XXL

9 participants