Refactor containerdisks by accessing the artifacts via proc and pid namespace sharing #11845
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
Thank you Alice! This is exactly how I was hoping that it would play out!
This mechanism is highly complex: it requires coordination between handler and launcher and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.
Also note that it removes all ways to escape the container if we have a bug in the context of containerdisks (we had one CVE which allowed that).
go_library(
    name = "go_default_library",
    srcs = ["container-disk.go"],
    importpath = "kubevirt.io/kubevirt/cmd/container-disk-v3alpha",
I think this is the final change which allows us to drop the alpha.
@alicefr this is great work!
Great initiative!
hack/build-go.sh
This file could really use a cleanup, in another PR though.
)

func main() {
    noOp := flag.Bool("no-op", false, "program exits immediatly")
Suggested change:
-    noOp := flag.Bool("no-op", false, "program exits immediatly")
+    noOp := flag.Bool("no-op", false, "program exits immediately")
    return "", fmt.Errorf("more than one file found in folder %s, only one disk is allowed", path)
}

return filepath.Join(fullPath, files[0].Name()), nil
What if there are multiple files in a containerdisk? Is that a supportable use case? IIRC in CDI you can also choose which disk to import?
Well, the old code only supported a single image per containerdisk. I haven't changed that.
Keep these tests in another file, container-disk_test.go?
@@ -0,0 +1,27 @@
+package main
If I recall the other program was written in C to reduce memory consumption. What will memory consumption be now?
See #2844
Good point. It may be better now that we import fewer packages, but it probably still requires a C application.
Thanks for pointing this out. I didn't know that. I'll try to evaluate it
We need a C binary unfortunately; the size of the Go binary is still large, ~24M.
As discussed with Roman offline, I'll probably also need to reintroduce the unmount logic. If we have old virt-launchers, we still want to clean up their bind mounts. I'll do it in the next round.
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
/test pull-kubevirt-e2e-k8s-1.30-sig-operator
Add unit tests for the new behavior and functions for containerdisks. Signed-off-by: Alice Frosi <[email protected]>
Signed-off-by: Alice Frosi <[email protected]>
In order to set the IOThreads affinity, the handler looks for the QEMU process. However, the affinity is set when the VMI is not yet running, and it might happen that QEMU hasn't started yet. Signed-off-by: Alice Frosi <[email protected]>
/test pull-kubevirt-unit-test
@alicefr: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
One current problem with this change is that virt-launcher now signals the containerdisk process to terminate, in order to have a graceful shutdown and the pod set to the completed/succeeded state. I managed to call the new function
Another alternative is searching for all the containerdisk processes by executable name. We are already using this method in virt-handler to find the QEMU process. This would probably require some refactoring there, but I think it is the cleanest solution.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Previously we had the socket: we continuously monitored its existence and terminated when it was gone. We can now do the same with the pid file, and let virt-launcher's monitor remove it.
Sounds like this would work as well. No need for the socket trick anymore ...
Are we certain that this approach will work with cgroups v2? I remember that there were some implications with some files being visible but not accessible. Also, are we sure that sharing the pid namespace won't be an area of concern for some?
I need to double check but I was using a relatively updated kubevirtci version.
Not sure if sharing the same PID namespace inside the same pod is a real concern. Malicious software could now access the filesystem of all the containers in the same pod. However, those containers already share quite a lot of infrastructure. Passing an FD instead of a path is always possible; it however requires some intervention by virt-handler, which isn't necessary with this version. Everything is handled by virt-launcher.
Can there be a relation? I think here we mostly talk about the pid "namespace" and not a cgroup.
I wasn't certain because there were issues when we were working on separating qemu CPUs into a separate container. However, I tested this solution and it worked as expected.
You're right, it might not be an issue in our case. I just wanted to make sure that we know all the weak points before we are moving away from the default pod configuration.
Why do you think virt-handler would be needed? We can advertise the FDs the same way as we advertise the pid file in this solution. Or do you think otherwise?
Who will be opening the disk image? I think it needs to be virt-handler, and then we need to inject a copy of the fd into the virt-launcher process (I haven't tried this option yet) so that libvirt could possibly find it.
I think the containerdisk process can open these images and communicate the FDs over to the compute container via the shared dir.
We could implement something like this, but you still require a socket to exchange the fd using ancillary data. This option doesn't require virt-handler or the shared proc namespace, but it sounds slightly more complicated. Take into account that you still need to keep the old unmount logic because of the upgrade, in order to clean up old virt-launchers.
@vladikr another disadvantage of using FDs over this option is the migration. This solution keeps the same old path with the symlink, hence we don't need to adjust the migration XML. If you use the fd, you need to change the XML to use the fd. UPDATE: probably a symlink to the
Note that Kubernetes is going to support OCI artifacts as volumes now, so maybe we can simplify this to just use it.
Do you know if containerdisks can be used "as they are" as OCI artifacts? Are OCI artifacts shareable on the same host like container images?
good question.. they are
@@ -17,7 +17,7 @@ const (
 	QEMUIMGPath = "/usr/bin/qemu-img"
 )

-func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.DiskInfo, error) {
+func GetImageInfo(imagePath string, context IsolationResult, config *v1.DiskVerification) (*containerdisk.ImgInfo, error) {
I think you eliminated the only usage of this func.
I think we should move forward with this PR whenever you have time, @alicefr. This could serve as a stepping stone toward using the OCI volumes.
I am a little worried about the readonly part in OCI volumes. Waiting for @alicefr's take
Well, OCI volumes are not part of this PR, so it's not a blocker.
Yeah, just to provide a little bit more context: we always mount it ro, and then write via an overlay into a temporary directory (to avoid writing into container layers in general).
What this PR does
Before this PR:
Currently, containerdisk artifacts are located in a separate container with respect to the compute container, and in order to make them available to virt-launcher, virt-handler needs to bind mount them into the container filesystem. This mechanism is highly complex: it requires coordination between handler and launcher and can create clean-up issues if the handler doesn't correctly unmount the artifacts at container deletion.
After this PR:
This PR introduces a new mechanism for accessing containerdisks. Since the containerdisk container is deployed in the same pod as virt-launcher, it is possible to enable PID namespace sharing between the containers in the same pod. In this way, virt-launcher can access the containerdisk artifacts via the proc filesystem, e.g.
/proc/<pid>/root/<path-to-artifacts>
The coordination mechanism between launcher and handler is no longer needed. Therefore, we can further simplify the containerdisk binary. Additionally, we completely remove the C program and replace it with a simpler Go program which simply creates a pidfile. In order to avoid any conflicts with upgrades and migrations, the new setup creates a symlink between the location of the artifacts via proc and the old location.
This mechanism has been initially proposed by @rmohr in #10291
Fixes #
Release note