Windows Hyper-V Container Support For CRI #6862
Comments
cc @kevpar
I feel like... we already did this 😭. Thanks for the detailed write up Danny! I'm in!
cc @marosset @jsturtevant as well
When a user specifies a container (or the sum of containers) with limits above or below the default specified in the containerd configuration, what will the behavior be? For example, if the pod requests more CPU or memory than the default in the containerd configuration:
With a pod:
One piece of this work is in #6901. It would be nice to have a checklist in the issue for each piece of implementation that needs to be done, with links to the PRs as they are published.
Since it just came up in #6508, I thought I'd record a thought about the (punted to future) different platform-matcher use-case for Hyper-V isolation, from a question about using (Quoting myself because it's a bit out-of-context)
In the discussion of #6491, I think we had agreed that this would be done with a custom matcher. I don't recall any discussion of how this custom matcher would be triggered; at the time I had assumed it'd be the annotation on the ImageSpec (copied from the Pod Spec), but looking at #6657, I suspect the canonical way would be that the Hyper-V isolation runtime is somehow also able to influence the matcher used by PullImage, in the same way it's going to be able to influence the snapshotter. It'd be nice if this were magic from enabling Hyper-V isolation, but in the design currently mooted, that's not visible outside the hcsshim-private
Hey, do we need to link this to Azure/AKS#1792?
@TBBle I completely forgot to reply here, my apologies. Your last train of thought is something we're thinking about, as the work described in #6657 (and recently implemented as an experimental feature) is really exciting to consider applying to use cases like this. It'd need some k8s work to be fully usable though, which punts the usability out by quite a few months.
Yes, that'd make sense
Has there been any progress on "Add new test runs for wcow-hypervisor support"? It looks like those test runs are the only thing standing in the way of marking this complete for the 1.7 milestone.
@claudiubelu - FYI
A quick note on one of the "future" tasks (not tracked elsewhere AFAIK, so putting it here)
#6899 has landed (fulfilling the part of #6657 we care about), so we can now have per-runtime snapshotters. However, to use that to deliver the above use-case, we also need a way to provide multiple configurations of the one WCOW snapshotter with different PlatformMatchers.

#7431 for host process containers is doing a different thing for its similar use-case though, since in its case the platform is visible in CRI's API, and so the proposal there is for CRI to tell the existing snapshotter to use a different matcher.

AFAIR (I'm still on sabbatical, so "R" is carrying a lot of load in that phrase) we don't currently have a "multiple-config snapshotters" setup; snapshotters register themselves by static string name, which is what the runtime config matches. So we'd need to teach the WCOW snapshotter ("windows") to register itself a few times with different platform configs (ideally sharing storage? Same underlying instance underneath; anything else will be wasteful). Or perhaps modify snapshot plugin initialisation to be able to produce multiple snapshotters from

All that said, the matcher is needed by the "pull" operation, which is really "network to content store"; the snapshotter doesn't actually see the Matcher at all. So per the early, rambling bit of #7431 (comment), is "per-runtime snapshotter" actually the right tool for distinguishing Hyper-V and Process isolation image-choice logic? Should a "per-runtime platform matcher" be used instead? All three of Hyper-V, Process, and Not-At-All (host process) isolation share the same on-disk format and images, AFAIK, so they should really share a single snapshotter, for ease-of-comprehension if nothing else.
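For context on the per-runtime snapshotter piece above, a hypothetical sketch of what the post-#6899 CRI config could look like, assuming a second registration of the Windows snapshotter existed under a different name (the `windows-hyperv` name here is invented for illustration and does not exist today):

```toml
# Sketch only: per-runtime snapshotter selection in the CRI plugin config.
# "windows" is the real WCOW snapshotter name; "windows-hyperv" is a
# hypothetical second registration with a looser platform matcher.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-process]
  runtime_type = "io.containerd.runhcs.v1"
  snapshotter = "windows"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-hypervisor]
  runtime_type = "io.containerd.runhcs.v1"
  snapshotter = "windows-hyperv"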
I agree we should have a per-runtime platform matcher in addition to a per-runtime snapshotter. However, there is also the additional complexity of image management, at least with CRI. The CRI API defines image operations that key only off of image name, so we need to figure out what happens when you, e.g., pull the same image with two different runtimes/platforms. CRI (and thus kubelet) may need to be enlightened to key images on a name/runtime tuple instead.
@kevpar - I thought that's why we added annotations to PullImage for CRI, so we passed in the sandbox info and knew what type of thing to do here, right? I get that's sorta a Windows hack, but is there a problem using that?
I think the annotations were added to facilitate passing in what runtime class a given pull should use. Kubelet doesn't actually do this right now, AFAIK, though.
An image name plus a

I was under the impression that kubelet tracked images by their SHA256 ID (returned in the

This is the same existing behaviour if a floating tag is named, I guess, and someone updates it between
Going to close this out and open issues for the "Future" items for us to track. The foundation is there for this to work in 1.7, so this accomplished what it set out to do for the release.
Does anybody have a step-by-step guide on how to get containerd working with Hyper-V?
What is the problem you're trying to solve
We'd like to support launching hypervisor-isolated Windows containers through the CRI entry point to light up this scenario for K8s. There's support to launch Hyper-V containers in Containerd itself via the WithWindowsHyperV client option, as well as the ctr testing tool's `--isolation` flag; however, there is nothing in the CRI plugin that makes use of this functionality at the moment.
Describe the solution you'd like
There are a few spots that would need to change to add "full" support, but at least in the 1.7 timeframe, for getting in the minimal amount needed to launch/manage these containers, there's not a great deal.
Initial Support (1.7 timeframe)
Filling in the HyperV runtime spec field
The Windows Containerd shim exposes a SandboxIsolation enum that can be used to tell the shim what kind of container/pod to launch. This field, in combination with new runtime class definitions in Containerd, is how we can differentiate between process and hypervisor isolation for Windows. Below is an example pod spec and runtime class definition in Containerd's config file:
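The original example was not captured in this copy of the issue. A hedged sketch of what such a runtime class definition and pod spec might look like, based on the runhcs shim's published options (the `SandboxIsolation = 1` value is assumed to map to the HYPERVISOR enum member, and the image/handler names are illustrative):

```toml
# Sketch: a hypervisor-isolated runtime class in the Containerd config file.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-hypervisor]
  runtime_type = "io.containerd.runhcs.v1"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-hypervisor.options]
    SandboxPlatform = "windows/amd64"
    SandboxIsolation = 1  # assumed: 0 = PROCESS, 1 = HYPERVISOR
```

A pod would then select this runtime via a Kubernetes RuntimeClass whose `handler` matches the runtime name above:

```yaml
# Sketch: RuntimeClass plus a pod selecting it.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: windows-hyperv
handler: runhcs-wcow-hypervisor
---
apiVersion: v1
kind: Pod
metadata:
  name: hyperv-example
spec:
  runtimeClassName: windows-hyperv
  containers:
  - name: app
    image: mcr.microsoft.com/windows/nanoserver:ltsc2022
```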
We can additionally expand what the default CRI config in Containerd for Windows can be if not supplied in the config file. We would have to continually update this to include new runtimes any time a new OS release/container image pair is made available.
Resource Limits For the VM
One way that the Windows shim supports setting resource limits (memory, vCPU count) for the lightweight VM is via annotations. The virtual machine based annotations all begin with `io.microsoft.virtualmachine.*`, so playing into the last section above, we would allow these annotations via the `PodAnnotations` and `ContainerAnnotations` fields as shown. An example pod spec asking for the VM hosting the containers in the pod to boot with 4GB of memory and 4 virtual processors is below:
Another way resource limits could be set, although the values would be fixed for the duration of a deployment unless Containerd was restarted or the value was overridden by an annotation, would be the vm_processor_count and vm_memory_size_in_mb fields that are present in the Windows shim-specific options.
This could be extended further by having the runtime class specify the resource limits in the name. For example runhcs-wcow-hypervisor-20348-1vp2gb:
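A sketch of the "limits in the runtime class name" idea, with fixed VM sizes baked into the shim options of each named runtime; the `VmProcessorCount`/`VmMemorySizeInMb` TOML key casing is an assumption based on the shim's option field names:

```toml
# Sketch: one runtime class per fixed VM size, encoded in the name.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-hypervisor-1vp2gb.options]
  SandboxIsolation = 1
  VmProcessorCount = 1
  VmMemorySizeInMb = 2048

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runhcs-wcow-hypervisor-4vp8gb.options]
  SandboxIsolation = 1
  VmProcessorCount = 4
  VmMemorySizeInMb = 8192
```

A pod would then pick its VM size by choosing the matching RuntimeClass, at the cost of maintaining one runtime entry per size.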
Testing
This is tricky, as GitHub Actions runners don't support nested virtualization; we'll likely need to do something similar to the approach the Windows periodic tests use and allocate Azure VMs to do our bidding (https://github.com/containerd/containerd/blob/main/.github/workflows/windows-periodic.yml). This might be the most work.
"Full Support"
Pulling images that don't match hosts build
One of the pros of Hyper-V containers is that you're not constrained to the Windows host's build number for image choice (a ws2019 host no longer has to use only a 1809/ws2019 image). However, the Windows platform-matching code is finicky and tough to get right, and the main selling point for these containers is really security. I'd be alright punting on the platform package changes until we know the right approach, and just getting in the work to be able to launch these in general.
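To make the build-number constraint concrete, here is a minimal, self-contained sketch of the matching rule described above. This is not containerd's actual platform-matcher implementation (the real compatibility rules for Hyper-V, such as which guest builds a given host supports, are more subtle); it only illustrates that process isolation ties the image to the host build while Hyper-V isolation relaxes that:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// buildNumber extracts the Windows build number from an osversion string
// like "10.0.17763.3406" (the third dotted component), or 0 on bad input.
func buildNumber(osVersion string) int {
	parts := strings.Split(osVersion, ".")
	if len(parts) < 3 {
		return 0
	}
	n, err := strconv.Atoi(parts[2])
	if err != nil {
		return 0
	}
	return n
}

// canRun reports whether an image built against imageOSVersion can run on a
// host at hostOSVersion under the given isolation mode. Process isolation
// requires matching build numbers; Hyper-V isolation boots a utility VM, so
// the host build no longer has to match (real rules are more nuanced).
func canRun(hostOSVersion, imageOSVersion string, hyperv bool) bool {
	if hyperv {
		return true
	}
	return buildNumber(hostOSVersion) == buildNumber(imageOSVersion)
}

func main() {
	// ws2019 host (build 17763) with a ws2022 image (build 20348):
	fmt.Println(canRun("10.0.17763.3406", "10.0.20348.887", false)) // process isolation
	fmt.Println(canRun("10.0.17763.3406", "10.0.20348.887", true))  // Hyper-V isolation
}
```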
Resource Limits Looking Forward
There are platform limitations to supporting vCPU hot-add, but ideally k8s would tally up the total resource limits by adding up the container resource limits in the pod and sending them in some field for Windows. If that does come to fruition, then we'll need to do something with this data. Writing this down mainly for future reference.
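The tallying step above can be sketched as follows. The types and the 256MiB utility-VM overhead are illustrative assumptions, not real CRI types or a real shim default; the real plumbing would come from kubelet via a new CRI field:

```go
package main

import "fmt"

// ContainerLimits is a hypothetical stand-in for per-container resource
// limits in a pod; the real types would come from the CRI API.
type ContainerLimits struct {
	MemoryBytes int64
	MilliCPU    int64
}

// podVMSize tallies per-container limits into one figure the shim could use
// to size the utility VM, adding headroom for the VM itself. The overhead
// constant is an assumption for illustration only.
func podVMSize(containers []ContainerLimits) (memBytes, milliCPU int64) {
	const vmMemoryOverheadBytes = 256 << 20 // assumed 256MiB of UVM overhead
	for _, c := range containers {
		memBytes += c.MemoryBytes
		milliCPU += c.MilliCPU
	}
	return memBytes + vmMemoryOverheadBytes, milliCPU
}

func main() {
	mem, cpu := podVMSize([]ContainerLimits{
		{MemoryBytes: 1 << 30, MilliCPU: 1000}, // 1GiB, 1 CPU
		{MemoryBytes: 2 << 30, MilliCPU: 500},  // 2GiB, 0.5 CPU
	})
	fmt.Println(mem, cpu) // 3GiB + overhead, 1500 millicores
}
```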
Additional context
Thanks for reading the wall of text :)
Tracking
1.7
Future