Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kubelet: get cgroup driver config from CRI #118770

Merged

Conversation

marquiz
Copy link
Contributor

@marquiz marquiz commented Jun 20, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR implements kubelet cgroup driver detection over the CRI API, implemented in multiple patches:

  • cri-api: add rpc for querying runtime configuration
    This patch adds a new rpc to the runtime service to query CRI runtime
    configuration options. For now, it only contains one field for getting
    the cgroup driver (systemd or cgroupfs) to be used.
  • Add CRI fake runtimes for RuntimeConfig rpc
    Also update the CRI RuntimeService inteface to include the new
    RuntimeConfig rpc.
  • Add KubeletCgroupDriverFromCRI feature gate
  • kubelet: initialization of runtime service earlier in the startup
    This patch refactors the kubelet startup code to initialize the runtime
    service earlier in the startup sequence. We want this to be able to
    query the cgroup driver setting from the CRI befure initializing the
    cgroup manager.
  • kubelet: get cgroup driver config from CRI
    This patch modifies kubelet to get the cgroup driver setting from the
    CRI runtime using the newly added RuntimeConfig rpc. The new code path
    only takes place if the KubeletCgroupDriverFromCRI feature gate is
    enabled. If the runtime returns a not-implemented error kubelet falls
    back to using the cgroupDriver configuration option, with a log message
    instructing the user to upgrade to w newer container runtime. Other rpc
    errors cause kubelet to exit as is the case if the runtime returns an
    unknown cgroup driver.

Which issue(s) this PR fixes:

Refs kubernetes/enhancements#4033

Special notes for your reviewer:

Still work in progress: most importantly unit tests are not included.
If so desired, the changes to cri-api could be submitted as a separate PR.

Does this PR introduce a user-facing change?

With the KubeletCgroupDriverFromCRI feature gate enabled and sufficiently new version of a container
runtime, kubelet automatically detects the cgroup driver config from the container runtime, eliminating
the need to specify the `cgroupDriver` configuration option (or --cgroup-driver` flag) of kubelet.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/4033-group-driver-detection-over-cri

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 20, 2023
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jun 20, 2023
@k8s-ci-robot k8s-ci-robot requested review from mrunalp and odinuge June 20, 2023 16:48
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 20, 2023
@marquiz
Copy link
Contributor Author

marquiz commented Jun 20, 2023

This is an early version of the PR, especially targeted for bike-shedding on the CRI API naming. It also contains a full implementation (sans unit tests) to see how the new CRI rpc would be used

/cc @mrunalp @mikebrow @haircommander @sftim

@sftim
Copy link
Contributor

sftim commented Jun 20, 2023

Changelog suggestion for when this is ready:

Added automatic detection of cgroup driver for Linux nodes.

With the `KubeletCgroupDriverFromCRI` feature gate enabled, and sufficiently new version of a container
runtime, kubelet automatically detects the cgroup driver configuration from the container runtime,
eliminating the need to specify the `cgroupDriver` configuration option (nor the deprecated `--cgroup-driver`
command line argument) for the kubelet.

I mentioned Linux because Windows nodes don"t have cgroups.

return err
}
// CRI implementation doesn't support RuntimeConfig, fallback
klog.InfoS("CRI implementation should be updated to support RuntimeConfig when KubeletCgroupDriverFromCRI feature gate has been enabled. Falling back to using cgroupDriver from kubelet config.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For beta, maybe we can even record an Event into the API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good for the user. The proposal now states that for beta Drop fallback to old behavior. CRI implementations expected to have support. We can of course change that, or then add an event for alpha already 🧐

Comment on lines +638 to +640
if utilfeature.DefaultFeatureGate.Enabled(features.KubeletCgroupDriverFromCRI) {
if err := getCgroupDriverFromCRI(ctx, s, kubeDeps); err != nil {
return err
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we do this:

  • per new connection to the socket?
  • per kubelet startup?

I"m thinking that if someone restarts the process at the other end of the socket, we should re-query this setting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens only at kubelet startup.

Technically we could query it at CRI reconnection but I wouldn"t go there.

  1. The cgroup driver is really not something that"s expected to be changed without taking the node down. Changing it correctly requires restarting all pods, also the static ones. System reboot is a sage method.
  2. I don"t know what all would break in kubelet (internally) if we changed the cgroup-driver in-flight

We could re-query the cgroup driver at re-connection and exit if the cgroup driver setting has changed. But I"m not sure if that"s desired. Thoughts @SergeyKanzhelev @mrunalp @haircommander

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our efforts would be better spent making it very clear a change in the cgroup driver in the CRI requires a node reboot

}
// CRI implementation doesn't support RuntimeConfig, fallback
klog.InfoS("CRI implementation should be updated to support RuntimeConfig when KubeletCgroupDriverFromCRI feature gate has been enabled. Falling back to using cgroupDriver from kubelet config.")
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can there be a comment saying where the cgroup driver is coming from/set by if this case happens

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It now states Falling back to using cgroupDriver from kubelet config. Do you suggest something else/in addition? @haircommander

case runtimeapi.CgroupDriver_CGROUPFS:
s.CgroupDriver = "cgroupfs"
default:
return fmt.Errorf("runtime returned an unknown cgroup driver %d", d)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don"t know what the kubelet"s policy on panics are, but this seems like a prime reason to panic as it"s a programming error that should never happen. I"m not stubborn about that though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, It"s a programming, but on the CRI runtime"s side, out of our (kubelet"s) control so I"m not sure we want to panic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let"s fall back to the configuration file for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more... this should be treated differently than if runtime is not available, for example, or returned error, going forward. I think if runtime returned error we will need to retry a few times and fallback to config. If it"s a new unknown value, kubelet should not retry and should exit immediately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, is this what you were thinking @SergeyKanzhelev

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp I added it bc of @SergeyKanzhelev "s comment. I am personally fine without it, i"ll let y"all decide :)

@marquiz
Copy link
Contributor Author

marquiz commented Jun 21, 2023

@haircommander @mrunalp @mikebrow @sftim any thoughts on the CRI API naming?

The main addition is now

rpc RuntimeConfig(RuntimeConfigRequest) returns (RuntimeConfigResponse)

We have rpc Status(...) so maybe it could be just rpc Configuration(...) or smth.

But then we have also

rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

message UpdateRuntimeConfigRequest {
    RuntimeConfig runtime_config = 1;
}

So the naming is easily clashing...

@haircommander
Copy link
Contributor

yeah I guess RuntimeConfig is the best we"ve got so far, considering we use RuntimeStatus, RuntimeCondition, and UpdateRuntimeConfig.

An issue though is this runtime config has nothing to do with UpdateRuntimeConfig 🙃

@mikebrow
Copy link
Member

@haircommander @mrunalp @mikebrow @sftim any thoughts on the CRI API naming?

The main addition is now

rpc RuntimeConfig(RuntimeConfigRequest) returns (RuntimeConfigResponse)

We have rpc Status(...) so maybe it could be just rpc Configuration(...) or smth.

But then we have also

rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

message UpdateRuntimeConfigRequest {
    RuntimeConfig runtime_config = 1;
}

So the naming is easily clashing...

naming is hard... over on the status api we currently have the ability to provide any desired runtime config response contents ... at least for prototyping just need to agree on the nvp graph/map esp. with status verbose content...

see crictl "info" response

(base) root@ubnt:~# crictl info
{
  "status": {
    "conditions": [
      {
        "type": "RuntimeReady",
        "status": true,
        "reason": "",
        "message": ""
      },
      {
        "type": "NetworkReady",
        "status": true,
        "reason": "",
        "message": ""
      }
    ]
  },
  "cniconfig": {
    "PluginDirs": [
      "/opt/cni/bin"
    ],
    "PluginConfDir": "/etc/cni/net.d",
    "PluginMaxConfNum": 1,
    "Prefix": "eth",
    "Networks": [
      {
        "Config": {
          "Name": "cni-loopback",
          "CNIVersion": "0.3.1",
          "Plugins": [
            {
              "Network": {
                "type": "loopback",
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"type\":\"loopback\"}"
            }
          ],
          "Source": "{\n\"cniVersion\": \"0.3.1\",\n\"name\": \"cni-loopback\",\n\"plugins\": [{\n  \"type\": \"loopback\"\n}]\n}"
        },
        "IFName": "lo"
      },
      {
        "Config": {
          "Name": "containerd-net",
          "CNIVersion": "1.0.0",
          "Plugins": [
            {
              "Network": {
                "type": "bridge",
                "ipam": {
                  "type": "host-local"
                },
                "dns": {}
              },
              "Source": "{\"bridge\":\"cni0\",\"ipMasq\":true,\"ipam\":{\"ranges\":[[{\"subnet\":\"10.88.0.0/16\"}],[{\"subnet\":\"2001:4860:4860::/64\"}]],\"routes\":[{\"dst\":\"0.0.0.0/0\"},{\"dst\":\"::/0\"}],\"type\":\"host-local\"},\"isGateway\":true,\"promiscMode\":true,\"type\":\"bridge\"}"
            },
            {
              "Network": {
                "type": "portmap",
                "capabilities": {
                  "portMappings": true
                },
                "ipam": {},
                "dns": {}
              },
              "Source": "{\"capabilities\":{\"portMappings\":true},\"type\":\"portmap\"}"
            }
          ],
          "Source": "{\n  \"cniVersion\": \"1.0.0\",\n  \"name\": \"containerd-net\",\n  \"plugins\": [\n    {\n      \"type\": \"bridge\",\n      \"bridge\": \"cni0\",\n      \"isGateway\": true,\n      \"ipMasq\": true,\n      \"promiscMode\": true,\n      \"ipam\": {\n        \"type\": \"host-local\",\n        \"ranges\": [\n          [{\n            \"subnet\": \"10.88.0.0/16\"\n          }],\n          [{\n            \"subnet\": \"2001:4860:4860::/64\"\n          }]\n        ],\n        \"routes\": [\n          { \"dst\": \"0.0.0.0/0\" },\n          { \"dst\": \"::/0\" }\n        ]\n      }\n    },\n    {\n      \"type\": \"portmap\",\n      \"capabilities\": {\"portMappings\": true}\n    }\n  ]\n}\n"
        },
        "IFName": "eth0"
      }
    ]
  },
  "config": {
    "containerd": {
      "snapshotter": "overlayfs",
      "defaultRuntimeName": "runc",
      "defaultRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": [],
        "ContainerAnnotations": [],
        "runtimeRoot": "",
        "options": {},
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "untrustedWorkloadRuntime": {
        "runtimeType": "",
        "runtimePath": "",
        "runtimeEngine": "",
        "PodAnnotations": [],
        "ContainerAnnotations": [],
        "runtimeRoot": "",
        "options": {},
        "privileged_without_host_devices": false,
        "baseRuntimeSpec": "",
        "cniConfDir": "",
        "cniMaxConfNum": 0
      },
      "runtimes": {
        "runc": {
          "runtimeType": "io.containerd.runc.v2",
          "runtimePath": "",
          "runtimeEngine": "",
          "PodAnnotations": [],
          "ContainerAnnotations": [],
          "runtimeRoot": "",
          "options": {
            "BinaryName": "",
            "CriuImagePath": "",
            "CriuPath": "",
            "CriuWorkPath": "",
            "IoGid": 0,
            "IoUid": 0,
            "NoNewKeyring": false,
            "NoPivotRoot": false,
            "Root": "",
            "ShimCgroup": "",
            "SystemdCgroup": false
          },
          "privileged_without_host_devices": false,
          "baseRuntimeSpec": "",
          "cniConfDir": "",
          "cniMaxConfNum": 0
        }
      },
      "noPivot": false,
      "disableSnapshotAnnotations": true,
      "discardUnpackedLayers": false,
      "ignoreRdtNotEnabledErrors": false
    },
    "cni": {
      "binDir": "/opt/cni/bin",
      "confDir": "/etc/cni/net.d",
      "maxConfNum": 1,
      "confTemplate": "",
      "ipPref": ""
    },
    "registry": {
      "configPath": "",
      "mirrors": {},
      "configs": {},
      "auths": {},
      "headers": {}
    },
    "imageDecryption": {
      "keyModel": "node"
    },
    "disableTCPService": true,
    "streamServerAddress": "127.0.0.1",
    "streamServerPort": "0",
    "streamIdleTimeout": "4h0m0s",
    "enableSelinux": false,
    "selinuxCategoryRange": 1024,
    "sandboxImage": "registry.k8s.io/pause:3.8",
    "statsCollectPeriod": 10,
    "systemdCgroup": false,
    "enableTLSStreaming": false,
    "x509KeyPairStreaming": {
      "tlsCertFile": "",
      "tlsKeyFile": ""
    },
    "maxContainerLogSize": 16384,
    "disableCgroup": false,
    "disableApparmor": false,
    "restrictOOMScoreAdj": false,
    "maxConcurrentDownloads": 3,
    "disableProcMount": false,
    "unsetSeccompProfile": "",
    "tolerateMissingHugetlbController": true,
    "disableHugetlbController": true,
    "device_ownership_from_security_context": false,
    "ignoreImageDefinedVolumes": false,
    "netnsMountsUnderStateDir": false,
    "enableUnprivilegedPorts": false,
    "enableUnprivilegedICMP": false,
    "containerdRootDir": "/var/lib/containerd",
    "containerdEndpoint": "/run/containerd/containerd.sock",
    "rootDir": "/var/lib/containerd/io.containerd.grpc.v1.cri",
    "stateDir": "/run/containerd/io.containerd.grpc.v1.cri"
  },
  "golang": "go1.19.9",
  "lastCNILoadStatus": "OK",
  "lastCNILoadStatus.default": "OK"
}

@mikebrow
Copy link
Member

mikebrow commented Jun 21, 2023

yeah I guess RuntimeConfig is the best we"ve got so far, considering we use RuntimeStatus, RuntimeCondition, and UpdateRuntimeConfig.

An issue though is this runtime config has nothing to do with UpdateRuntimeConfig 🙃

The original idea for UpdateRuntimeConfig was to add more optional structs as we added more runtime configuration options for kubelet..

Maybe we don"t need new apis so much as we need to add fields to existing update structs and consider some constants/graph/map format for status info.

@haircommander
Copy link
Contributor

The original idea for UpdateRuntimeConfig was to add more optional structs as we added more runtime configuration options for kubelet..

not within scope here: something we should think about long term is the direction of data. UpdateRuntimeConfig tells cri to update (kubelet->cri). RuntimeConfig is kubelet asking CRI what to do (cri->kubelet). as time goes on, we probably want to choose one direction of information

@mikebrow
Copy link
Member

after discussion with @haircommander perhaps the new runtime service cri api should be StaticRuntimeConfig to cover retrieving the cgroup driver for now and we can add more content that won"t change, without a reboot of the container runtime as needed...

@marquiz marquiz force-pushed the devel/cgroup-driver-autoconfig branch from 4749ca6 to b8f1cb4 Compare June 22, 2023 05:18
marquiz and others added 4 commits July 17, 2023 12:27
This patch refactors the kubelet startup code to initialize the runtime
service earlier in the startup sequence. We want this to be able to
query the cgroup driver setting from the CRI befure initializing the
cgroup manager.
This patch modifies kubelet to get the cgroup driver setting from the
CRI runtime using the newly added RuntimeConfig rpc. The new code path
only takes place if the KubeletCgroupDriverFromCRI feature gate is
enabled. If the runtime returns a not-implemented error kubelet falls
back to using the cgroupDriver configuration option, with a log message
instructing the user to upgrade to w newer container runtime. Other rpc
errors cause kubelet to exit as is the case if the runtime returns an
unknown cgroup driver.
@haircommander haircommander force-pushed the devel/cgroup-driver-autoconfig branch from c83aa53 to d79d775 Compare July 17, 2023 16:46
@haircommander haircommander force-pushed the devel/cgroup-driver-autoconfig branch from d79d775 to bfa62e0 Compare July 17, 2023 17:05
@haircommander
Copy link
Contributor

/retest

Copy link
Member

@SergeyKanzhelev SergeyKanzhelev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

A few notes:

  1. The call to PreInitRuntimeService was moved to earlier in the flow unconditionally. This change the behavior when feature gate is not set. Looking at what is being done between the new callsite and the old callsite - the risk that it will break something is very low.
  2. Changelog file for CRI API will need to be updated. Close to code freeze it"s best to do as a follow up as there is another PR updating it now.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 51a623a1e04224820eee47f0c4c660229b07713b

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marquiz, mikebrow, mrunalp, SergeyKanzhelev

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit 1fef8fd into kubernetes:master Jul 17, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone Jul 17, 2023
@marquiz marquiz deleted the devel/cgroup-driver-autoconfig branch July 18, 2023 05:58
@saschagrunert
Copy link
Member

@marquiz is the release note still WIP or can we remove that prefix?

@marquiz
Copy link
Contributor Author

marquiz commented Jul 18, 2023

@marquiz is the release note still WIP or can we remove that prefix?

I removed the wip prefix

@saschagrunert
Copy link
Member

Thank you @marquiz! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

8 participants