[BUG] [SandboxChanged] Pod sandbox changed, it will be killed and re-created on AKS nodes #48142

Open
iohenkies opened this issue Nov 21, 2024 · 0 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments


iohenkies commented Nov 21, 2024

Rancher Server Setup

  • Rancher version: 2.9.1 (same problem on 2.8.x)
  • Installation option (Docker install/Helm Chart): Helm chart, RKE1 Rancher managers, versions are 1.30.4 and 1.30.5
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: Local cluster 1.30.4, AKS clusters 1.30.5
  • Cluster Type (Local/Downstream): Problem occurs on AKS downstream clusters. Created with the azurerm Terraform provider, imported with the official Rancher Terraform provider
  • We actually have two on-prem Rancher management clusters, each managing several downstream user clusters. All AKS clusters connected to either of the Rancher management clusters experience this issue

User Information

  • What is the role of the user logged in? Admin, though this is not relevant for this issue

Describe the bug
We're having a problem that mainly seems to affect our cattle-cluster-agents.

We're running a hybrid infrastructure with on-prem (vSphere-based, RKE1) Kubernetes clusters and AKS clusters in Azure. All are version 1.30.4 or 1.30.5. Rancher is version 2.9.x; the problem was also present in 2.8.x. What we're seeing on all our AKS nodes (9 clusters in total, about 75 nodes) is that after running for a while, mostly 2 to 4 weeks, our cattle-cluster-agents begin restarting.

In the event logs of these crashing pods, we see a huge number of events like the one below:

Normal   SandboxChanged  17m (x1170 over 5d4h)     kubelet  Pod sandbox changed, it will be killed and re-created.

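For context, this is roughly how we pull up those events (the pod name is just our example; adjust to your own):

kubectl -n cattle-system describe pod cattle-cluster-agent-7987d6c4f9-jmb4x
kubectl -n cattle-system get events --field-selector involvedObject.name=cattle-cluster-agent-7987d6c4f9-jmb4x
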
Of course we did our due diligence and performed the usual troubleshooting. We ensured memory and CPU can't be an issue, verified nothing is blocked by a firewall, checked the containerd config (most notably the systemd cgroup driver), refreshed the containerd config, and checked the kubelet, kube-proxy, and containerd logs.

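A rough sketch of the checks we ran on the nodes themselves (file paths and unit names are the AKS defaults as far as we know, so treat them as assumptions):

# containerd cgroup driver -- we expect SystemdCgroup = true
sudo grep -n SystemdCgroup /etc/containerd/config.toml

# kubelet and containerd logs around the restarts
sudo journalctl -u kubelet --since "1 hour ago" --no-pager
sudo journalctl -u containerd --since "1 hour ago" --no-pager

# kube-proxy logs (daemonset name assumed; it may differ per AKS version)
kubectl -n kube-system logs ds/kube-proxy --tail=200
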
This is from the Rancher pod logs on the management cluster at the time of the cattle-cluster-agent crash:

2024/11/21 07:37:06 [INFO] error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF

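We pull these from the local (management) cluster with something along these lines (assuming the default app=rancher label from the Helm chart):

kubectl -n cattle-system logs -l app=rancher --tail=500 | grep remotedialer
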
This is the tail of the cattle-cluster-agent log right before it crashes:

W1121 11:27:24.673193      56 warnings.go:70] v1 ComponentStatus is deprecated in v1.19 
time="2024-11-21T11:56:57Z" level=warning msg="signal received: \"terminated\", canceling context..."
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=GroupMember workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=Group workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down management.cattle.io/v3, Kind=UserAttribute workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=info msg="Shutting down /v1, Kind=Secret workers"
time="2024-11-21T11:56:57Z" level=fatal msg="Embedded rancher failed to start: context canceled"

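The tail above was captured with something like the following; --previous shows the crashed instance, and the container name cluster-register matches the kubelet log further below:

kubectl -n cattle-system logs cattle-cluster-agent-7987d6c4f9-jmb4x -c cluster-register --previous --tail=50
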
We cannot figure out why this is happening. Below is a piece of the journalctl log from one of the AKS nodes, which I believe captures one occurrence of such a restart.

Nov 21 06:44:15 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:15.328958    2881 kuberuntime_container.go:779] "Killing container with a grace period" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" podUID="47c0d77e-695a-4376-b3a9-473e0f859c9d" containerName="cluster-register" containerID="containerd://960f881a193f4ed9e772c9115b72e0629c96f928a8754c214a62b549d1816b27" gracePeriod=30
Nov 21 06:44:15 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:15.532578    2881 kubelet.go:2474] "SyncLoop (PLEG): event for pod" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" event={"ID":"47c0d77e-695a-4376-b3a9-473e0f859c9d","Type":"ContainerDied","Data":"960f881a193f4ed9e772c9115b72e0629c96f928a8754c214a62b549d1816b27"}
Nov 21 06:44:16 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:16.537074    2881 kubelet.go:2474] "SyncLoop (PLEG): event for pod" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" event={"ID":"47c0d77e-695a-4376-b3a9-473e0f859c9d","Type":"ContainerDied","Data":"54b4916f0f086879e17727de116c28ea1c77b6a9b921429803da306d56320a18"}
Nov 21 06:44:16 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:16.537171    2881 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x"
Nov 21 06:44:16 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:16.537404    2881 util.go:48] "No ready sandbox for pod can be found. Need to start a new one" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x"
Nov 21 06:44:16 aks-default-22588247-vmss000002 containerd[8178]: time="2024-11-21T06:44:16.689406266Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cattle-cluster-agent-7987d6c4f9-jmb4x,Uid:47c0d77e-695a-4376-b3a9-473e0f859c9d,Namespace:cattle-system,Attempt:2891,}"
Nov 21 06:44:16 aks-default-22588247-vmss000002 containerd[8178]: time="2024-11-21T06:44:16.894431155Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:cattle-cluster-agent-7987d6c4f9-jmb4x,Uid:47c0d77e-695a-4376-b3a9-473e0f859c9d,Namespace:cattle-system,Attempt:2891,} returns sandbox id \"aa0573a866230772f5328aeaf2eddb1eeed979dddf2a0453be0cae6ef16437d9\""
Nov 21 06:44:16 aks-default-22588247-vmss000002 kubelet[2881]: E1121 06:44:16.896895    2881 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 20s restarting failed container=cluster-register pod=cattle-cluster-agent-7987d6c4f9-jmb4x_cattle-system(47c0d77e-695a-4376-b3a9-473e0f859c9d)\"" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" podUID="47c0d77e-695a-4376-b3a9-473e0f859c9d"
Nov 21 06:44:17 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:17.541592    2881 kubelet.go:2474] "SyncLoop (PLEG): event for pod" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" event={"ID":"47c0d77e-695a-4376-b3a9-473e0f859c9d","Type":"ContainerStarted","Data":"aa0573a866230772f5328aeaf2eddb1eeed979dddf2a0453be0cae6ef16437d9"}
Nov 21 06:44:17 aks-default-22588247-vmss000002 kubelet[2881]: E1121 06:44:17.542093    2881 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 20s restarting failed container=cluster-register pod=cattle-cluster-agent-7987d6c4f9-jmb4x_cattle-system(47c0d77e-695a-4376-b3a9-473e0f859c9d)\"" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" podUID="47c0d77e-695a-4376-b3a9-473e0f859c9d"
Nov 21 06:44:18 aks-default-22588247-vmss000002 kubelet[2881]: E1121 06:44:18.544029    2881 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 20s restarting failed container=cluster-register pod=cattle-cluster-agent-7987d6c4f9-jmb4x_cattle-system(47c0d77e-695a-4376-b3a9-473e0f859c9d)\"" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" podUID="47c0d77e-695a-4376-b3a9-473e0f859c9d"
Nov 21 06:44:31 aks-default-22588247-vmss000002 kubelet[2881]: E1121 06:44:31.329162    2881 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cluster-register\" with CrashLoopBackOff: \"back-off 20s restarting failed container=cluster-register pod=cattle-cluster-agent-7987d6c4f9-jmb4x_cattle-system(47c0d77e-695a-4376-b3a9-473e0f859c9d)\"" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" podUID="47c0d77e-695a-4376-b3a9-473e0f859c9d"
Nov 21 06:44:42 aks-default-22588247-vmss000002 kubelet[2881]: I1121 06:44:42.606321    2881 kubelet.go:2474] "SyncLoop (PLEG): event for pod" pod="cattle-system/cattle-cluster-agent-7987d6c4f9-jmb4x" event={"ID":"47c0d77e-695a-4376-b3a9-473e0f859c9d","Type":"ContainerStarted","Data":"f8d74a312d22129b48943bd726f4e28fac7257491775c415857274d791d33295"}

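The snippet was collected directly on the node (via SSH into the VMSS instance), roughly like this; the timestamps are from our occurrence:

sudo journalctl -u kubelet -u containerd --since "2024-11-21 06:43:00" --until "2024-11-21 06:46:00" | grep cattle-cluster-agent
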
Updating the node image on the AKS nodes, or simply creating new AKS nodes and deleting the old ones, solves the problem (see the command sketch below), but only for about 2 to 4 weeks. Then the problem recurs and we start all over again.

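The workaround is a plain node image upgrade per node pool, roughly as follows (resource group, cluster and node pool names are placeholders):

az aks nodepool upgrade \
  --resource-group <resource-group> \
  --cluster-name <aks-cluster> \
  --name <nodepool> \
  --node-image-only
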
Our on-prem nodes (RKE1, vSphere, Kubernetes 1.30.4 and 1.30.5) do not exhibit this behavior. Any tips or help would be greatly appreciated.

To Reproduce
Create a new AKS cluster, import it into Rancher (a condensed sketch follows below), and wait 2 to 4 weeks.

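A condensed sketch of the reproduction; names and the registration URL are placeholders (we normally do this through Terraform, but the CLI equivalent is easier to show):

az aks create --resource-group <rg> --name <cluster> --kubernetes-version 1.30.5 --node-count 3
az aks get-credentials --resource-group <rg> --name <cluster>
# the registration manifest URL is provided by Rancher when importing the cluster
kubectl apply -f https://<rancher-server>/v3/import/<registration-token>.yaml
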
Result
Cattle-cluster-agents start crashing after a while (approx. 2 to 4 weeks).

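The restarts are easy to spot from the restart counters (label selector assumed from the default agent deployment):

kubectl -n cattle-system get pods -l app=cattle-cluster-agent -o wide -w
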
Expected Result
The cattle-cluster-agents on the AKS nodes remain stable, do not restart, and stay connected to their on-prem Rancher management cluster.

Screenshots
Rancher GUI: error-20241119-083226 (screenshot)

K9s status: image-20241119-073249 (screenshot)

Additional context
Please let me know if you need any additional info.

iohenkies added the kind/bug label on Nov 21, 2024