[BUG] [SandboxChanged] Pod sandbox changed, it will be killed and re-created on AKS nodes #48142
Labels
kind/bug
Rancher Server Setup
Information about the Cluster
azurerm Terraform provider, imported with the official rancher terraform provider
User Information
Describe the bug
We're having a problem that mainly seems to happen to our cattle-cluster-agents. We're running a hybrid infrastructure with on-prem (vSphere based, RKE1) Kubernetes clusters and AKS clusters at Azure. All are version 1.30.4 and 1.30.5. Rancher is version 2.9.x; the problem was also present in 2.8.x. What we're seeing on all our AKS nodes (9 clusters in total, about 75 nodes) is that after running for a while, mostly 2 to 4 weeks, our cattle-cluster-agents begin restarting. In the event logs of these crashing pods, we see a huge amount of the below:
Of course we did our due diligence and performed the required troubleshooting. We ensured memory and CPU can't be an issue, we ensured nothing is blocked on a firewall, we've checked the containerd config (most notably the systemd cgroups), refreshed the containerd config, checked kubelet logs, checked kube-proxy logs, checked containerd logs, etc.

This is from the rancher pod logs on the manager at the time of the cattle-cluster-agent crash:

This is the tail from the cattle-cluster-agent right before it crashes:

We cannot find why this is happening. This is a piece of log from journalctl on one of the AKS nodes, with what I believe captures one occurrence of such a restart.

Updating the node image on the AKS nodes or simply creating new AKS nodes and deleting the old ones solves this problem, but only for about 2 to 4 weeks. Then the problem reoccurs and we start all over again.
Our on-prem nodes (RKE1, vSphere, Kubernetes 1.30.4 and 1.30.5) do not experience this behavior. Any tips and help are greatly appreciated.
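Roughly, the checks and the temporary workaround looked like the sketch below. The paths, unit names, the app=cattle-cluster-agent label, and the Azure resource names (my-rg, my-aks, nodepool1) are illustrative placeholders, not values from our environment:

```sh
# On an affected AKS node: confirm containerd is configured for systemd cgroups
grep -i SystemdCgroup /etc/containerd/config.toml

# Pull containerd and kubelet logs around the time of a restart
sudo journalctl -u containerd --since "2 hours ago" --no-pager
sudo journalctl -u kubelet --since "2 hours ago" --no-pager

# From a workstation: state, events and previous logs of the crashing agent pod
kubectl -n cattle-system get pods -l app=cattle-cluster-agent
kubectl -n cattle-system describe pod <cattle-cluster-agent-pod>
kubectl -n cattle-system logs --previous <cattle-cluster-agent-pod>

# Temporary workaround: refresh the node image (buys us roughly 2 to 4 weeks)
az aks nodepool upgrade --resource-group my-rg --cluster-name my-aks \
  --name nodepool1 --node-image-only
```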
To Reproduce
Create a new AKS cluster, import the cluster in Rancher, wait 2 to 4 weeks.
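A rough illustration of these steps; the resource group, cluster name, node count, and the Rancher import URL below are placeholders:

```sh
# Create an AKS cluster on one of the affected Kubernetes versions
az aks create --resource-group my-rg --name aks-repro \
  --node-count 3 --kubernetes-version 1.30.5
az aks get-credentials --resource-group my-rg --name aks-repro

# Register the cluster in Rancher (generic import); the actual manifest URL
# comes from the Rancher UI or the rancher2 Terraform provider
kubectl apply -f https://<rancher-server>/v3/import/<registration-token>.yaml

# Wait roughly 2 to 4 weeks and watch for cattle-cluster-agent restarts
kubectl -n cattle-system get pods -w
```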
Result
Cattle-cluster-agents start crashing after a while (approx. 2 to 4 weeks).
Expected Result
Cattle-cluster-agent pods on AKS nodes remain stable, don't restart, and remain connected to one of the on-prem Rancher management clusters.
Screenshots
Rancher GUI:
K9s status:
Additional context
Please let me know if you need any additional info.