
[question]: Node Under Utilization #133

Closed · 2 tasks done

uptownhr opened this issue Sep 11, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

@uptownhr (Collaborator)

Prior Search

  • I have already searched this project's issues to determine if a similar question has already been asked.

What is your question?

After upgrading to edge.2024-09-04 and edge.2024-09-10, node utilization has been sitting around 50%. I watched for over 4 hours as pods stabilized and nodes were spun up and down. The nodes have now been stable for over 2 hours and are no longer consolidating.

Here are the event logs from nodes that I believe could have been consolidated but were blocked:

  • Disruption Blocked: `pdb "valut/vault" prevents pod eviction`
  • Disruption Blocked: `pdb "authentk/pvc-annotator-..." prevents pod eviction`
  • Disruption Blocked: `pdb "alb-controller/alb-controller" prevents pod eviction`
  • Disruption Blocked: `pdb "cert-manager/cert-manager-webhook" prevents pod eviction`
  • Unconsolidatable: can't remove without creating 2 candidates

I also noticed that not much is being scheduled onto the controller nodes. Both of my controller nodes have only 4 pods running each. I don't know if this is expected, but it seems different from what I remember.

(screenshots attached)

What can be done to resolve the PDBs, and why aren't they being scheduled on the controller nodes?

What primary components of the stack does this relate to?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@uptownhr uptownhr added question Further information is requested triage Needs to be triaged labels Sep 11, 2024
@fullykubed fullykubed removed the triage Needs to be triaged label Sep 11, 2024
@fullykubed (Member)

why aren't they being scheduled on the controller nodes?

Just verified that on the reference cluster running edge.2024-09-10, there is nothing preventing pods from most of our modules from running on controller nodes.

Keep in mind that the bin-packing scheduler will try to pack pods onto as few nodes as possible, so it is not unusual for some nodes to have very low utilization; that is what allows Karpenter to spin them down. However, Karpenter will never spin down a controller node, which is likely what you are seeing here.
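To illustrate the effect (this is a toy first-fit sketch, not Karpenter's or kube-scheduler's actual algorithm): packing pod resource requests onto as few nodes as possible typically fills most nodes to capacity and leaves a last node partly empty, which is exactly the kind of node Karpenter would normally consolidate away.

```python
def pack(pods, node_capacity):
    """First-fit-decreasing bin packing of pod requests onto nodes.

    Returns the per-node load; a hypothetical simplification of how a
    consolidating scheduler ends up with mostly-full nodes plus a remainder.
    """
    nodes = []
    for request in sorted(pods, reverse=True):
        for i, load in enumerate(nodes):
            if load + request <= node_capacity:
                nodes[i] += request
                break
        else:
            nodes.append(request)  # no node fits; open a new one
    return nodes

print(pack([2, 2, 2, 2], node_capacity=4))  # [4, 4]: fully packed nodes
print(pack([3, 3, 2], node_capacity=4))     # [3, 3, 2]: last node at 50%
```

The half-empty remainder node is what Karpenter would usually spin down, unless it is a controller node or a PDB blocks the evictions.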

We can look into a way to optimize the behavior here.

What can be done to resolve the PDBs?

A PDB blocks disruption when too few pods in its set are running and healthy to allow further disruption. You need to provide more information for each PDB, specifically why pods in its set are already unhealthy. Typically this is because they have already been evicted for one reason or another. You can look at the Kubernetes events to find the reasons for a pod's eviction.
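The arithmetic behind this can be sketched as follows (a simplified model of the `minAvailable` case, not the full Kubernetes disruption-controller logic, which also handles `maxUnavailable` and percentages): a voluntary eviction is permitted only while the number of healthy pods exceeds the PDB's floor.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """How many voluntary evictions a PDB with minAvailable would permit.

    Simplified model: eviction is allowed only while this is > 0.
    """
    return max(0, healthy_pods - min_available)

# A single healthy replica behind minAvailable: 1 can never be voluntarily
# evicted -- a common reason a PDB blocks node consolidation.
print(disruptions_allowed(healthy_pods=1, min_available=1))  # 0
print(disruptions_allowed(healthy_pods=2, min_available=1))  # 1
```

This is why a PDB whose pods are already unhealthy (or a single-replica workload with a strict PDB) reports "prevents pod eviction": the allowance is already zero before consolidation even starts.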


Looking at the reference cluster running edge.2024-09-10, I am seeing >90% utilization. You should ensure everything is upgraded and then look into why your cluster is unstable with respect to its PDBs.

@fullykubed (Member)

An optimization for the controller node bin-packing has been included in the next release.
