
[question]: Node Under Utilization #133

Closed · 2 tasks done

uptownhr opened this issue Sep 11, 2024 · 2 comments
Labels: question (Further information is requested)

Comments

@uptownhr (Collaborator)

Prior Search

  • I have already searched this project's issues to determine if a similar question has already been asked.

What is your question?

After upgrading to edge.2024-09-04 and edge.2024-09-10, node utilization has been sitting around 50%. I watched for over 4 hours as pods stabilized and nodes were spun up and down. The nodes have now been stable for over 2 hours and are no longer consolidating.

Here are the event logs from nodes that I believe could have been consolidated but were blocked:

  • Disruption Blocked: `pdb "valut/vault" prevents pod eviction`
  • Disruption Blocked: `pdb "authentk/pvc-annotator-..." prevents pod eviction`
  • Disruption Blocked: `pdb "alb-controller/alb-controller" prevents pod eviction`
  • Disruption Blocked: `pdb "cert-manager/cert-manager-webhook" prevents pod eviction`
  • Unconsolidatable: can't remove without creating 2 candidates

I also noticed that not much is being scheduled onto the controller nodes. Both of my controller nodes have only 4 pods running each. I don't know if this is expected, but it seems different from what I remember.

(screenshots attached)

What can be done to resolve the PDBs, and why aren't they being scheduled on the controller nodes?

What primary components of the stack does this relate to?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@uptownhr uptownhr added question Further information is requested triage Needs to be triaged labels Sep 11, 2024
@fullykubed fullykubed removed the triage Needs to be triaged label Sep 11, 2024
@fullykubed (Member)

why aren't they being scheduled on the controller nodes?

Just verified that on the reference cluster running edge.2024-09-10, there is nothing preventing pods from most of our modules from running on controller nodes.

Keep in mind that the bin-packing scheduler will try to pack pods onto as few nodes as possible, so it is not unusual for some nodes to have very low utilization; that is what allows Karpenter to spin them down. However, Karpenter will never spin down a controller node, which is likely what you are seeing here.
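To illustrate the effect (this is a toy first-fit sketch, not Karpenter's or kube-scheduler's actual algorithm): packing pod resource requests onto as few nodes as possible typically fills most nodes to capacity and leaves a last node partly empty, which is exactly the kind of node Karpenter would normally consolidate away.

```python
def pack(pods, node_capacity):
    """First-fit-decreasing bin packing of pod requests onto nodes.

    Returns the per-node load; a hypothetical simplification of how a
    consolidating scheduler ends up with mostly-full nodes plus a remainder.
    """
    nodes = []
    for request in sorted(pods, reverse=True):
        for i, load in enumerate(nodes):
            if load + request <= node_capacity:
                nodes[i] += request
                break
        else:
            nodes.append(request)  # no node fits; open a new one
    return nodes

print(pack([2, 2, 2, 2], node_capacity=4))  # [4, 4]: fully packed nodes
print(pack([3, 3, 2], node_capacity=4))     # [3, 3, 2]: last node at 50%
```

The half-empty remainder node is what Karpenter would usually spin down, unless it is a controller node or a PDB blocks the evictions.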

We can look into a way to optimize the behavior here.

What can be done to resolve the PDBs?

A PDB blocks disruption when too few pods in its set are running and healthy to allow further disruption. You need to provide more information for each PDB, specifically why pods in its set are already unhealthy. Typically this is because they have already been evicted for one reason or another. You can look at the Kubernetes events to find the reasons for a pod's eviction.
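The arithmetic behind this can be sketched as follows (a simplified model of the `minAvailable` case, not the full Kubernetes disruption-controller logic, which also handles `maxUnavailable` and percentages): a voluntary eviction is permitted only while the number of healthy pods exceeds the PDB's floor.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """How many voluntary evictions a PDB with minAvailable would permit.

    Simplified model: eviction is allowed only while this is > 0.
    """
    return max(0, healthy_pods - min_available)

# A single healthy replica behind minAvailable: 1 can never be voluntarily
# evicted -- a common reason a PDB blocks node consolidation.
print(disruptions_allowed(healthy_pods=1, min_available=1))  # 0
print(disruptions_allowed(healthy_pods=2, min_available=1))  # 1
```

This is why a PDB whose pods are already unhealthy (or a single-replica workload with a strict PDB) reports "prevents pod eviction": the allowance is already zero before consolidation even starts.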


Looking at the reference cluster running edge.2024-09-10, I am seeing >90% utilization. You should ensure everything is upgraded and then look into why your cluster is unstable with respect to its PDBs.

@fullykubed (Member)

An optimization for the controller node bin-packing has been included in the next release.
