-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
register Node/UpdateTaint event to plugins which has Node/Add only and doesn't have Node/UpdateTaint #122292
Conversation
Please note that we're already in Test Freeze for the Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Dec 13 05:43:10 UTC 2023. |
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@kerthcet This is my proposal. I know I have to add UTs but open it now to show you my idea. |
We have two approaches:
And I have two concerns:
|
cc @kubernetes/sig-scheduling-misc |
Thanks, that's a good summary. I'm on the same page with @kerthcet.
1
I think Yes, so I'd prefer this approach basically. |
/hold |
Given no clear objection to this idea so far, I'll work on making it ready this weekend. |
Can we do this only if the featuregate for scheduling hints is enabled? |
Regardless of whether QHint is enabled or not, this bug could happen when EventsToRegister has NodeAdd, but doesn't have NodeUpdate. (#109437 shows it happens before QHint) So, I believe we should do this even if the feature gate is disabled |
7d2e383
to
4e7890d
Compare
Rebased and squashed into one commit with fixing one point from Aldo last time. |
/retitle register Node/UpdateTaint event to plugins which has Node/Add only and doesn't have Node/UpdateTaint |
pInfo.UnschedulablePlugins
for a correct requeuein
/lgtm |
LGTM label has been added. Git tree hash: a5649617d9bd172b548ff6d2cf2b27e2af81e523
|
Also, it might be worth changing plugins such as NodeResourcesFit to use UpdateTaint instead of Update. But we can leave that for after the code freeze. |
@MrHohn Can you take a look at this PR? Only minor comment change is your area. |
@pohly could you give us a hand? |
@ahg-g and I agreed that, as a bug, this qualifies for the test freeze /milestone v1.30 |
I looked through the discussion thread, but I am too much out of the topic to have a specific opinion. |
@@ -396,7 396,7 @@ func (pl *dynamicResources) EventsToRegister() []framework.ClusterEventWithHint | |||
{Event: framework.ClusterEvent{Resource: framework.PodSchedulingContext, ActionType: framework.Add | framework.Update}, QueueingHintFn: pl.isSchedulableAfterPodSchedulingContextChange}, | |||
// A resource might depend on node labels for topology filtering. | |||
// A new or updated node may make pods schedulable. | |||
{Event: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel}}, | |||
{Event: framework.ClusterEvent{Resource: framework.Node, ActionType: framework.Add | framework.UpdateNodeLabel | framework.UpdateNodeTaint}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the code behind registerNodeAdded && !registerNodeTaintUpdated
below adding this automatically? Does it still make sense to also add this manually?
A short comment here that this is a workaround would be useful to avoid confusion when reading just this file, because nothing in it depends on node taints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it still make sense to also add this manually?
Yes, especially for QHint - When you add QHint for Node/Add, you probably want to have the same QHint for Node/UpdateNodeTaint, which you couldn't do if we let the scheduler automatically add UpdateNodeTaint
.
A short comment here that this is a workaround would be useful to avoid confusion when reading just this file, because nothing in it depends on node taints.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably want to have the same QHint for Node/UpdateNodeTaint,
This is the part that I don't get. This particular plugin doesn't check taints of a node. It leaves that to the other plugins which check whether a pod fits the node.
If you are adding it because "it might be needed", then it might be necessary. If it's getting added because "node added" is not delivered reliably (= i.e. it's a workaround), then it makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that "node added" events are not delivered reliably.
It's that a new node is likely not ready (it has a not-ready taint). And it needs to be retried when it becomes ready.
…y, doesn't have Node/UpdateNodeTaint
As suggested, added the comments in each plugin's Node/UpdateTaint event to give context to readers. @pohly Your approval would be helpful, as you're a test/OWNERS approver. We only have a small comment update in test/ area. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, pohly, sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/lgtm |
LGTM label has been added. Git tree hash: c0bb0070bcf9928962d1984973bfe05fdf9f7b4e
|
What type of PR is this?
/kind bug
What this PR does / why we need it:
register a failure plugin from preCheck into pInfo.UnschedulablePlugins so that we could technically prevent #109437 happening again.
Which issue(s) this PR fixes:
Part of #122284
Fixes #122306
Part of #122305
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: