Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TaintController #40355

Merged
merged 2 commits into from
Feb 10, 2017
Merged

TaintController #40355

merged 2 commits into from
Feb 10, 2017

Conversation

gmarek
Copy link
Contributor

@gmarek gmarek commented Jan 24, 2017

This PR adds a manager to NodeController that is responsible for removing Pods from Nodes tainted with NoExecute Taints. This feature is beta (as the rest of taints) and enabled by default. It's gated by controller-manager enable-taint-manager flag.

@gmarek gmarek added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Jan 24, 2017
@gmarek gmarek self-assigned this Jan 24, 2017
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 24, 2017
@k8s-reviewable
Copy link

This change is Reviewable

@k8s-github-robot k8s-github-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API size/XXL Denotes a PR that changes 1000 lines, ignoring generated files. release-note-label-needed labels Jan 24, 2017
@gmarek gmarek force-pushed the nc-taint-handling branch 2 times, most recently from 2eb71de to 5b33298 Compare January 24, 2017 15:51
@gmarek
Copy link
Contributor Author

gmarek commented Jan 24, 2017

@timothysc timothysc added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-label-needed labels Jan 24, 2017
@@ -141,6 141,8 @@ type NodeController struct {
daemonSetStore listers.StoreToDaemonSetLister
// allocate/recycle CIDRs for node if allocateNodeCIDRs == true
cidrAllocator CIDRAllocator
// manages taints
taintManager *TaintController
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a detailed description (from our emails) of why the TaintController is special cased to the NodeController, as this is different then all other controllers.

/cc @aveshagarwal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a very, very early untested prototype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So don't expect comments, or anything;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's further along now so I think comment is appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup - I'm applying all comments today. @timothysc - do you have a piece of email I can paste here (I wasn't taking part in those discussions)

@gmarek gmarek force-pushed the nc-taint-handling branch from 5b33298 to 796cf76 Compare January 24, 2017 16:23
@timothysc timothysc added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 24, 2017
continue
}
podNamespacedName := types.NamespacedName{Namespace: pods[i].Namespace, Name: pods[i].Name}
if len(tolerations) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume there is a node nodeA tainted with taint {foo=bar:NoSchedule}, and pod podA which has no tolerations is running on nodeA, according to the check here, it results in podA evicted.
Actually what we want is just evict pods those don't tolerate NoExecute taints of nodes, or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do filter everything except NoExecute taints in XUpdated functions, so no other taints appear in the rest of the code (I hope).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that's fine.

@gmarek gmarek force-pushed the nc-taint-handling branch from 796cf76 to dff53b8 Compare January 25, 2017 16:07
@gmarek
Copy link
Contributor Author

gmarek commented Jan 25, 2017

I added unit tests. Only e2e test, flag gate in NC and logging tweaks are needed to finish this PR. I hope to finish it tomorrow. @kevin-wangzefeng @davidopp @wojtek-t @timothysc

Oh, and we need way more pkg/controller approvers: @bprashanth @mikedanese @derekwaynecarr @deads2k - probably split by controller.

@gmarek gmarek force-pushed the nc-taint-handling branch from dff53b8 to 4969754 Compare January 25, 2017 16:22
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 25, 2017
@gmarek gmarek force-pushed the nc-taint-handling branch 5 times, most recently from 08d4aff to 92f79c9 Compare January 26, 2017 15:02
@gmarek
Copy link
Contributor Author

gmarek commented Jan 26, 2017

I'm getting errors:

I0126 15:04:54.761] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:51: undefined: CertificateSigningRequest
I0126 15:04:54.762] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:62: undefined: CertificateSigningRequest
I0126 15:04:54.762] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:66: undefined: CertificateSigningRequest
I0126 15:04:54.763] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:77: undefined: CertificateSigningRequest
I0126 15:04:54.764] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:81: undefined: CertificateSigningRequestCondition
I0126 15:04:54.764] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:89: undefined: CertificateSigningRequestCondition
I0126 15:04:54.765] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:93: undefined: CertificateSigningRequestCondition
I0126 15:04:54.766] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:101: undefined: CertificateSigningRequestCondition
I0126 15:04:54.766] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:105: undefined: CertificateSigningRequestList
I0126 15:04:54.767] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:111: undefined: CertificateSigningRequestList
I0126 15:04:54.768] k8s.io/kubernetes/pkg/apis/certificates/v1alpha1/zz_generated.conversion.go:111: too many errors

@mikedanese @liggitt @cjcullen - any ideas?

@bprashanth
Copy link
Contributor

Re #40355 (comment), there's an OWNERS file per dir, the top lever owners are just for those directories that don't have one, I think. Feel free to add people to it.

@liggitt
Copy link
Member

liggitt commented Jan 26, 2017

the certificates api dropped v1alpha1 types in #39772

@timothysc timothysc added this to the v1.6 milestone Feb 8, 2017
@timothysc
Copy link
Member

@wojtek-t @davidopp I only have a minor nit on testing now. I'd like to get this one out of the way and move on now that I'm unblocked.

@timothysc
Copy link
Member

please squash, and we need to get the tests green again.

@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: gmarek, mikedanese

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @deads2k @vishh @smarterclayton
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@gmarek
Copy link
Contributor Author

gmarek commented Feb 10, 2017

@timothysc @wojtek-t squashed commits.

@timothysc timothysc added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 10, 2017
@timothysc
Copy link
Member

Reviewed 5 of 17 files at r7, 1 of 2 files at r10, 6 of 8 files at r12, 2 of 2 files at r13, 1 of 1 files at r14.
Review status: all files reviewed at latest revision, 38 unresolved discussions.


Comments from Reviewable

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 39418, 41175, 40355, 41114, 32325)

@k8s-github-robot k8s-github-robot merged commit a7a74b5 into kubernetes:master Feb 10, 2017
@@ -104,6 104,7 @@ func NewCMServer() *CMServer {
ClusterSigningCertFile: "/etc/kubernetes/ca/ca.pem",
ClusterSigningKeyFile: "/etc/kubernetes/ca/ca.key",
ReconcilerSyncLoopPeriod: metav1.Duration{Duration: 5 * time.Second},
EnableTaintManager: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this should be true.

@gmarek What are you referring to as the first of the three PRs -- is it #41068 ?

@@ -196,6 197,7 @@ func (s *CMServer) AddFlags(fs *pflag.FlagSet, allControllers []string, disabled
fs.Float32Var(&s.UnhealthyZoneThreshold, "unhealthy-zone-threshold", 0.55, "Fraction of Nodes in a zone which needs to be not Ready (minimum 3) for zone to be treated as unhealthy. ")
fs.BoolVar(&s.DisableAttachDetachReconcilerSync, "disable-attach-detach-reconcile-sync", false, "Disable volume attach detach reconciler sync. Disabling this may cause volumes to be mismatched with pods. Use wisely.")
fs.DurationVar(&s.ReconcilerSyncLoopPeriod.Duration, "attach-detach-reconcile-sync-period", s.ReconcilerSyncLoopPeriod.Duration, "The reconciler sync wait time between volume attach detach. This duration must be larger than one second, and increasing this value from the default may allow for volumes to be mismatched with pods.")
fs.BoolVar(&s.EnableTaintManager, "enable-taint-manager", s.EnableTaintManager, "WARNING: Beta feature. If set to true enables NoExecute Taints and will evict all not-tolerating Pod running on Nodes tainted with this kind of Taints.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to make this be a flag? The code would be simpler without it, and beta is enabled by default... I think if this feature is broken we would do a patch release, not just disable it.

@@ -312,6 356,9 @@ func NewNodeController(
utilruntime.HandleError(fmt.Errorf("Error allocating CIDR: %v", err))
}
}
if nc.taintManager != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to make taintManager conditional. If it is broken we will do a patch release. Making it unconditional will simplify the code.

}

func RunPodAndGetNodeName(c clientset.Interface, pod *v1.Pod, timeout time.Duration) (string, error) {
retries := 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already defined as a top-level const, so delete this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

func getNonExecuteTaints(taints []v1.Taint) []v1.Taint {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why Non? getNoExecuteTaints would be better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
}

// TimedWorkerQueue keeps a set of TimedWorkers that still wait for execution.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/still wait/are still waiting/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// TimedWorkerQueue keeps a set of TimedWorkers that still wait for execution.
type TimedWorkerQueue struct {
sync.Mutex
workers map[string]*TimedWorker
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment what is the map key

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

q.Lock()
defer q.Unlock()
if err == nil {
q.workers[key] = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't you delete(q.workers, key) in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid thrashing - it's enough to delete Pod once, so I keep the key as a way to dedup calls.

}
}

// Creates and startes a controller (informer) that watches updates on a pod in given namespace with given name. It puts a new
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/startes/starts/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 1,335 @@
/*
Copy link
Member

@davidopp davidopp Feb 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a lot of work but can you add a test with three pods on the same node

  • one tolerates T without tolerationSeconds
  • one tolerates T for some tolerationSeconds==t
  • one does not tolerate T

Then add taint T to the node and verify that the first one is never evicted, second is evicted after t, and third is evicted immediately?

A couple of other ideas for tests

  • same pod configuration as above, but add two taints T and S, and verify all three pods are evicted immediately
  • a pod that tolerates two taints, with different tolerationSeconds, then add both taints and verify that the pod is evicted after the min of the tolerationSeconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return func(args *WorkArgs) error {
ns := args.NamespacedName.Namespace
name := args.NamespacedName.Name
glog.V(0).Infof("NoExecuteTaintManager is deleting Pod: %v", args.NamespacedName.String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add an event here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

// AddWork adds a work to the WorkerQueue which will be executed not earlier than `fireAt`.
func (q *TimedWorkerQueue) AddWork(args *WorkArgs, createdAt time.Time, fireAt time.Time) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add V(4) logging at entry to this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

// CancelWork removes scheduled function execution from the queue.
func (q *TimedWorkerQueue) CancelWork(key string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add V(4) logging at entry to this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@davidopp
Copy link
Member

One other question -- I'm a bit rusty on how the Informer stuff works. Is it completely edge-based? That is, when node controller starts up, taint controller doesn't do anything until a pod or node is created/updated/deleted? Or is it level-based at startup, i.e. it somehow triggers the callbacks for every object in the system on startup? I think the behavior we want is that when NC starts up, it scans all the nodes and enqueues the relevant pods for eviction based on the node's taints and the pods' tolerations.

@liggitt
Copy link
Member

liggitt commented Feb 12, 2017

On startup, it gets an Add event for every existing object

@gmarek
Copy link
Contributor Author

gmarek commented Feb 13, 2017

@davidopp - I'll create a PR applying all your comments today. As for your question I think that what @liggitt wrote should suffice.

@davidopp
Copy link
Member

Thanks. And yeah, what @liggitt wrote answered my question.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.