
When increasing the kube-api-qps param in kube-controller-manager, etcd 2.2.2 runs out of memory #18266

Closed
magicwang-cn opened this issue Dec 7, 2015 · 93 comments
Labels
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability - Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@magicwang-cn
Contributor

@wojtek-t
I have created a k8s cluster with 400 nodes and 16000 pods (40 pods per node), backed by a 5-node etcd cluster (8G memory, 4 CPUs per node). While running the e2e density test I hit an etcd problem (OOM). Here are the details:

First, this issue is definitely caused by a kube-controller-manager param (--kube-api-qps, raised from 20 to 100, which increases the rate of pod add/delete operations).
Second, and this is my doubt: batch-creating pods (creating an RC with 16000 replicas) does not trigger this problem, but batch-deleting pods (scaling the RC to 0 replicas) does (both use the etcd watch method).
Finally, could it be a misuse of etcd?

Details:
Deploying the 16000 pods does not trigger the problem; etcd memory usage stays at about 800~900M (out of 8G) and the etcd snapshot is only about 46M.
When I batch-delete the pods, etcd memory usage climbs straight up, and then etcd OOMs.
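For context, --kube-api-qps (together with --kube-api-burst) governs client-side rate limiting of the controller-manager's requests to the apiserver. Below is a minimal sketch, assuming a plain client-go client rather than the controller-manager's actual wiring, of the knobs this flag corresponds to:

```go
package sketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newRateLimitedClient builds a client whose request rate is capped in the
// same way --kube-api-qps / --kube-api-burst cap the controller-manager's
// clients. The concrete numbers here are illustrative, not recommendations.
func newRateLimitedClient(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Raising these values lets a controller issue more create/update/delete
	// requests per second, which raises the write rate that lands on etcd.
	cfg.QPS = 20   // steady-state requests per second
	cfg.Burst = 30 // short bursts above the steady-state rate
	return kubernetes.NewForConfig(cfg)
}
```

Raising the QPS from 20 to 100, as described above, multiplies the peak request rate that the apiserver, and in turn etcd, has to absorb.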

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

@xiang90 @hongchaodeng @gmarek

First, this issue is definitely caused by a kube-controller-manager param (--kube-api-qps, raised from 20 to 100, which increases the rate of pod add/delete operations).

If you are changing this param - then yes it can cause problems. Default values are not set accidentally - those are the values that we think work relatively well.

Second, and this is my doubt: batch-creating pods (creating an RC with 16000 replicas) does not trigger this problem, but batch-deleting pods (scaling the RC to 0 replicas) does (both use the etcd watch method).
Finally, could it be a misuse of etcd?

Sorry - I'm afraid I didn't understand this comment. Can you please clarify?

@magicwang-cn
Contributor Author

@wojtek-t
Thanks for your help.
I mean that the param affects both creating and deleting pods, so why does deploying the 16000 pods not trigger this problem while "kubectl scale --replicas=0" does? What is causing this?

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

@magicwang-cn - hmm... I'm not sure.

Details:
Deploying the 16000 pods does not trigger the problem; etcd memory usage stays at about 800~900M (out of 8G) and the etcd snapshot is only about 46M.
When I batch-delete the pods, etcd memory usage climbs straight up, and then etcd OOMs.

@xiang90 @hongchaodeng - are folks working also on etcd - any thoughts on it?

@gmarek
Contributor

gmarek commented Dec 7, 2015

During my tests I also saw growth in etcd resource usage during deletion, though it was never that bad. AFAICT pod addition and deletion are symmetric operations, so there shouldn't be any difference on our part (but I may be missing something @lavalamp @mikedanese).

To make your setup crystal clear: you're running 5 nodes for etcd and 1 node for the master (API server etc.), right? We're generally not running distributed etcd now, so it may be some trait of Raft. Note that I'm shooting in the dark here.

@magicwang-cn
Contributor Author

@gmarek @wojtek-t @xiang90
You're running 5 nodes for etcd and 1 node for the master (API server etc.), right? --- yes
And the etcd log is attached to this issue.
I do not think it is a trait of Raft; it very likely relates to the etcd watch.

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

I don't think it's related to watch, because there is no difference between create & delete "event" from the watch perspective...

@a-robinson a-robinson added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Dec 7, 2015
@lavalamp
Member

lavalamp commented Dec 8, 2015

We now do "graceful" deletion by default-- I'm not sure, but that may add some multiplier for etcd reads/writes. @smarterclayton ?

@smarterclayton
Contributor

smarterclayton commented Dec 8, 2015 via email

@smarterclayton
Contributor

Also, if you create and delete pods rapidly, you may incur a high level of conflict amplification (since initial pod create is followed by several writes from scheduler and Kubelet). Multiple clients are writing and retrying in that window.
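A hedged illustration of the "writing and retrying" pattern described above, using client-go's retry.RetryOnConflict helper. The annotation key and function name are hypothetical, and the real scheduler/kubelet code paths differ, but the point stands: each conflict costs an extra GET plus another write attempt against the apiserver, and hence etcd.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// annotatePod is a hypothetical writer racing with other components on a
// freshly created pod. Every resourceVersion conflict triggers another
// GET + Update round trip.
func annotatePod(cs *kubernetes.Clientset, ns, name, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations["example.com/note"] = value // hypothetical key
		_, err = cs.CoreV1().Pods(ns).Update(context.TODO(), pod, metav1.UpdateOptions{})
		return err // a Conflict error here makes the closure run again
	})
}
```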

@lavalamp
Member

lavalamp commented Dec 8, 2015

Thanks, to me that sounds sufficient to explain the reported difference between pod creates and deletes.

@magicwang-cn
Contributor Author

@smarterclayton @lavalamp
So compared to creating pods, deleting pods involves these additional steps, each with a multiplier (see the sketch below):
read obj before delete (https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L382)
graceful delete (https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L397-L432)
status change (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1922)
Am I right?
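A minimal client-side sketch of the graceful-delete step, using today's client-go signatures (which postdate this thread): roughly, the server first records the grace period and marks the object for deletion, the kubelet's status updates follow, and only then is the key actually removed, which is where the extra writes come from.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodGracefully issues a graceful delete. The single call fans out on
// the server side into a read of the object, an update marking it for
// deletion, later status updates from the kubelet, and only then the final
// removal - several etcd writes for one logical "delete".
func deletePodGracefully(cs *kubernetes.Clientset, ns, name string) error {
	grace := int64(30) // illustrative grace period; 0 would force immediate deletion
	return cs.CoreV1().Pods(ns).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}
```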

@lavalamp
Member

lavalamp commented Dec 8, 2015

Right. So the maximum safe deletion rate may be slower than the maximum safe creation rate.


@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @xiang90
So is it necessary to separate the creation rate and the deletion rate?

@lavalamp
Member

lavalamp commented Jan 4, 2016

It's a good question. I don't think it's worth the complexity cost at the moment. Ideally we would want to make an analogous change in all controllers, not just replication controller, which to me says apiserver should itself do the rate limiting and push that back onto clients with 429s.
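A sketch of that idea in generic terms, not the apiserver's actual implementation: admit requests through a server-side token bucket and answer with 429 Too Many Requests once the budget is exhausted, so clients back off instead of pushing the load straight through to etcd. The limiter values are illustrative.

```go
package sketch

import (
	"net/http"

	"golang.org/x/time/rate"
)

// withRateLimit wraps a handler and rejects over-budget requests with 429.
func withRateLimit(limiter *rate.Limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// Example wiring: allow 100 requests/s with bursts of 200.
//   handler := withRateLimit(rate.NewLimiter(100, 200), apiMux)
```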


@Random-Liu
Member

@yujuhong Is this related to what we discussed before about the graceful deletion implementation on Kubelet side?

@yujuhong
Contributor

yujuhong commented Jan 4, 2016

@Random-Liu, there is suboptimal logic in kubelet's status update that could cause extra GETs after a pod has been deleted, leading to even slower batch deletion. However, I don't think it'd affect etcd memory usage.

@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @hongchaodeng @xiang90
I had a look at etcd's code and found that once etcd receives a request, it clones the store's data and then writes the data into the response. @hongchaodeng am I right? see the code
So, as @smarterclayton said, deleting adds 2-3 writes in the happy path. The cloned data must be large; this may be the cause.
etcd_mem
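Purely as an illustration of why per-request cloning is expensive (this is not etcd's actual store code): if each read or delete of a directory deep-copies the whole subtree before serializing the response, then many concurrent requests over a large keyspace hold many full copies in memory at once, which would match the sharp heap growth in the graph above.

```go
package sketch

// node is a toy stand-in for a key-value store's in-memory tree.
type node struct {
	Key      string
	Value    string
	Children []*node
}

// clone deep-copies a node and all of its descendants.
func (n *node) clone() *node {
	c := &node{Key: n.Key, Value: n.Value}
	for _, child := range n.Children {
		c.Children = append(c.Children, child.clone())
	}
	return c
}

// serve sketches a handler that copies before responding: every request pays
// a full copy of the subtree it touches, so N concurrent requests over a big
// directory cost N full copies of that directory at the same time.
func serve(root *node) *node {
	return root.clone()
}
```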

@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @hongchaodeng @xiang90 @HardySimpson
What's more, when the deletion begins, the Prometheus data from etcd looks as follows (showing rapid growth):
etcd_goroutines
etcd_mem_alloc
etcd_mem_heap_inuse
etcd_mem_malloc
etcd_resident_mem
etcd_whole

@lavalamp
Member

lavalamp commented Jan 6, 2016

Thanks for the charts. Judging from the stats, it looks like deletes are not the culprit, since there were only 1000 of them. There's almost 200k compare&swaps. So I think our "flurry of writes right before delete" theory is probably correct. That, or deletes are REALLY expensive.


@xiang90
Contributor

xiang90 commented Jan 6, 2016

@magicwang-cn Can you also get the store operation (reads, writes, deletes, etc.) rate vs time graph? Then we can probably tell which operation causes the usage spike.

@xiang90
Contributor

xiang90 commented Jan 6, 2016

And it seems the time scales of the graphs are not the same.

@magicwang-cn
Contributor Author

@magicwang-cn Can you also get the store operation (reads, writes, deletes, etc.) rate vs time graph? Then we can probably tell which operation causes the usage spike.

OK, I will do this as soon as possible.

And it seems the time scales of the graphs are not the same.

No, they are the data from the same period.

@xiang90 @lavalamp

@magicwang-cn
Contributor Author

@xiang90 @lavalamp @hongchaodeng @smarterclayton @HardySimpson
These are the store operation graphs:
etcd_reads_total
etcd_watch_request
etcd_writes_compareandswap
etcd_writes_create
etcd_writes_delete
etcd_writes_set

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn It seems like the 1000 concurrent deletes were killing etcd. Can you find a way to rate-limit them and give it another try? Or can you try giving more memory to etcd to see if it survives?

@magicwang-cn
Contributor Author

@xiang90
Yeah, the apiserver's rate limit may solve this problem; I am planning to test it (changing the --kube-api-qps param of the controller-manager from 20 to 10).
But having every request do a clone operation may not be a good approach, right? Or could etcd itself rate-limit in order to keep surviving?

@magicwang-cn
Contributor Author

@xiang90 @lavalamp
With the first method (changing the --kube-api-qps param of the controller-manager from 20 to 10), the etcd problem still exists.

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn If you look at the graph you generated, the delete rate is apparently much higher than 20 or 10. It is on the order of 100.

@magicwang-cn
Contributor Author

@magicwang-cn If you look at the graph you generated, the delete rate is apparently much higher than 20 or 10. It is on the order of 100.

@xiang90 No, it is just about 10/s, as you can see below:
etcd_delete_begin
etcd_delete_end

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn I still cannot see from the graph you just showed me why it is 10. I think you showed me the wrong graph? There are action, group, instance, and job on the graph, but no rate.

@magicwang-cn
Contributor Author

/cc @davidopp @alfred-huangjian

@smarterclayton
Contributor

Memory in etcd and master server will continue to be optimized through protobuf over the next few releases, so we'll be reducing a significant amount of I/O and garbage. Etcd3 should also reduce total memory use.

@timothysc
Member

@hongchaodeng @magicwang-cn you should see the error in the logs - is the log data available?

Also, if your theory is correct, we should be able to repro it with a UT on the cache layer.

@magicwang-cn
Contributor Author

@smarterclayton is that already in the plan?

@timothysc the error in the log is (too old resourceVersion)

@smarterclayton
Contributor

smarterclayton commented Jan 20, 2016 via email

@lavalamp
Member

@hongchaodeng

Could anyone help explain why the client needs to relist if the watch returns an error? It sounds to me that the client is behaving too aggressively.

If a watch returns an error, that means changes have been made since the client listed. If the client needs to have a 100% correct state, it must then relist. Otherwise, it may never return to a correct state (imagine one of the missed changes was a deletion).

A flood of lists is quite possibly a cause of the problem, but it's not the ultimate cause. The question is why did the previous watch close? Whatever causes watches to close prematurely is the thing that needs to be fixed.

@hongchaodeng
Contributor

@lavalamp
See my analysis above.

why did the previous watch close

There are three parties involved:

  • client
  • cacher with a reflector updating watch cache
  • watch cache inside cacher

Here's the detailed error analysis:

  • The watch cache holds a fixed-size queue of data.
  • The watch cache is updated really fast because there are a lot more state changes than it can hold.
  • The cacher returns an error in Watch() because the watch cache no longer contains the old version of the data.
  • The client watch is closed because the cacher returns an error (too old resource version).
  • The client relists after the watch is closed, and that causes a flood of lists to etcd (a minimal sketch of this relist loop follows below).
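A minimal sketch of the client loop behind that chain (assumed names, modern client-go signatures, not the real reflector): when the watch dies with a "too old resource version" error, the client falls back to a full relist, and many clients doing this at once is what floods etcd.

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAndWatchPods keeps a pod view up to date, relisting whenever the watch
// window is lost. The relist is the expensive part when many clients do it.
func listAndWatchPods(ctx context.Context, cs *kubernetes.Clientset, ns string) error {
	for {
		pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{}) // full relist
		if err != nil {
			return err
		}
		rv := pods.ResourceVersion // resume watching from the list's version

		w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
		if err != nil {
			if apierrors.IsGone(err) { // "too old resource version": window lost, relist
				continue
			}
			return err
		}
		for ev := range w.ResultChan() {
			fmt.Println("event:", ev.Type)
		}
		// The watch channel closed; loop back, relist, and re-establish the watch.
	}
}
```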

@hongchaodeng
Contributor

FYI, I'm trying to reproduce it in UT.

@lavalamp
Member

@hongchaodeng I was responding to your analysis.

There are two cases:

a) a list has just completed and we're attempting to establish a watch.
b) a watch has been initiated and we're trying to keep up.

The situation you're outlining is a potential problem for case a), when we're trying to get a watch started. But that means we've just done a list for some reason-- why did we need to do that list at all? My claim is that something must have gone wrong in steady state, b).

@hongchaodeng
Contributor

@gmarek @timothysc

I talked with @lavalamp today w.r.t. this issue. Let me give a summary:

  • Daniel is worried that the cluster state has somehow become unsteady. That's the point we should figure out, and we should keep the state steady.
  • Daniel pointed out that even though the cache size has a limit, the current etcd v2 also has a limit on its history queue. Fortunately, etcd v3 has a theoretically infinite history (LSM tree) and different semantics. So I will research this topic further and give more analysis.
  • Increasing the cache size should be a quick duct-tape fix. @magicwang-cn told me he's OK with that.

@hongchaodeng
Contributor

In the meantime, I could verify:

  • the Watch() error of too old resource version can happen
  • List() going directly to etcd happened here (the return c.storage.List(ctx, key, resourceVersion, filter, listObj) line) for a number of resources like "services/specs", "pods", "minions", "nodes".

I still think the outdated watch version leads to this problematic unsteady state. Unfortunately I couldn't find the client logic responsible. I will continue to research potential clients and give a whole-story analysis of the client logic as well.

@lavalamp
Member

List of pods should be served out of the cache. The line you linked to is for clients that don't set the resource version; our own components ought to be setting the resource version.


@magicwang-cn
Contributor Author

@lavalamp yes, I see that even though Watch() failed with (too old resourceVersion), if the resourceVersion is set, List() also reads data from the cache. But from the graphs I captured before, when deleting pods the list action increases noticeably.
Is it caused by the list-watch itself?

@lavalamp
Member

@magicwang-cn I'm sorry, can you rephrase? I don't understand

@magicwang-cn
Contributor Author

@lavalamp I mean: if the resource version is set, then even though Watch() failed with too old resource version, List() still reads data from the cache.
But from the graphs I got before, when deleting pods the List() action increases noticeably (see below):

kube-apiserver list watch graph:
apiserver_listwatch_begin
apiserver_listwatch_end

So does something else cause the problem? Maybe here?

@lavalamp
Member

@magicwang-cn I still don't think I understand what you're saying. Note that the list handler returns a different resource version than it gets-- if you send "0", it sends back readResourceVersion, which should be the resource version the cache is actually at. If you then do a watch from readResourceVersion, you should be in the window.

The charts with a huge spike in lists tell me that there was a correlated watch failure across multiple components. The lists themselves are a symptom of the underlying cause.

@timothysc
Member

@magicwang-cn what objects are you failing on in the cache? Is it consistent?

xref: #19970

@magicwang-cn
Contributor Author

@lavalamp @hongchaodeng @timothysc
Today I found the clients that send List() requests without setting resourceVersion=0. They are mainly in the density test itself (especially DeleteRC, which lists pods from etcd and takes a long time), for example:
DeleteRC (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/util.go#L1921)
start (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/util.go#L1694)

I also printed a log behind this line (https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/cacher.go#L283) and found that there are still many clients that use List() requests without setting resourceVersion.

So what is the reasoning behind this judgment (https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/cacher.go#L281-#L285)?
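A small sketch of the client-side fix being discussed (modern client-go signatures; the e2e helpers at the time built their own options): passing ResourceVersion "0" tells the apiserver that any recent version is acceptable, so the list can be answered from its watch cache instead of reading the whole key range from etcd.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsFromCache lists pods in a way the apiserver can serve from its
// watch cache. Leaving ResourceVersion empty is what forces the list to go
// directly to etcd, which is the expensive path seen in the graphs above.
func listPodsFromCache(ctx context.Context, cs *kubernetes.Clientset, ns string) (*corev1.PodList, error) {
	return cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{ResourceVersion: "0"})
}
```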

@lavalamp
Member

It would be useful to log the client there so we can get a list of which components need to be updated.

I think only clients on every node really matter-- I don't think the e2e tests themselves will make enough traffic to harm the server, but I still think it'd be good to fix them.


@timothysc
Member

It would be useful to log the client there so we can get a list of which components need to be updated.

+1, also I wouldn't mind a V(4) log on the line above to diagnose.

@wojtek-t
Member

wojtek-t commented Feb 3, 2016

Sorry for jumping late into discussion - I was on paternity leave for a month.

Generally, I kind of agree with Daniel: if increasing the cache size helps, we have a much more serious problem, and we shouldn't fix it by changing cache sizes.

  • The watch cache holds a fixed-size queue of data.
  • The watch cache is updated really fast because there are a lot more state changes than it can hold.
  • The cacher returns an error in Watch() because the watch cache no longer contains the old version of the data.
  • The client watch is closed because the cacher returns an error (too old resource version).
  • The client relists after the watch is closed, and that causes a flood of lists to etcd.

The question is exactly why the Cacher returns an error in Watch - it definitely shouldn't.
Basically, a watch event is emitted to the watchers synchronously with reading it from etcd, so this should never happen.
I'm happy to help with debugging this, but I need a better way of reproducing it.

@spzala
Member

spzala commented Jul 5, 2017

@magicwang-cn hello, since this issue is over a year old and also etcd 3.2 is out - can we please verify if this issue is still valid? If not, can we please close it? Related issue #20540 has a comment that etcd3 should solve problems related to large cluster. Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
