
When increasing the kube-api-qps param in kube-controller-manager, etcd 2.2.2 runs out of memory #18266

Closed
magicwang-cn opened this issue Dec 7, 2015 · 93 comments
Labels
lifecycle/rotten - Denotes an issue or PR that has aged beyond stale and will be auto-closed.
priority/important-soon - Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/api-machinery - Categorizes an issue or PR as relevant to SIG API Machinery.
sig/scalability - Categorizes an issue or PR as relevant to SIG Scalability.

Comments

@magicwang-cn
Contributor

@wojtek-t
I have created a k8s cluster with 400 nodes and 16000 pods (40 pods per node), backed by a 5-node etcd cluster (8G memory, 4 CPUs per node). While running the e2e density test I hit an etcd problem (OOM). Here are the details:

First, this issue is definitely caused by a kube-controller-manager param (--kube-api-qps, raised from 20 to 100, which increases the rate of pod add/delete operations).
Second, and this is my doubt: batch-creating pods (creating an RC with 16000 replicas) does not trigger this problem, but batch-deleting pods (scaling the RC to 0 replicas) does (both use the etcd watch method).
Finally, could it be a misuse of etcd?

Details:
Deploying the 16000 pods does not trigger the problem; etcd memory usage stays at about 800~900M (out of 8G) and the etcd snapshot is only about 46M.
When I batch-delete the pods, etcd memory usage climbs straight up, and then etcd OOMs.
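For context, --kube-api-qps (together with --kube-api-burst) governs client-side rate limiting of the controller-manager's requests to the apiserver. Below is a minimal sketch, assuming a plain client-go client rather than the controller-manager's actual wiring, of the knobs this flag corresponds to:

```go
package sketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// newRateLimitedClient builds a client whose request rate is capped in the
// same way --kube-api-qps / --kube-api-burst cap the controller-manager's
// clients. The concrete numbers here are illustrative, not recommendations.
func newRateLimitedClient(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Raising these values lets a controller issue more create/update/delete
	// requests per second, which raises the write rate that lands on etcd.
	cfg.QPS = 20   // steady-state requests per second
	cfg.Burst = 30 // short bursts above the steady-state rate
	return kubernetes.NewForConfig(cfg)
}
```

Raising the QPS from 20 to 100, as described above, multiplies the peak request rate that the apiserver, and in turn etcd, has to absorb.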

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

@xiang90 @hongchaodeng @gmarek

First, this issue is definitely caused by a kube-controller-manager param (--kube-api-qps, raised from 20 to 100, which increases the rate of pod add/delete operations).

If you are changing this param - then yes it can cause problems. Default values are not set accidentally - those are the values that we think work relatively well.

Second, and this is my doubt: batch-creating pods (creating an RC with 16000 replicas) does not trigger this problem, but batch-deleting pods (scaling the RC to 0 replicas) does (both use the etcd watch method).
Finally, could it be a misuse of etcd?

Sorry - I'm afraid I didn't understand this comment. Can you please clarify?

@magicwang-cn
Contributor Author

@wojtek-t
Thanks for your help.
I mean that the param affects both creating and deleting pods, so why does deploying the 16000 pods not trigger this problem while "kubectl scale --replicas=0" does? What is causing this?

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

@magicwang-cn - hmm... I'm not sure.

Details:
Deploying the 16000 pods does not trigger the problem; etcd memory usage stays at about 800~900M (out of 8G) and the etcd snapshot is only about 46M.
When I batch-delete the pods, etcd memory usage climbs straight up, and then etcd OOMs.

@xiang90 @hongchaodeng - are folks working also on etcd - any thoughts on it?

@gmarek
Contributor

gmarek commented Dec 7, 2015

During my tests I also saw growth in etcd resource usage during deletion, though it was never that bad. AFAICT pod addition and deletion are symmetric operations, so there shouldn't be any difference on our part (but I may be missing something @lavalamp @mikedanese).

To make your setup crystal clear: you're running 5 nodes for etcd and 1 node for the master (API server etc.), right? We're generally not running distributed etcd now, so it may be some trait of Raft. Note that I'm shooting in the dark here.

@magicwang-cn
Contributor Author

@gmarek @wojtek-t @xiang90
You're running 5 nodes for etcd and 1 node for the master (API server etc.), right? --- yes
And the etcd log is attached to this issue.
I do not think it is a trait of Raft; it very likely relates to the etcd watch.

@wojtek-t
Member

wojtek-t commented Dec 7, 2015

I don't think it's related to watch, because there is no difference between create & delete "event" from the watch perspective...

@a-robinson a-robinson added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Dec 7, 2015
@lavalamp
Member

lavalamp commented Dec 8, 2015

We now do "graceful" deletion by default-- I'm not sure, but that may add some multiplier for etcd reads/writes. @smarterclayton ?

@smarterclayton
Contributor

smarterclayton commented Dec 8, 2015 via email

@smarterclayton
Contributor

Also, if you create and delete pods rapidly, you may incur a high level of conflict amplification (since initial pod create is followed by several writes from scheduler and Kubelet). Multiple clients are writing and retrying in that window.
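A hedged illustration of the "writing and retrying" pattern described above, using client-go's retry.RetryOnConflict helper. The annotation key and function name are hypothetical, and the real scheduler/kubelet code paths differ, but the point stands: each conflict costs an extra GET plus another write attempt against the apiserver, and hence etcd.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// annotatePod is a hypothetical writer racing with other components on a
// freshly created pod. Every resourceVersion conflict triggers another
// GET + Update round trip.
func annotatePod(cs *kubernetes.Clientset, ns, name, value string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := cs.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if pod.Annotations == nil {
			pod.Annotations = map[string]string{}
		}
		pod.Annotations["example.com/note"] = value // hypothetical key
		_, err = cs.CoreV1().Pods(ns).Update(context.TODO(), pod, metav1.UpdateOptions{})
		return err // a Conflict error here makes the closure run again
	})
}
```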

@lavalamp
Member

lavalamp commented Dec 8, 2015

Thanks, to me that sounds sufficient to explain the reported difference between pod creates and deletes.

@magicwang-cn
Contributor Author

@smarterclayton @lavalamp
So compared to creating pods, deleting pods involves these additional steps, each with a multiplier (see the sketch below):
read obj before delete (https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L382)
graceful delete (https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/generic/etcd/etcd.go#L397-L432)
status change (https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet.go#L1922)
Am I right?
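A minimal client-side sketch of the graceful-delete step, using today's client-go signatures (which postdate this thread): roughly, the server first records the grace period and marks the object for deletion, the kubelet's status updates follow, and only then is the key actually removed, which is where the extra writes come from.

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodGracefully issues a graceful delete. The single call fans out on
// the server side into a read of the object, an update marking it for
// deletion, later status updates from the kubelet, and only then the final
// removal - several etcd writes for one logical "delete".
func deletePodGracefully(cs *kubernetes.Clientset, ns, name string) error {
	grace := int64(30) // illustrative grace period; 0 would force immediate deletion
	return cs.CoreV1().Pods(ns).Delete(context.TODO(), name, metav1.DeleteOptions{
		GracePeriodSeconds: &grace,
	})
}
```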

@lavalamp
Member

lavalamp commented Dec 8, 2015

Right. So the maximum safe deletion rate may be slower than the maximum safe creation rate.


@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @xiang90
So is it necessary to separate the creation rate and the deletion rate?

@lavalamp
Member

lavalamp commented Jan 4, 2016

It's a good question. I don't think it's worth the complexity cost at the moment. Ideally we would want to make an analogous change in all controllers, not just replication controller, which to me says apiserver should itself do the rate limiting and push that back onto clients with 429s.
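A sketch of that idea in generic terms, not the apiserver's actual implementation: admit requests through a server-side token bucket and answer with 429 Too Many Requests once the budget is exhausted, so clients back off instead of pushing the load straight through to etcd. The limiter values are illustrative.

```go
package sketch

import (
	"net/http"

	"golang.org/x/time/rate"
)

// withRateLimit wraps a handler and rejects over-budget requests with 429.
func withRateLimit(limiter *rate.Limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// Example wiring: allow 100 requests/s with bursts of 200.
//   handler := withRateLimit(rate.NewLimiter(100, 200), apiMux)
```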


@Random-Liu
Member

@yujuhong Is this related to what we discussed before about the graceful deletion implementation on Kubelet side?

@yujuhong
Contributor

yujuhong commented Jan 4, 2016

@Random-Liu, there is suboptimal logic in kubelet's status update that could cause extra GETs after a pod has been deleted, leading to even slower batch deletion. However, I don't think it'd affect etcd memory usage.

@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @hongchaodeng @xiang90
I had a look at etcd's code and found that once etcd receives a request, it clones the store's data and then writes the data into the response. @hongchaodeng am I right? see the code
So, as @smarterclayton said, deleting adds 2-3 writes in the happy path. The cloned data must be large; this may be the cause.
etcd_mem
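Purely as an illustration of why per-request cloning is expensive (this is not etcd's actual store code): if each read or delete of a directory deep-copies the whole subtree before serializing the response, then many concurrent requests over a large keyspace hold many full copies in memory at once, which would match the sharp heap growth in the graph above.

```go
package sketch

// node is a toy stand-in for a key-value store's in-memory tree.
type node struct {
	Key      string
	Value    string
	Children []*node
}

// clone deep-copies a node and all of its descendants.
func (n *node) clone() *node {
	c := &node{Key: n.Key, Value: n.Value}
	for _, child := range n.Children {
		c.Children = append(c.Children, child.clone())
	}
	return c
}

// serve sketches a handler that copies before responding: every request pays
// a full copy of the subtree it touches, so N concurrent requests over a big
// directory cost N full copies of that directory at the same time.
func serve(root *node) *node {
	return root.clone()
}
```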

@magicwang-cn
Contributor Author

@lavalamp @smarterclayton @hongchaodeng @xiang90 @HardySimpson
What's more, when the deletion begins, the Prometheus data from etcd looks as follows (showing rapid growth):
etcd_goroutines
etcd_mem_alloc
etcd_mem_heap_inuse
etcd_mem_malloc
etcd_resident_mem
etcd_whole

@lavalamp
Member

lavalamp commented Jan 6, 2016

Thanks for the charts. Judging from the stats, it looks like deletes are not the culprit, since there were only 1000 of them. There's almost 200k compare&swaps. So I think our "flurry of writes right before delete" theory is probably correct. That, or deletes are REALLY expensive.


@xiang90
Contributor

xiang90 commented Jan 6, 2016

@magicwang-cn Can you also get the store operation (reads, writes, deletes, etc.) rate vs time graph? Then we can probably tell which operation causes the usage spike.

@xiang90
Contributor

xiang90 commented Jan 6, 2016

And it seems the time scales of the graphs are not the same.

@magicwang-cn
Contributor Author

@magicwang-cn Can you also get the store operation (reads, writes, deletes, etc.) rate vs time graph? Then we can probably tell which operation causes the usage spike.

OK, I will do this as soon as possible.

And it seems the time scales of the graphs are not the same.

No, they are the data from the same period.

@xiang90 @lavalamp

@magicwang-cn
Contributor Author

@xiang90 @lavalamp @hongchaodeng @smarterclayton @HardySimpson
These are the store operation graphs:
etcd_reads_total
etcd_watch_request
etcd_writes_compareandswap
etcd_writes_create
etcd_writes_delete
etcd_writes_set

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn It seems like the 1000 concurrent deletes were killing etcd. Can you find a way to rate-limit them and give it another try? Or can you try giving more memory to etcd to see if it survives?

@magicwang-cn
Contributor Author

@xiang90
Yeah, the apiserver's rate limit may solve this problem; I am planning to test it (changing the --kube-api-qps param of the controller-manager from 20 to 10).
But having every request do a clone operation may not be a good approach, right? Or could etcd itself rate-limit in order to keep surviving?

@magicwang-cn
Contributor Author

@xiang90 @lavalamp
With the first method (changing the --kube-api-qps param of the controller-manager from 20 to 10), the etcd problem still exists.

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn If you look at the graph you generated, the delete rate is apparently much higher than 20 or 10. It is on the order of 100.

@magicwang-cn
Contributor Author

@magicwang-cn If you look at the graph you generated, the delete rate is apparently much higher than 20 or 10. It is on the order of 100.

@xiang90 No, it is just about 10/s, as you can see below:
etcd_delete_begin
etcd_delete_end

@xiang90
Contributor

xiang90 commented Jan 7, 2016

@magicwang-cn I still cannot see from the graph you just showed me why it is 10. I think you showed me the wrong graph? There are action, group, instance, and job on the graph, but no rate.

@magicwang-cn
Contributor Author

/cc @davidopp @alfred-huangjian

@smarterclayton
Contributor

Memory in etcd and master server will continue to be optimized through protobuf over the next few releases, so we'll be reducing a significant amount of I/O and garbage. Etcd3 should also reduce total memory use.

@timothysc
Member

@hongchaodeng @magicwang-cn you should see the error in the logs - is the log data available?

Also, if your theory is correct, we should be able to repro it with a UT on the cache layer.

@magicwang-cn
Contributor Author

@smarterclayton is that already in the plan?

@timothysc the error in the log is (too old resourceVersion)

@smarterclayton
Contributor

smarterclayton commented Jan 20, 2016 via email

@lavalamp
Member

@hongchaodeng

Could anyone help explain why the client needs to relist if the watch returns an error? It sounds to me that the client is behaving too aggressively.

If a watch returns an error, that means changes have been made since the client listed. If the client needs to have a 100% correct state, it must then relist. Otherwise, it may never return to a correct state (imagine one of the missed changes was a deletion).

A flood of lists is quite possibly a cause of the problem, but it's not the ultimate cause. The question is why did the previous watch close? Whatever causes watches to close prematurely is the thing that needs to be fixed.

@hongchaodeng
Contributor

@lavalamp
See my analysis above.

why did the previous watch close

There are three parties involved:

  • client
  • cacher with a reflector updating watch cache
  • watch cache inside cacher

Here's the detailed error analysis:

  • The watch cache holds a fixed-size queue of data.
  • The watch cache is updated really fast because there are a lot more state changes than it can hold.
  • The cacher returns an error in Watch() because the watch cache no longer contains the old version of the data.
  • The client watch is closed because the cacher returns an error (too old resource version).
  • The client relists after the watch is closed, and that causes a flood of lists to etcd (a minimal sketch of this relist loop follows below).
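A minimal sketch of the client loop behind that chain (assumed names, modern client-go signatures, not the real reflector): when the watch dies with a "too old resource version" error, the client falls back to a full relist, and many clients doing this at once is what floods etcd.

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAndWatchPods keeps a pod view up to date, relisting whenever the watch
// window is lost. The relist is the expensive part when many clients do it.
func listAndWatchPods(ctx context.Context, cs *kubernetes.Clientset, ns string) error {
	for {
		pods, err := cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{}) // full relist
		if err != nil {
			return err
		}
		rv := pods.ResourceVersion // resume watching from the list's version

		w, err := cs.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{ResourceVersion: rv})
		if err != nil {
			if apierrors.IsGone(err) { // "too old resource version": window lost, relist
				continue
			}
			return err
		}
		for ev := range w.ResultChan() {
			fmt.Println("event:", ev.Type)
		}
		// The watch channel closed; loop back, relist, and re-establish the watch.
	}
}
```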

@hongchaodeng
Contributor

FYI, I'm trying to reproduce it in UT.

@lavalamp
Member

@hongchaodeng I was responding to your analysis.

There are two cases:

a) a list has just completed and we're attempting to establish a watch.
b) a watch has been initiated and we're trying to keep up.

The situation you're outlining is a potential problem for case a), when we're trying to get a watch started. But that means we've just done a list for some reason-- why did we need to do that list at all? My claim is that something must have gone wrong in steady state, b).

@hongchaodeng
Contributor

@gmarek @timothysc

I talked with @lavalamp today w.r.t. this issue. Let me give a summary:

  • Daniel is worried that the cluster state has somehow become unsteady. That's the point we should figure out, and we should keep the state steady.
  • Daniel pointed out that even though the cache size has a limit, the current etcd v2 also has a limit on its history queue. Fortunately, etcd v3 has a theoretically infinite history (LSM tree) and different semantics. So I will research this topic further and give more analysis.
  • Increasing the cache size should be a quick duct-tape fix. @magicwang-cn told me he's OK with that.

@hongchaodeng
Contributor

In the meantime, I could verify:

  • the Watch() error of too old resource version can happen
  • List() going directly to etcd happened here (the return c.storage.List(ctx, key, resourceVersion, filter, listObj) line) for a number of resources like "services/specs", "pods", "minions", "nodes".

I still think the outdated watch version leads to this problematic unsteady state. Unfortunately I couldn't find the client logic responsible. I will continue to research potential clients and give a whole-story analysis of the client logic as well.

@lavalamp
Member

List of pods should be served out of the cache. The line you linked to is for clients that don't set the resource version; our own components ought to be setting the resource version.


@magicwang-cn
Contributor Author

@lavalamp yes, I see that even though Watch() failed with (too old resourceVersion), if the resourceVersion is set, List() also reads data from the cache. But from the graphs I captured before, when deleting pods the list action increases noticeably.
Is it caused by the list-watch itself?

@lavalamp
Member

@magicwang-cn I'm sorry, can you rephrase? I don't understand

@magicwang-cn
Contributor Author

@lavalamp I mean: if the resource version is set, then even though Watch() failed with too old resource version, List() still reads data from the cache.
But from the graphs I got before, when deleting pods the List() action increases noticeably (see below):

kube-apiserver list watch graph:
apiserver_listwatch_begin
apiserver_listwatch_end

So does something else cause the problem? Maybe here?

@lavalamp
Member

@magicwang-cn I still don't think I understand what you're saying. Note that the list handler returns a different resource version than it gets-- if you send "0", it sends back readResourceVersion, which should be the resource version the cache is actually at. If you then do a watch from readResourceVersion, you should be in the window.

The charts with a huge spike in lists tell me that there was a correlated watch failure across multiple components. The lists themselves are a symptom of the underlying cause.

@timothysc
Member

@magicwang-cn what objects are you failing on in the cache? Is it consistent?

xref: #19970

@magicwang-cn
Contributor Author

@lavalamp @hongchaodeng @timothysc
Today I found the clients that send List() requests without setting resourceVersion=0. They are mainly in the density test itself (especially DeleteRC, which lists pods from etcd and takes a long time), for example:
DeleteRC (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/util.go#L1921)
start (https://github.com/kubernetes/kubernetes/blob/master/test/e2e/util.go#L1694)

I also printed a log behind this line (https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/cacher.go#L283) and found that there are still many clients that use List() requests without setting resourceVersion.

So what is the reasoning behind this judgment (https://github.com/kubernetes/kubernetes/blob/master/pkg/storage/cacher.go#L281-#L285)?
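A small sketch of the client-side fix being discussed (modern client-go signatures; the e2e helpers at the time built their own options): passing ResourceVersion "0" tells the apiserver that any recent version is acceptable, so the list can be answered from its watch cache instead of reading the whole key range from etcd.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodsFromCache lists pods in a way the apiserver can serve from its
// watch cache. Leaving ResourceVersion empty is what forces the list to go
// directly to etcd, which is the expensive path seen in the graphs above.
func listPodsFromCache(ctx context.Context, cs *kubernetes.Clientset, ns string) (*corev1.PodList, error) {
	return cs.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{ResourceVersion: "0"})
}
```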

@lavalamp
Member

It would be useful to log the client there so we can get a list of which components need to be updated.

I think only clients on every node really matter-- I don't think the e2e tests themselves will make enough traffic to harm the server, but I still think it'd be good to fix them.


@timothysc
Member

It would be useful to log the client there so we can get a list of which components need to be updated.

+1, also I wouldn't mind a V(4) log on the line above to diagnose.

@wojtek-t
Member

wojtek-t commented Feb 3, 2016

Sorry for jumping late into discussion - I was on paternity leave for a month.

Generally, I kind of agree with Daniel: if increasing the cache size helps, we have a much more serious problem, and we shouldn't fix it by changing cache sizes.

  • The watch cache holds a fixed-size queue of data.
  • The watch cache is updated really fast because there are a lot more state changes than it can hold.
  • The cacher returns an error in Watch() because the watch cache no longer contains the old version of the data.
  • The client watch is closed because the cacher returns an error (too old resource version).
  • The client relists after the watch is closed, and that causes a flood of lists to etcd.

The question is exactly why the Cacher returns an error in Watch - it definitely shouldn't.
Basically, a watch event is emitted to the watchers synchronously with reading it from etcd, so this should never happen.
I'm happy to help with debugging this, but I need a better way of reproducing it.

@spzala
Member

spzala commented Jul 5, 2017

@magicwang-cn hello, since this issue is over a year old and also etcd 3.2 is out - can we please verify if this issue is still valid? If not, can we please close it? Related issue #20540 has a comment that etcd3 should solve problems related to large cluster. Thanks!

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 30, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
