When increasing the kube-api-qps param in kube-controller-manager, etcd 2.2.2 runs out of memory #18266
Comments
@xiang90 @hongchaodeng @gmarek
If you are changing this param - then yes it can cause problems. Default values are not set accidentally - those are the values that we think work relatively well.
Sorry - I'm afraid I didn't understand this comment. Can you please clarify?
@wojtek-t
@magicwang-cn - hmm... I'm not sure.
@xiang90 @hongchaodeng - are folks also working on etcd - any thoughts on it?
During my tests I also saw growth in etcd resource usage during deletion, though it was never that bad. AFAICT pod addition and deletion are symmetric operations, so there shouldn't be any difference on our part (but I may be missing something @lavalamp @mikedanese). To make your setup crystal clear: you're running 5 nodes for etcd and 1 node for the master (API server etc.), right? We're generally not running distributed etcd now, so it may be some trait of Raft. Note that I'm shooting in the dark here.
I don't think it's related to watch, because there is no difference between a create & delete "event" from the watch perspective...
We now do "graceful" deletion by default-- I'm not sure, but that may add some multiplier for etcd reads/writes. @smarterclayton ?
Yeah it adds 2-3 writes in the happy path (roughly the sequence sketched below):
Read object (instead of previous blind delete)
Write graceful delete interval, timestamp
Kubelet writes status change (container shutdown)
Kubelet invokes hard delete
Depending on how Kubelet shuts down it can do multiple status writes.
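For illustration, here is a minimal Go sketch of that sequence against a bare etcd 2.x keys API (github.com/coreos/etcd/client). The key layout and JSON payloads are invented for the example and are not the real apiserver storage format; the point is only that one graceful delete fans out into a read plus several writes where a blind delete would have been a single operation.

```go
// Hedged sketch: the rough etcd v2 traffic generated by one graceful pod
// deletion, per the sequence above. Keys and values are illustrative.
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)
	ctx := context.Background()
	key := "/registry/pods/default/demo-pod" // illustrative key, not the real layout

	// 1. Read the object (a blind delete would have skipped this).
	if _, err := kapi.Get(ctx, key, nil); err != nil {
		log.Fatal(err)
	}
	// 2. Write the graceful-deletion interval and timestamp back to the object.
	if _, err := kapi.Set(ctx, key, `{"deletionTimestamp":"2016-01-06T00:00:00Z","deletionGracePeriodSeconds":30}`, nil); err != nil {
		log.Fatal(err)
	}
	// 3. Kubelet writes at least one status change (container shutdown);
	//    depending on how it shuts down there can be several of these.
	if _, err := kapi.Set(ctx, key, `{"status":{"phase":"Terminating"}}`, nil); err != nil {
		log.Fatal(err)
	}
	// 4. Kubelet finally issues the hard delete.
	if _, err := kapi.Delete(ctx, key, nil); err != nil {
		log.Fatal(err)
	}
}
```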
Also, if you create and delete pods rapidly, you may incur a high level of conflict amplification (since the initial pod create is followed by several writes from the scheduler and Kubelet). Multiple clients are writing and retrying in that window.
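A hedged sketch of that retry amplification, again against the etcd 2.x keys API: each writer does a read-modify-write guarded by a compare-and-swap on the node's modified index, and every lost race costs an extra read and another write attempt. The key and values are illustrative only.

```go
// Hedged sketch of conflict amplification: several writers (scheduler,
// kubelet, controllers) racing on the same key with CAS and retrying.
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/client"
)

func updateWithRetry(kapi client.KeysAPI, key, newValue string) error {
	for {
		resp, err := kapi.Get(context.Background(), key, nil)
		if err != nil {
			return err
		}
		// CAS: only write if nobody else modified the key since our read.
		_, err = kapi.Set(context.Background(), key, newValue,
			&client.SetOptions{PrevIndex: resp.Node.ModifiedIndex})
		if err == nil {
			return nil
		}
		if cerr, ok := err.(client.Error); ok && cerr.Code == client.ErrorCodeTestFailed {
			continue // lost the race; re-read and retry - this is the amplification
		}
		return err
	}
}

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	if err := updateWithRetry(client.NewKeysAPI(c), "/registry/pods/default/demo-pod", `{"status":"Bound"}`); err != nil {
		log.Fatal(err)
	}
}
```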
Thanks, to me that sounds sufficient to explain the reported difference between pod creates and deletes.
@smarterclayton @lavalamp
Right. So the maximum safe deletion rate may be slower than the maximum …
@lavalamp @smarterclayton @xiang90
It's a good question. I don't think it's worth the complexity cost at the …
@yujuhong Is this related to what we discussed before about the graceful deletion implementation on the Kubelet side?
@Random-Liu, there is suboptimal logic in kubelet's status update that could cause extra GETs after a pod has been deleted, leading to even slower batch deletion. However, I don't think it'd affect the etcd memory usage.
@lavalamp @smarterclayton @hongchaodeng @xiang90
@lavalamp @smarterclayton @hongchaodeng @xiang90 @HardySimpson
Thanks for the charts. Judging from the stats, it looks like deletes are …
@magicwang-cn Can you also get the store operation (reads, writes, deletes, etc.) rate vs time graph? Then we can probably know which operation causes the spike in usage.
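One way to produce such a graph is to poll etcd's store statistics and plot the per-interval deltas. This is a sketch that assumes etcd 2.x exposes its store counters at /v2/stats/store (the counter names are whatever that endpoint returns; nothing is hardcoded here):

```go
// Hedged sketch: polling etcd 2.x's /v2/stats/store endpoint and printing
// the per-interval delta of each counter, i.e. the store operation rate.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func storeStats(url string) (map[string]int64, error) {
	resp, err := http.Get(url + "/v2/stats/store")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	stats := map[string]int64{}
	return stats, json.NewDecoder(resp.Body).Decode(&stats)
}

func main() {
	const url = "http://127.0.0.1:2379"
	prev, _ := storeStats(url)
	for range time.Tick(10 * time.Second) {
		cur, err := storeStats(url)
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		// Print the change in each counter since the last poll.
		for k, v := range cur {
			if d := v - prev[k]; d != 0 {
				fmt.Printf("%s: %d ops / 10s\n", k, d)
			}
		}
		prev = cur
	}
}
```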
And it seems the time scales of the graphs are not the same.
OK, I will do this as soon as possible.
No, they are the data for the same period.
@xiang90 @lavalamp @hongchaodeng @smarterclayton @HardySimpson
@magicwang-cn It seems like the 1000 concurrent deletes were killing etcd. Can you find a way to rate limit it and give it another try? Or can you try to give more memory to etcd to see if it survives?
@xiang90
@magicwang-cn If you look at the graph you generated, the delete rate is apparently much higher than 20 or 10. It is on the order of 100.
@xiang90 No, just about 10/s, as you can see from the following:
@magicwang-cn I still cannot see why it is 10 from the graph you just showed me. I think you showed me the wrong graph? There are action, group, instance, and job on the graph, but no rate.
Memory in etcd and the master server will continue to be optimized through protobuf over the next few releases, so we'll be reducing a significant amount of I/O and garbage. Etcd3 should also reduce total memory use.
@hongchaodeng @magicwang-cn you should see the error in the logs - is the log data available? Also, if your theory is correct, we should be able to repro with a UT on the cache layer.
@smarterclayton already in the plan? @timothysc the log is (…)
It'll be 1.3 at this point, merely experimental until then.
If a watch returns an error, that means changes have been made since the client listed. If the client needs to have a 100% correct state, it must then relist. Otherwise, it may never return to a correct state (imagine one of the missed changes was a deletion). A flood of lists is quite possibly a cause of the problem, but it's not the ultimate cause. The question is why did the previous watch close? Whatever causes watches to close prematurely is the thing that needs to be fixed.
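For concreteness, here is a minimal sketch of that list-then-watch loop against the etcd 2.x keys API. The real Kubernetes clients talk to the apiserver rather than etcd directly, so treat this as an illustration of the pattern, not the actual client code.

```go
// Hedged sketch: when the watch breaks (e.g. its index has been compacted
// out of etcd's event window), the client must relist to get back to a
// correct state, then resume watching from the index returned by the list.
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/client"
)

func listThenWatch(kapi client.KeysAPI, key string) {
	for {
		// Relist: fetch the full current state and the index to resume from.
		resp, err := kapi.Get(context.Background(), key, &client.GetOptions{Recursive: true})
		if err != nil {
			log.Printf("list failed: %v", err)
			time.Sleep(time.Second)
			continue
		}
		log.Printf("relisted %d entries at index %d", len(resp.Node.Nodes), resp.Index)

		w := kapi.Watcher(key, &client.WatcherOptions{AfterIndex: resp.Index, Recursive: true})
		for {
			ev, err := w.Next(context.Background())
			if err != nil {
				// e.g. "event index cleared": we missed events and must relist.
				log.Printf("watch broke: %v; relisting", err)
				break
			}
			log.Printf("%s %s", ev.Action, ev.Node.Key)
		}
	}
}

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		log.Fatal(err)
	}
	listThenWatch(client.NewKeysAPI(c), "/registry/pods")
}
```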
@lavalamp
There are three identities:
Here's detailed error analysis:
FYI, I'm trying to reproduce it in a UT.
@hongchaodeng I was responding to your analysis. There are two cases: a) a list has just completed and we're attempting to establish a watch. The situation you're outlining is a potential problem for case a), when we're trying to get a watch started. But that means we've just done a list for some reason-- why did we need to do that list at all? My claim is that something must have gone wrong in steady state, b).
I talked with @lavalamp today w.r.t. this issue. Let me give a summary:
In the meantime, I could verify:
I still think that an outdated watch version leads to this problematic unsteady state. Unfortunately I couldn't find where the client logic is. I will continue to research potential clients and do a whole-story analysis of the client logic as well.
List of pods should be served out of the cache. The line you linked to is …
@magicwang-cn I'm sorry, can you rephrase? I don't understand.
@lavalamp I mean, if the resource version is set, then even though the watch() failed with "too old resource version", the list() still reads data from the cache - so could something somewhere else be causing the problem? Maybe here?
@magicwang-cn I still don't think I understand what you're saying. Note that the list handler returns a different resource version than it gets-- if you send "0", it sends back … The charts with a huge spike in lists tell me that there was a correlated watch failure across multiple components. The lists themselves are a symptom of the underlying cause.
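A small sketch of that distinction, written with today's client-go (which postdates this thread, so the types are not what the 1.x clients used): the list is sent with resourceVersion "0" so the apiserver may serve it from its cache, and the watch is started from the resourceVersion returned in the list response, not the one that was sent.

```go
// Hedged sketch: list at resourceVersion "0", then watch from the version
// the list response returned. A "too old resource version" error on the
// watch means the client must relist.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// "0" means "any version is fine"; the apiserver can answer from its cache.
	pods, err := cs.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("listed %d pods at resourceVersion %s", len(pods.Items), pods.ResourceVersion)

	// Resume watching from the version the list returned. If that version has
	// fallen out of the watch window, the watch fails and the client relists.
	w, err := cs.CoreV1().Pods("").Watch(context.Background(), metav1.ListOptions{ResourceVersion: pods.ResourceVersion})
	if err != nil {
		log.Fatal(err)
	}
	defer w.Stop()
	for ev := range w.ResultChan() {
		log.Printf("event: %s", ev.Type)
	}
}
```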
@magicwang-cn what objects are you failing on in the cache? Is it consistent? xref: #19970
@lavalamp @hongchaodeng @timothysc I have printed a log behind this line and found that there are still many clients that use a List() request without setting resourceVersion. So what's the …
It would be useful to log the client there so we can get a list of which … I think only clients on every node really matter-- I don't think the e2e …
+1, also I wouldn't mind a V(4) log on the line above to diagnose.
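As an illustration of the kind of diagnostic being asked for (not the actual apiserver code path), a hypothetical middleware that logs the caller's user agent at V(4) whenever a list request arrives without a resourceVersion might look like this:

```go
// Hedged sketch: log the client identity for lists that carry no
// resourceVersion, so the clients forcing etcd-backed lists can be listed.
package main

import (
	"log"
	"net/http"

	"github.com/golang/glog"
)

func logUnversionedLists(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method == http.MethodGet && r.URL.Query().Get("resourceVersion") == "" {
			glog.V(4).Infof("list without resourceVersion: path=%s user-agent=%q remote=%s",
				r.URL.Path, r.UserAgent(), r.RemoteAddr)
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	log.Fatal(http.ListenAndServe(":8080", logUnversionedLists(http.DefaultServeMux)))
}
```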
Sorry for jumping late into the discussion - I was on paternity leave for a month. Generally, I kind of agree with Daniel that if increasing the size of the cache helps, we have a much more serious problem and we shouldn't fix it by changing cache sizes.
The question is exactly why "Cacher" returns an error in Watch - it definitely shouldn't.
@magicwang-cn hello, since this issue is over a year old and etcd 3.2 is also out - can we please verify whether this issue is still valid? If not, can we please close it? Related issue #20540 has a comment that etcd3 should solve problems related to large clusters. Thanks!
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with a /lifecycle frozen comment. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra, and/or @fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra, and/or @fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra, and/or @fejta.
@wojtek-t
I have created a k8s cluster with 400 nodes and 16000 pods [40 pods per node], using a 5-node etcd cluster [with 8G mem, 4 CPUs]. I ran the e2e density test and found an etcd problem [OOM]. These are the details:
First, this issue is definitely caused by a param of kube-controller-manager (--kube-api-qps, raised from 20 to 100, which increases the rate of pod add/delete operations; see the sketch of what this flag controls below).
Second, and this is my doubt: batch creating pods [creating an rc with 16000 replicas] does not trigger this problem, but batch deleting pods [scaling the rc to 0 replicas] does (they both use the etcd watch method).
Finally, might it be a misuse of etcd?
Details:
Deploying the 16000 pods does not trigger this problem; etcd memory use is just about 800~900M (of 8G total) and the etcd snapshot is just about 46M.
When I batch delete pods, etcd memory use increases straight up, and then it OOMs.
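For context on the flag blamed above: --kube-api-qps feeds the client-side rate limiter of the controller-manager's apiserver client, so raising it from 20 to 100 multiplies the write rate etcd has to absorb. A rough sketch with today's client-go follows; the field names are the modern ones, the 1.x code differed in detail, and the burst value is illustrative.

```go
// Hedged sketch: the client QPS/burst settings that --kube-api-qps and
// --kube-api-burst feed into on the controller-manager's API client.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	// These two fields are what the flags configure on the rest client.
	cfg.QPS = 100  // the value used in this report (the default was 20)
	cfg.Burst = 30 // illustrative; the flag default differs by release

	if _, err := kubernetes.NewForConfig(cfg); err != nil {
		log.Fatal(err)
	}
	log.Printf("client limited to %v requests/sec (burst %d)", cfg.QPS, cfg.Burst)
}
```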