help request: APISIX etcd going into CrashLoopBackOff #11338

Open
Lakshmi2k1 opened this issue Jun 6, 2024 · 26 comments

@Lakshmi2k1

Description

Hello,
I have deployed the APISIX 2.7.0 Helm chart, and out of three etcd pods, two are going into CrashLoopBackOff, which affects the ingresses created for other deployments.

[screenshot: pod status showing two of three etcd pods in CrashLoopBackOff]

The logs show the following details:

Master (the etcd pod in Running state):
"msg":"rejected stream from remote peer because it was removed","local-member-id"

Other pods (the etcd pods in CrashLoopBackOff):
"failed to publish local member to cluster through raft","local-member-id":"2c16fb63879f0d98","local-member-attributes":"{Name:apisix-etcd-1 ClientURLs:[http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2379/ http://apisix-etcd.apisix.svc.cluster.local:2379]}","request-path":"/0/members/2c16fb63879f0d98/attributes","publish-timeout":"7s","error":"etcdserver: request cancelled"

I'm currently stuck on this; let me know if anyone has faced this and found a fix.

Environment

  • APISIX version (run apisix version): 2.7.0
@kayx23
Member

kayx23 commented Jun 6, 2024

Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

@flearc
Contributor

flearc commented Jun 6, 2024

I think you need to try etcdctl member list first. This will help you verify whether the member ID of the crashing pods matches the IDs that etcdctl reports.
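
Since etcd runs as pods here, the usual place to run that is inside one of the healthy etcd pods; a minimal sketch, assuming the apisix namespace seen in the logs above:

kubectl -n apisix exec -it <running-etcd-pod> -- etcdctl member list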

@Lakshmi2k1
Author

Do you have a strong requirement to use 2.7.0? I'm using the latest and pods are starting normally.

The most recent version is 2.8.0 (released Jun 04, 2024), so I was using the version prior to that, which was released in April. May I know which version of the Helm chart you're using?

@Lakshmi2k1
Author

I think you need to try etcdctl member list first. This will help you verify whether the member ID of the crashing pods matches the IDs that etcdctl reports.

If it were the cluster's own etcd, we would log in to the node and execute the commands. Since here it runs as a pod, I'm not sure where to execute the etcdctl commands, and as the pods are in CrashLoopBackOff, I can't even exec into them.
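
One thing that works even for crash-looping pods: you can read the logs of the last failed container without exec'ing in. A sketch, assuming the pod names from the logs above:

kubectl -n apisix logs apisix-etcd-1 --previous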

@flearc
Contributor

flearc commented Jun 7, 2024

There is one running etcd pod; run etcdctl member list after exec'ing into it. Also check the logs of the crashed etcd pods, which normally record the member ID they used.

BTW, I think it's more likely an etcd problem.

@Lakshmi2k1
Author

There is one running etcd pod; run etcdctl member list after exec'ing into it. Also check the logs of the crashed etcd pods, which normally record the member ID they used.

BTW, I think it's more likely an etcd problem.

Hello,
I tried it; this is what I got:

I have no name!@apisix-etcd-2:/opt/bitnami/etcd$ etcdctl member list
3ff1b5cd453a87df, started, apisix-etcd-2, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2380, http://apisix-etcd-2.apisix-etcd-headless.apisix.svc.cluster.local:2379,http://apisix-etcd.apisix.svc.cluster.local:2379, false

This is the member ID I found in the logs of the crashing pods: local-member-id "2c16fb63879f0d98". It does not appear in the member list above. I also tried disabling the APISIX-managed etcd and using an external etcd, but it was not able to integrate with the running etcd pod. I'm trying to fix that; please share anything you know that could help.
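
For what it's worth: since the crashing pod's ID (2c16fb63879f0d98) is absent from the member list, that member seems to have been removed from the cluster, which matches the leader's "rejected stream from remote peer because it was removed" message. A common recovery path for this situation (a sketch only, not specific to this chart; the PVC name follows the usual StatefulSet data-<pod> pattern and is an assumption):

# From the healthy pod, re-register the removed member under its peer URL
kubectl -n apisix exec -it apisix-etcd-2 -- etcdctl member add apisix-etcd-1 \
  --peer-urls=http://apisix-etcd-1.apisix-etcd-headless.apisix.svc.cluster.local:2380

# Wipe the stale data so the pod rejoins with the existing cluster state
kubectl -n apisix delete pvc data-apisix-etcd-1
kubectl -n apisix delete pod apisix-etcd-1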

@Thilip707

Any solutions for this issue? I am also facing the same issue for the last 3 days.

@Lakshmi2k1
Author

Any solutions for this issue? I am also facing the same issue for the last 3 days.

I have changed the etcd version in Chart.yaml to "10.1.0", and now all pods are in the Running state. I'm checking a few things in the UI to make sure everything is working fine. If you are using the Helm chart to deploy APISIX, try this.
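
For anyone else trying this: the change goes in the dependencies section of the chart's Chart.yaml. A sketch of what it looks like; the dependency name, repository, and condition are assumptions based on the usual layout of the APISIX chart:

dependencies:
  - name: etcd
    repository: https://charts.bitnami.com/bitnami
    version: 10.1.0
    condition: etcd.enabled

Run helm dependency update afterwards so the new sub-chart is pulled in before you redeploy.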

@Thilip707

Thanks, will try and update, ma'am.

@Lakshmi2k1
Author

APISIX is working fine after upgrading the etcd version in Chart.yaml to "10.1.0", so I'm closing this issue.

@sudhir649

sudhir649 commented Jul 15, 2024

Hi @Lakshmi2k1, are you still facing the same issue? I need some suggestions on it. We upgraded to 10.2.6 but are still facing the same issue.

@BadTorro

Still having the same issue. We downloaded and added the entire chart directory, setting the etcd version in Chart.yaml to "10.1.0" as suggested by @Lakshmi2k1.

Are there any plans to have this fixed?

@sudhir649

Hi @BadTorro, try enabling the disaster recovery cron job.

@BadTorro

BadTorro commented Jul 25, 2024 via email

@sudhir649

Hi @BadTorro,

I have found two solutions for it so far.

  1. As an interim workaround, delete all three PVCs and restart the pods.

  2. Check the README.md in the bitnami/etcd folder, where they explain how to enable the disaster recovery cron job. With disaster recovery enabled, a cron job takes a backup to a PVC; if more than (n-1)/2 pods are failing, the pods automatically come back to Running status with the help of the backup PVC. I implemented disaster recovery in my environment and saw 2 pods still failing while trying to come back to Running status (their logs changed too), but unfortunately they could not recover. Once the third pod failed as well, all three pods automatically came back to Running status from the backup PVC. So extract the etcd chart folder, enable the cron job in values.yaml, and redeploy; a sketch follows below.
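
For reference, a sketch of what this looks like in the Bitnami etcd values.yaml; the schedule and sizes below are placeholders to tune, and the storage class is an assumption (the snapshot PVC has to support ReadWriteMany, hence the NFS discussion below):

disasterRecovery:
  enabled: true
  cronjob:
    schedule: "*/30 * * * *"    # how often to snapshot the keyspace
  pvc:
    size: 2Gi                   # where snapshots are kept
    storageClassName: nfs       # must support ReadWriteMany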

@BadTorro

@sudhir649 thanks for the tip, I still need to verify it. It seems like I need an NFS storage provider to get the snapshot image to work.

@sudhir649

sudhir649 commented Jul 26, 2024

@BadTorro yes, Tilt works on a local machine, so you need to deploy an NFS storage class.

By the way, Lakshmi's solution didn't work for me.

@Lakshmi2k1
Author

@sudhir649 @BadTorro
Deploying a new version of etcd worked for me initially, but whenever the node is scaled down and scaled up again, one of the etcd pods goes into CrashLoopBackOff. Since the other two etcd pods were running, it wasn't affecting route and upstream creation. Still, we are about to use this in a production environment, so I wish there were a permanent fix. After reading the solution @sudhir649 pointed out, I have a few questions:
1. Won't deleting the PVCs cause loss of data that APISIX needs?
2. The second solution seems worth a try. With (n-1)/2 and 3 etcd replicas in my case, even one crashing pod should mean the disaster recovery cron runs and takes a backup of the PVC. But as you mentioned, when two pods were crashing nothing changed; only when the third pod also crashed did all three pods come back to the Running state. In my case only one, or rarely two, pods crash. If you have any inputs, let me know. Thanks in advance!

Lakshmi2k1 reopened this Jul 27, 2024
@sudhir649

@Lakshmi2k1

  1. For deleting the PVC, it depends on what data you store in it. In my case (and generally) we store only the routes, so if I delete it, the data is restored once the new PVC is created.

  2. In the documentation they say more than (n-1)/2. With 3 etcd pods, (n-1)/2 = 1, so recovery triggers when more than 1 pod fails, i.e. at least 2. Recently in our QA environment all the pods were down, so it's better to implement disaster recovery.

@Lakshmi2k1
Author

@sudhir649
Thanks Sudhir, I'll try the same from my end.

@Lakshmi2k1
Author

@sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the application's ingress, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI, so is there a way to solve this? Have you come across a similar issue before?

@Thilip707

@sudhir649, we are facing one more error in APISIX. We use the openid-connect plugin for authentication and authorization in the ApisixPluginConfig. When we hit the application's ingress, it returns a 431 (Request Header Fields Too Large) error. We tried removing a few headers, but that broke the application's UI, so is there a way to solve this? Have you come across a similar issue before?

Did you use NGINX? If you use NGINX, add this to the NGINX config:
client_max_body_size 2G;
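
A note of caution here: client_max_body_size controls the request body size (413 errors), while a 431 means the request headers are too large, which with openid-connect typically comes from big session cookies. A sketch of raising the header buffers instead, using the nginx_config snippet support in APISIX's config.yaml; the buffer sizes are assumptions to tune for your tokens:

nginx_config:
  http_configuration_snippet: |
    # allow larger request headers (OIDC session cookies can be big)
    large_client_header_buffers 8 32k;
    client_header_buffer_size 8k;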

@Lakshmi2k1
Author

@Thilip707

We are using the configuration below in the APISIX ConfigMap, as mentioned in the docs:
[screenshot: APISIX ConfigMap configuration]

@Thilip707

Just increase the client size and check; it should work.

@BadTorro

BadTorro commented Jul 30, 2024

@sudhir649 @BadTorro Deploying a new version of etcd worked for me initially, but whenever the node is scaled down and scaled up again, one of the etcd pods goes into CrashLoopBackOff. [...] If you have any inputs, let me know. Thanks in advance!

Regarding that, I managed to get it to work by:

  • deploying the Longhorn storage solution to the cluster
  • configuring Rancher Desktop based on this guide so that open-iscsi is in place and usable
  • changing the storage class in the dedicated etcd sub-chart and its related values.yaml file to "longhorn":

persistence:
  enabled: true
  storageClass: "longhorn"

  • starting everything with "tilt up"

It currently keeps running and has not crashed since. However, we are now also checking whether the Bitnami chart runs out of the box...

@Lakshmi2k1
Author

@Lakshmi2k1

  1. For deleting the PVC, it depends on what data you store in it. In my case (and generally) we store only the routes, so if I delete it, the data is restored once the new PVC is created.
  2. In the documentation they say more than (n-1)/2. With 3 etcd pods, (n-1)/2 = 1, so recovery triggers when more than 1 pod fails, i.e. at least 2. Recently in our QA environment all the pods were down, so it's better to implement disaster recovery.

I enabled disaster recovery and deployed the Helm chart, but this time it wasn't just etcd crashing: the APISIX pod was stuck in its init container, the APISIX ingress controller was crashing, and the snapshot pod was in an error state too. So I rolled back to the previous revision after observing that the pod status didn't change for a long time.
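
For anyone following along, the rollback itself is a one-liner, assuming the release is named apisix in the apisix namespace:

helm -n apisix history apisix             # find the last good revision
helm -n apisix rollback apisix <REVISION>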
