identifying & clearing "zombie" workers #787
To get the list of zombie workers we do the following: check which registered workers no longer have a live process behind them (the PID check described in this issue). To remove them nicely, in code, we run a small remove_zombies.py script; a sketch of what such a script can look like is shown below.
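A minimal sketch of such a remove_zombies.py, assuming workers run on the local host with RQ's default "hostname.PID"-style names; the clean_zombie_workers name is mine, not the original poster's:

```python
# remove_zombies.py -- a sketch, assuming workers run on this host and use
# RQ's default "<hostname>.<pid>" style names; adjust the parsing otherwise.
import os

from redis import Redis
from rq import Worker


def pid_is_alive(pid):
    """Return True if a process with this PID exists on the local machine."""
    try:
        os.kill(pid, 0)  # signal 0 does not kill, it only checks existence
        return True
    except OSError:
        return False


def clean_zombie_workers(connection):
    """Deregister every worker whose process is no longer running."""
    for worker in Worker.all(connection=connection):
        try:
            pid = int(worker.name.split('.')[1])
        except (IndexError, ValueError):
            continue  # unexpected name format, leave it alone
        if not pid_is_alive(pid):
            print('Removing zombie worker %s' % worker.name)
            worker.register_death()  # drops the worker's rq:worker:* entry


if __name__ == '__main__':
    clean_zombie_workers(Redis())
```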
This is definitely a problem. I don't have a better way of identifying zombie workers than @WeatherGod 's suggestion. I think a good first step is to write a helper that does exactly that. It should be run by the worker periodically, similar to the existing registry cleanup. Thoughts?
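A sketch of the "run it periodically from the worker" idea, assuming a recent RQ where Worker.run_maintenance_tasks() is the periodic maintenance hook (older versions call clean_registries() from the work loop instead); clean_zombie_workers is the hypothetical helper sketched above:

```python
# A sketch only: piggyback zombie cleanup on the worker's periodic maintenance.
from rq import Worker

from remove_zombies import clean_zombie_workers  # hypothetical helper from above


class ZombieCleaningWorker(Worker):
    def run_maintenance_tasks(self):
        super().run_maintenance_tasks()        # usual registry cleanup
        clean_zombie_workers(self.connection)  # also drop workers with dead PIDs
```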
I was also having this issue. In my case I always know that when my worker restarts, it should remove the zombie worker it is replacing. So in worker.py I made the worker save its job name to a file, and on startup it looks for that file, loads it, and registers its old zombie worker as dead. This has been working well for me (though I wish RQ had some way of handling this internally). Here is the code I use; my snippet contains some extra stuff specific to my project, but the relevant part is sketched below in case it is helpful to anyone.
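A trimmed-down sketch of that pattern (the file path, and using the RQ worker name rather than a platform job name, are my assumptions):

```python
# worker.py -- a sketch of the "remember my name, bury my predecessor" approach.
import os

from redis import Redis
from rq import Queue, Worker

WORKER_NAME_FILE = '/tmp/last_rq_worker_name'  # hypothetical location

conn = Redis()

# If a previous run left its name behind, register that zombie worker as dead.
if os.path.exists(WORKER_NAME_FILE):
    with open(WORKER_NAME_FILE) as f:
        old_name = f.read().strip()
    old_worker = Worker.find_by_key('rq:worker:' + old_name, connection=conn)
    if old_worker is not None:
        old_worker.register_death()

# Start the replacement worker and record its name for the next restart.
worker = Worker([Queue(connection=conn)], connection=conn)
with open(WORKER_NAME_FILE, 'w') as f:
    f.write(worker.name)
worker.work()
```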
@mhfowler Can this code sample be used for an app deployed on Heroku, where the file overwriting you mention would be complicated?
Hi. I can only confirm that I have to kill zombies myself, too. I'm using rq==0.8. No power failures. It happens from time to time.
If the workers exit because of a crash, out of memory, power outage or something else not nice, I'll get a zombie.
Before this PR was merged, zombies could also be created when a job runs for longer than the allowed time and the horse is hard-killed by the worker process. Zombie workers should also be cleared from the worker registry, so they shouldn't appear when rqinfo is run. @marcinn's screenshot shows two zombie workers; were you running rqinfo with the --interval argument?
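For reference, a sketch of pruning stale entries from every queue's worker registry, assuming an RQ version that ships rq.worker_registration.clean_worker_registry() (it drops registry entries whose rq:worker:* key no longer exists):

```python
# A sketch, assuming clean_worker_registry() is available in your RQ version.
from redis import Redis
from rq import Queue
from rq.worker_registration import clean_worker_registry

conn = Redis()
for queue in Queue.all(connection=conn):
    clean_worker_registry(queue)  # drop registry entries with no backing worker key
```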
Yes, set to 5 secs. For each queue: bin/rqinfo -i 5 <QUEUE> -u redis://<IP>:6379/<DB>
Ok, the fix for this would be to clean the registry for each refresh.
… On 17 Mar 2020, at 07.19, Marcin Nowak wrote:
were you running rqinfo with --interval argument?
Yes, set to 5 secs.
For each queue:
bin/rqinfo -i 5 <QUEUE> -u redis://<IP>:6379/<DB>
I'm investigating the zombie workers issue. I've found that zombie workers have less metadata stored in their Redis key, and they appear regardless of how the task ends. The first two are zombie workers, the third is an active worker:
I wonder why the first two keys aren't expired. Maybe some function which updates the worker status is setting these values after the key has expired?
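One way to check this from the Redis side, a sketch using redis-py (key pattern per RQ's rq:worker: prefix): zombie candidates are the keys whose TTL is -1, i.e. no expiry is set.

```python
# List every worker key with its TTL and stored fields; keys with ttl == -1
# will never expire on their own, which matches the zombie behaviour above.
from redis import Redis

conn = Redis()
for key in conn.scan_iter(match='rq:worker:*'):
    print(key, 'ttl=%s' % conn.ttl(key), 'fields=%s' % sorted(conn.hgetall(key)))
```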
That's a really good find. I think you're right; judging from the content of the key, it should be related to the stats-keeping mechanism.
So I have found a plausible explanation for this. Worker keys' TTLs are routinely extended while jobs are being performed, here: https://github.com/rq/rq/blob/master/rq/worker.py#L690 . The TTL used there by default is 35 seconds (30 + 5). Worker stats are updated in two locations: in handle_job_failure() and in handle_job_success(). The fact that the keys are missing most of their information indicates that the key had already expired from Redis before the stats were written. If a job is killed because of a timeout, the worker would only have about 5 seconds to finish its cleanup before the key expires. I think a quick fix for this would be to increase the TTL used there. Do you mind opening a PR for this?
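If that explanation is right, the underlying Redis behaviour is easy to reproduce: once the worker key has expired, the stats write recreates it as a brand-new key with no TTL, so it never goes away. A small demonstration with a made-up key name:

```python
# Demonstrates how a stats write on an already-expired worker key leaves a
# TTL-less hash behind (the sparse "zombie" keys observed above).
from redis import Redis

conn = Redis()
key = 'rq:worker:demo.12345'   # made-up key name

conn.hset(key, 'state', 'busy')
conn.expire(key, 35)           # the 30 + 5 second TTL used while monitoring a job
conn.delete(key)               # simulate the key expiring before stats are written

conn.hincrby(key, 'failed_job_count', 1)   # roughly what the stats update does
print(conn.ttl(key))           # -1: the recreated key has no expiry -> a zombie
```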
@marcinn did you check if your PR solves your problem?
Can you try changing this line to exactly this?
Yes, I'll try. |
Thanks!
… On 1 Apr 2020, at 15.06, Marcin Nowak wrote:
Yes, I'll try.
I had a power failure at work today, which caused an unclean shutdown of a machine I had that was actively running rq jobs. When power came back and I started redis back up again, rq info listed the workers that existed before the power failure (modulo unsaved info, which I am fine with). I am calling these workers "zombie workers". They exist only in redis: they do not have a corresponding running process, but redis-queue thinks they exist for the moment.
Now, any workers that were not busy at the time of the last save show up as "idle", and it seems like they eventually get automatically cleared out after 7 minutes (the 420 second worker_ttl, I am guessing). However, the workers that were "busy" at the time of the last save do not get automatically cleared out. This causes issues because the jobs they were holding don't get failed over to the failed queue. They just sit there.
So, I have two problems. 1) I don't have a very clean way to automatically identify these zombie processes. The best I have been able to do is to parse their name for the PID, then run ps -p PID and see if it comes back with a command name of "rq". If it doesn't, then I assume it is a zombie. 2) I don't have a clean way to clear them out either: calling j.request_force_stop() or messing with the heartbeat doesn't seem to do the trick. It'll clear the worker out, but the job doesn't get failed over. Thoughts?
My employer is willing for me to devote a little bit of time to implement some code to help address this problem, if you are receptive to it.