Deadlock after SIGCHLD signal handling #3004

Open
andre-b-fernandes opened this issue Jun 1, 2023 · 1 comment
andre-b-fernandes commented Jun 1, 2023

Description

Environment

Python 3.9
Gunicorn 20.1.0

Current state

I have an application running under Gunicorn with 5 gthread workers.
I'm using a framework that in turn calls the Arbiter.run function. Before calling that function I start a separate thread which sleeps for a period of time, performs some logic, and then signals the arbiter process (os.getpid()) with SIGHUP.
I implemented this in two different ways (both exhibit the same issue, described in a section below); a sketch of the first variant follows the list.

  1. Using an infinite while True loop.
  2. Removing the loop and starting the thread at the beginning (pre-fork) and again on each `on_reload` server hook.
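Roughly, the first variant looks like this (do_periodic_work is a placeholder for the actual logic; the interval is illustrative):

```python
import os
import signal
import threading
import time

def do_periodic_work():
    # placeholder for the logic the thread performs before each reload
    pass

def periodic_reload(interval):
    # Variant 1: infinite while True loop in a background thread
    while True:
        time.sleep(interval)
        do_periodic_work()
        os.kill(os.getpid(), signal.SIGHUP)  # SIGHUP the arbiter process to trigger a reload

threading.Thread(target=periodic_reload, args=(60,), daemon=True).start()
# ... the framework then calls Arbiter.run() in this same process
```

In the second variant the while True loop is removed and the thread is started once before the arbiter boots and again from the `on_reload` server hook.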

I'm also using the child_exit and worker_exit server hooks, each of which contains a log statement.
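The hooks themselves are just log statements, along these lines (the exact messages are illustrative):

```python
# gunicorn.conf.py
def child_exit(server, worker):
    # called in the arbiter (master) process when a dead worker is reaped
    server.log.info("child_exit: reaped worker with pid %s", worker.pid)

def worker_exit(server, worker):
    # called in the worker process as the worker exits
    server.log.info("worker_exit: worker with pid %s exiting", worker.pid)
```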

Behavior

We've had this issue in 2 different scenarios.

  1. When using the max-requests and max-requests-jitter configurations (settings shown below) - after some successful auto-restarts, at a random point in time workers die after serving the maximum number of requests and no new workers are booted up.
  2. After removing those configurations - after some successful reloads, at a random point in time my separate thread makes no further progress and no new reloads are made.
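For scenario 1 the restart settings were along these lines (values illustrative):

```python
# gunicorn.conf.py (equivalent to --max-requests / --max-requests-jitter)
max_requests = 1000
max_requests_jitter = 100
```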

What I've noticed is odd behavior in the reap_workers function whenever a SIGCHLD has to be handled.
In that function the arbiter loops over dead child process ids and later calls cfg.child_exit, so my log statement should be printed for each dead child. What I find suspicious is that each time the reloads stop happening, the preceding reload did not print my cfg.child_exit log statement for every dead child process id (for some process ids that log is missing). However, I do see the log statement in cfg.worker_exit, which is called without fail for all process ids, indicating that those processes did in fact terminate.
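For context, my reading of reap_workers boils down to roughly the following (a heavily simplified paraphrase, not the actual Gunicorn 20.1.0 source; error handling and exit-status checks omitted):

```python
import os

def reap_workers(arbiter):
    # simplified sketch: reap dead children and fire child_exit for each one
    while True:
        wpid, _status = os.waitpid(-1, os.WNOHANG)  # non-blocking reap of any exited child
        if not wpid:
            break
        worker = arbiter.WORKERS.pop(wpid, None)
        if worker is None:
            continue  # a pid the arbiter no longer tracks
        worker.tmp.close()
        arbiter.cfg.child_exit(arbiter, worker)  # the hook whose log is sometimes missing
```

In a sketch like this, a pid that is no longer in WORKERS when it is reaped would simply not get a child_exit call, which would match the missing log lines.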

Steps to reproduce

  1. Create a script which calls the Arbiter.run function.
  2. In the same process, create and start a second thread that periodically reloads the arbiter (by sending it SIGHUP).
  3. Wait some time and check whether, at a random point, a "deadlock/livelock" situation occurs in which no new reloads are made and the arbiter makes no progress.

You need to preload the app.
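A minimal, self-contained reproduction sketch (the framework wrapper is replaced here by a trivial BaseApplication subclass, which I believe is equivalent for this purpose; the WSGI handler and interval are placeholders):

```python
import os
import signal
import threading
import time

from gunicorn.app.base import BaseApplication
from gunicorn.arbiter import Arbiter

def wsgi_app(environ, start_response):
    # trivial WSGI handler standing in for the real application
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

class ReproApp(BaseApplication):
    # minimal stand-in for the framework that normally builds the Gunicorn config
    def load_config(self):
        self.cfg.set("workers", 5)
        self.cfg.set("worker_class", "gthread")
        self.cfg.set("preload_app", True)

    def load(self):
        return wsgi_app

def periodic_reload(interval=60):
    # step 2: a second thread that periodically reloads the arbiter
    while True:
        time.sleep(interval)
        os.kill(os.getpid(), signal.SIGHUP)

if __name__ == "__main__":
    threading.Thread(target=periodic_reload, daemon=True).start()
    Arbiter(ReproApp()).run()  # step 1: same call the framework makes
```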

andre-b-fernandes (Author) commented

I've also noticed that I don't get the same error if I have only one worker, or if I add a sleep in the manage_workers function before a worker is killed (when the arbiter kills the excess older workers).

benoitc self-assigned this Jul 10, 2023