Deadlock after SIGCHLD signal handling #3004

Open
andre-b-fernandes opened this issue Jun 1, 2023 · 1 comment
andre-b-fernandes commented Jun 1, 2023

Description

Environment

Python 3.9
Gunicorn 20.1.0

Current state

I have an application running under Gunicorn with 5 gthread workers.
I'm using a framework that in turn calls the Arbiter.run function. Before calling that function I start a separate thread which sleeps for a period of time, performs some logic, and then signals the arbiter process (os.getpid()) with SIGHUP.
I implemented this in two different ways (both exhibit the same issue, described in a section below); a sketch of the first variant follows the list.

  1. Using an infinite while True loop.
  2. Removing the loop and starting the thread at the beginning (pre-fork) and again on each `on_reload` server hook.
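Roughly, the first variant looks like this (do_periodic_work is a placeholder for the actual logic; the interval is illustrative):

```python
import os
import signal
import threading
import time

def do_periodic_work():
    # placeholder for the logic the thread performs before each reload
    pass

def periodic_reload(interval):
    # Variant 1: infinite while True loop in a background thread
    while True:
        time.sleep(interval)
        do_periodic_work()
        os.kill(os.getpid(), signal.SIGHUP)  # SIGHUP the arbiter process to trigger a reload

threading.Thread(target=periodic_reload, args=(60,), daemon=True).start()
# ... the framework then calls Arbiter.run() in this same process
```

In the second variant the while True loop is removed and the thread is started once before the arbiter boots and again from the `on_reload` server hook.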

I'm also using the child_exit and worker_exit server hooks, each of which contains a log statement.
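The hooks themselves are just log statements, along these lines (the exact messages are illustrative):

```python
# gunicorn.conf.py
def child_exit(server, worker):
    # called in the arbiter (master) process when a dead worker is reaped
    server.log.info("child_exit: reaped worker with pid %s", worker.pid)

def worker_exit(server, worker):
    # called in the worker process as the worker exits
    server.log.info("worker_exit: worker with pid %s exiting", worker.pid)
```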

Behavior

We've had this issue in 2 different scenarios.

  1. When using the max-requests and max-requests-jitter configurations (settings shown below) - after some successful auto-restarts, at a random point in time workers die after serving the maximum number of requests and no new workers are booted up.
  2. After removing those configurations - after some successful reloads, at a random point in time my separate thread makes no further progress and no new reloads are made.
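For scenario 1 the restart settings were along these lines (values illustrative):

```python
# gunicorn.conf.py (equivalent to --max-requests / --max-requests-jitter)
max_requests = 1000
max_requests_jitter = 100
```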

What I've noticed is odd behavior in the reap_workers function whenever a SIGCHLD has to be handled.
In that function the arbiter loops over dead child process ids and later calls cfg.child_exit, so my log statement should be printed for each dead child. What I find suspicious is that each time the reloads stop happening, the preceding reload did not print my cfg.child_exit log statement for every dead child process id (for some process ids that log is missing). However, I do see the log statement in cfg.worker_exit, which is called without fail for all process ids, indicating that those processes did in fact terminate.
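For context, my reading of reap_workers boils down to roughly the following (a heavily simplified paraphrase, not the actual Gunicorn 20.1.0 source; error handling and exit-status checks omitted):

```python
import os

def reap_workers(arbiter):
    # simplified sketch: reap dead children and fire child_exit for each one
    while True:
        wpid, _status = os.waitpid(-1, os.WNOHANG)  # non-blocking reap of any exited child
        if not wpid:
            break
        worker = arbiter.WORKERS.pop(wpid, None)
        if worker is None:
            continue  # a pid the arbiter no longer tracks
        worker.tmp.close()
        arbiter.cfg.child_exit(arbiter, worker)  # the hook whose log is sometimes missing
```

In a sketch like this, a pid that is no longer in WORKERS when it is reaped would simply not get a child_exit call, which would match the missing log lines.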

Steps to reproduce

  1. Create a script which calls the Arbiter.run function.
  2. In the same process, create and start a second thread that periodically reloads the arbiter (by sending it SIGHUP).
  3. Wait some time and check whether, at a random point, a "deadlock/livelock" situation occurs in which no new reloads are made and the arbiter makes no progress.

You need to preload the app.
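A minimal, self-contained reproduction sketch (the framework wrapper is replaced here by a trivial BaseApplication subclass, which I believe is equivalent for this purpose; the WSGI handler and interval are placeholders):

```python
import os
import signal
import threading
import time

from gunicorn.app.base import BaseApplication
from gunicorn.arbiter import Arbiter

def wsgi_app(environ, start_response):
    # trivial WSGI handler standing in for the real application
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

class ReproApp(BaseApplication):
    # minimal stand-in for the framework that normally builds the Gunicorn config
    def load_config(self):
        self.cfg.set("workers", 5)
        self.cfg.set("worker_class", "gthread")
        self.cfg.set("preload_app", True)

    def load(self):
        return wsgi_app

def periodic_reload(interval=60):
    # step 2: a second thread that periodically reloads the arbiter
    while True:
        time.sleep(interval)
        os.kill(os.getpid(), signal.SIGHUP)

if __name__ == "__main__":
    threading.Thread(target=periodic_reload, daemon=True).start()
    Arbiter(ReproApp()).run()  # step 1: same call the framework makes
```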

andre-b-fernandes (Author) commented

I've also noticed that I don't get the same error if I have only one worker, or if I add a sleep in the manage_workers function before a worker is killed (when the arbiter kills the excess older workers).

benoitc self-assigned this Jul 10, 2023