WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master
Open, Low, Public

Description

Several times today transactions from videoscalers tmh100[12] have piled up on db1040. They apparently run this sort of query:

INSERT /* WebVideoTranscodeJob::run */ IGNORE INTO transcode (transcode_image_name, transcode_key, transcode_time_startwork, transcode_error, transcode_final_bitrate) VALUES ('Night_of_the_Living_Dead_(1968_film).webm', '160p.ogv', '20150430131138', '', '0')

Then they hang around blocking stuff without committing until Icinga alerts go off and they risk getting killed. Do we expect to hold locks for the duration of the transcoding? Is this a recent change? If not, why is this suddenly happening now?

EDIT: Transactions no longer seem to stay open, but connections do, until they time out on the database.

Event Timeline

Springle raised the priority of this task to Needs Triage.
Springle updated the task description. (Show Details)
Springle subscribed.

This is recurring.

Several long-running jobs on tmh100*, perhaps related:

www-data 23579  159  1.2 986544 202352 ?       RNl  13:21  36:06  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_bc66c506e711-1.webm -threads
www-data 18401  153  1.2 1000048 210064 ?      RNl  13:05  59:49  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_1b094e5bdce1-1.webm -threads
www-data 18366 78.7  0.5 219024 92480 ?        RN   13:05  30:58  |                       \_ /usr/bin/ffmpeg2theora /tmp/localcopy_0248b2397fe5-1.webm -V 512 -
www-data 18252  150  1.4 1041916 241184 ?      RNl  13:04  60:42  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_90771f9f8632-1.webm -threads
www-data 29231 94.1  0.5 210448 86572 ?        RN   13:40   4:02                          \_ /usr/bin/ffmpeg2theora /tmp/localcopy_b9fb9c780cd6-1.webm -V 160 -

This is recurring.

Seven weeks later: Still occurring? Still high priority? Any news?

Ten weeks later: Still occurring? Still high priority? Any news?

Three months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

I presume you mean db1040? :)

There's no sign of any long-running queries like that on db1040, but that's not to say the bug is fixed, etc.

Related is T127428. It seems the job runner RPC service wraps the whole transcoding session in a database transaction, so while the transcoding happens there is an ongoing idle transaction on the database.
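
Schematically, the problematic shape is roughly the following (an illustration of the pattern only, not the actual job runner or WebVideoTranscodeJob code; $command, $fileName and $transcodeKey are made-up placeholders):

// The job records that work has started, then shells out to the encoder
// while the surrounding transaction stays open and the connection sits
// idle on the s4 master for the whole duration of the transcode.
$dbw = wfGetDB( DB_MASTER );
$dbw->begin( __METHOD__ );
$dbw->insert(
	'transcode',
	[
		'transcode_image_name' => $fileName,
		'transcode_key' => $transcodeKey,
		'transcode_time_startwork' => $dbw->timestamp(),
	],
	__METHOD__,
	[ 'IGNORE' ]
);

// Long shell-out: avconv / ffmpeg2theora can run for hours here.
wfShellExec( $command, $retval );

// Only now is anything committed and the connection released.
$dbw->update(
	'transcode',
	[ 'transcode_error' => '', 'transcode_final_bitrate' => 0 ],
	[ 'transcode_image_name' => $fileName, 'transcode_key' => $transcodeKey ],
	__METHOD__
);
$dbw->commit( __METHOD__ );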

We worked around the third-party connection issues by tweaking the connection pool handling, but the long-running transactions remain open until they time out (default: 3600 seconds).

Note: no ongoing queries this time, only idle connections (open transactions?).

Two questions:

First, is the current code closing out the transactions correctly?

Second, what are we supposed to do, as a general rule, during long-running processes that do not touch the database? Can we close out the connections and reopen them later?

Couple notes:

https://commons.wikimedia.org/wiki/Special:TimedMediaHandler shows what TimedMediaHandler thinks is currently running. There are a lot of new files, including very long speeches / talks / conference videos which take forever to transcode. This is backing up the queue right now by over an hour.

However, the queues being full ought not to result in unreasonable load. If it does, we may have to tune how many processes run and how many servers are available.

We also recently added two new low resolution webm transcode targets, but there are no automatic jobs re-running them on old files so that should not cause a major backup itself.

Can we close out the connections and reopen them later?

No "can", must. For several reasons:

  • A long idle open connection takes resources (mostly memory) away from other connections that may be doing something useful. It is particularly harmful with the thread pool we have deployed on our servers
  • In certain cases, it can lead to max connection exhaustion
  • The server will automatically close long running connections anyway
  • The application must be prepared for connections to fail at any time (e.g. on failover)
  • The application must reread its database configuration regularly (e.g. if a server is depooled)

Not closing idle connections would mean committing to a server being up for hours. A server being available is not a guarantee; only the service is guaranteed.

In the MySQL model, opening new connections is not costly. Do not do it on every single query, but unless very fast queries are going to be sent one after another, connections should be closed as soon as MySQL is no longer needed (even before the request is served). That increases the amount of work MySQL can do in parallel.

If at some point connections become expensive (e.g. with a TLS deployment), that should be made transparent to the application by putting a higher-level proxy with persistent connections in front of the database. The application should not try to optimize for that at all.
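
To make that concrete, here is a minimal sketch of the close-when-idle pattern in plain PHP/mysqli (illustrative only: the host, credentials and WHERE clause are placeholders, and MediaWiki code would go through its load balancer rather than raw mysqli):

// Do the quick database work, release the connection before the long
// task, and reconnect only when database work resumes.
$db = new mysqli( 'db-host.example', 'wikiuser', 'secret', 'commonswiki' );
$db->query( "UPDATE transcode SET transcode_time_startwork = NOW()
             WHERE transcode_image_name = 'Example.webm' AND transcode_key = '160p.ogv'" );
$db->close(); // frees the server thread instead of leaving it sleeping

// ... the transcode runs here for minutes or hours, holding no connection ...

// Opening a new connection is cheap; do it only when it is needed again.
$db = new mysqli( 'db-host.example', 'wikiuser', 'secret', 'commonswiki' );
$db->query( "UPDATE transcode SET transcode_error = ''
             WHERE transcode_image_name = 'Example.webm' AND transcode_key = '160p.ogv'" );
$db->close();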

As far as I can see, no ongoing transactions (open transactions), only open connections:

| 19982549 | wikiuser        | 10.64.16.146:46469 | commonswiki        | Sleep       |    1166 |
...
History list length 256
...
---TRANSACTION 6084148123, not started
MySQL thread id 19982549, OS thread handle 0x7faddbc83700, query id 1263078851 10.64.16.146 wikiuser cleaning up

So the bug was partially fixed (transactions seem to finish, but the connections do not disconnect).

OK, so follow-up: "in MediaWiki, how do we close out all DB connections in a reliable way that allows them to open again later?" I've never had this particular use case, so I will probably have to dive into the lb classes. :)

Found LoadBalancer::closeAll() which looks handy. When I'm back on the ground I'll test to confirm whether the connections reopen on demand.
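
For reference, a rough sketch of how that could be wired around the shell-out (this is not the actual patch, just the idea; it assumes the wfGetLBFactory()/wfGetLB() accessors available at the time):

// Commit any pending writes, then drop every connection on the local
// load balancer before the long transcode shell-out.
$lbFactory = wfGetLBFactory();
$lbFactory->commitMasterChanges( __METHOD__ );
wfGetLB()->closeAll();

// ... avconv / ffmpeg2theora runs here, possibly for hours, with no
// database connection held open ...

// Connections are reopened lazily: the next getConnection() call simply
// establishes a fresh one.
$dbw = wfGetLB()->getConnection( DB_MASTER );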

Change 286381 had a related patch set uploaded (by Brion VIBBER):
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

So looking at the scalers, here is what I found:

  • there are several jobs running in hhvm, more than the number that should be enqueued by the jobrunner process (which would be 5)
  • several jobs fail to be transcoded within the 6-hour limit we impose

Some more data:

So it seems like POSTs get responded to before the jobs are done, and the number of running jobs can thus greatly exceed the configured number.

maxtime=60 seems like a problem. What enforces that?

The application must reread its database configuration regularly (e.g. if a server is depooled)

So for that:

Shut down database connections during transcode shell-outs
https://gerrit.wikimedia.org/r/286381

to be able to reconnect afterwards, would it need to make a new "request"?

jcrespo renamed this task from "transcode transaction blocking for minutes on s4 master" to "transcode connections open for minutes on s4 master". May 3 2016, 6:59 AM
jcrespo updated the task description. (Show Details)

Change 290478 had a related patch set uploaded (by Brion VIBBER):
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478

Change 290478 merged by Aaron Schulz:
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478

So do we know if this is any better now? Can this be closed?

As far as I can tell, @TheDJ, only support for closeAll() has been added; now it has to be used by the client code, I assume?

This should reduce the number of long-running connections, mainly to the s4 master; see the purge lag and processlist graphs:
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad prometheus/ops&var-server=db1040&from=now-24h&to=now

Sorry, I can only say thank you for resurrecting this! @brion

No need for sorries! The patch resurrection is upon us, let us rejoice and merge. :D

[Note I had to make a slight update for backend changes that made it different how one gets to the LBFactory singleton. Confirmed that it works in the Vagrant setup, at least in isolation.]

Change 286381 merged by jenkins-bot:
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

I do not know if it is fixed; right now we just kill sleeping connections after 60 seconds.

jcrespo claimed this task.
jcrespo reassigned this task from jcrespo to brooke.
jcrespo lowered the priority of this task from High to Low.

I checked and this still happens (connections being idle for a long time). Not 100% sure, but I would suspect it is for the same initial reason. It is not impacting production because we kill the connections, as I said above.

Krinkle renamed this task from "transcode connections open for minutes on s4 master" to "WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master". Jul 29 2018, 1:42 AM
Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board.
Krinkle removed a project: MW-1.28-release-notes.

@jcrespo and @Ladsgroup, does any of this still occur? There have been several changes to the job since that 2017 state, but I have no idea if this is still a thing.

There are now sections like this before the transcoding and before saving to the filestore, which are the two major long-running phases of the job.

			// Avoid contention and "server has gone away" errors as
			// the transcode will take a very long time in some cases
			$this->lbFactory->commitPrimaryChanges( __METHOD__ );
			$this->lbFactory->flushPrimarySessions( __METHOD__ );
			$this->lbFactory->flushReplicaSnapshots( __METHOD__ );
			// We can't just leave the connection open either or it will
			// eat up resources and block new connections, so make sure
			// everything is dead and gone.
			$this->lbFactory->closeAll();
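
The reopen side needs no special handling: once the shell-out returns, the next connection request re-establishes a connection lazily, since closeAll() only drops the sockets, not the configuration. A hypothetical illustration (not the actual job code; transcode_time_success is assumed from the TimedMediaHandler schema, and $fileName / $transcodeKey are placeholders):

			// After the shell-out, asking the load balancer for a primary
			// connection again transparently reconnects.
			$dbw = $this->lbFactory->getMainLB()->getConnection( DB_PRIMARY );
			$dbw->update(
				'transcode',
				[ 'transcode_time_success' => $dbw->timestamp() ],
				[
					'transcode_image_name' => $fileName,
					'transcode_key' => $transcodeKey,
				],
				__METHOD__
			);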