WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master
Open, Low, Public

Description

Several times today transactions from videoscalers tmh100[12] have piled up on db1040. They apparently run this sort of query:

INSERT /* WebVideoTranscodeJob::run */ IGNORE INTO transcode (transcode_image_name, transcode_key, transcode_time_startwork, transcode_error, transcode_final_bitrate) VALUES ('Night_of_the_Living_Dead_(1968_film).webm', '160p.ogv', '20150430131138', '', '0')

Then they hang around blocking stuff without committing until Icinga alerts go off and they risk getting killed. Do we expect to hold locks for the duration of the transcoding? Is this a recent change? If not, why is this suddenly happening now?

EDIT: Transactions no longer seem to stay open, but connections do, until they time out on the database.

Event Timeline

Springle raised the priority of this task to Needs Triage.
Springle updated the task description. (Show Details)
Springle subscribed.

This is recurring.

Several long-running jobs on tmh100*, perhaps related:

www-data 23579  159  1.2 986544 202352 ?       RNl  13:21  36:06  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_bc66c506e711-1.webm -threads
www-data 18401  153  1.2 1000048 210064 ?      RNl  13:05  59:49  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_1b094e5bdce1-1.webm -threads
www-data 18366 78.7  0.5 219024 92480 ?        RN   13:05  30:58  |                       \_ /usr/bin/ffmpeg2theora /tmp/localcopy_0248b2397fe5-1.webm -V 512 -
www-data 18252  150  1.4 1041916 241184 ?      RNl  13:04  60:42  |                       \_ /usr/bin/avconv -y -i /tmp/localcopy_90771f9f8632-1.webm -threads
www-data 29231 94.1  0.5 210448 86572 ?        RN   13:40   4:02                          \_ /usr/bin/ffmpeg2theora /tmp/localcopy_b9fb9c780cd6-1.webm -V 160 -

This is recurring.

Seven weeks later: Still occurring? Still high priority? Any news?

Ten weeks later: Still occurring? Still high priority? Any news?

Three months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

Five months later: Anyone who could check on mw1040 if this is still occurring? Still "high priority"?

I presume you mean db1040? :)

There's no sign of any long-running queries like that on db1040, but that's not to say the bug is fixed, etc.

Related is T127428. It seems the job runner RPC service wraps the whole transcoding session in a database transaction, so while the transcoding happens there is an ongoing idle transaction on the database.
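
Schematically, the problematic shape is roughly the following (an illustration of the pattern only, not the actual job runner or WebVideoTranscodeJob code; $command, $fileName and $transcodeKey are made-up placeholders):

// The job records that work has started, then shells out to the encoder
// while the surrounding transaction stays open and the connection sits
// idle on the s4 master for the whole duration of the transcode.
$dbw = wfGetDB( DB_MASTER );
$dbw->begin( __METHOD__ );
$dbw->insert(
	'transcode',
	[
		'transcode_image_name' => $fileName,
		'transcode_key' => $transcodeKey,
		'transcode_time_startwork' => $dbw->timestamp(),
	],
	__METHOD__,
	[ 'IGNORE' ]
);

// Long shell-out: avconv / ffmpeg2theora can run for hours here.
wfShellExec( $command, $retval );

// Only now is anything committed and the connection released.
$dbw->update(
	'transcode',
	[ 'transcode_error' => '', 'transcode_final_bitrate' => 0 ],
	[ 'transcode_image_name' => $fileName, 'transcode_key' => $transcodeKey ],
	__METHOD__
);
$dbw->commit( __METHOD__ );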

We worked around the third-party connection issues by tweaking the connection pool handling, but the long-running transactions remain open until they time out (default: 3600 seconds).

Note: no ongoing queries this time, only idle connections (open transactions?).

Two questions:

First, is the current code closing out the transactions correctly?

Second, what are we supposed to do, as a general rule, during long-running processes that do not touch the database? Can we close out the connections and reopen them later?

Couple notes:

https://commons.wikimedia.org/wiki/Special:TimedMediaHandler shows what TimedMediaHandler thinks is currently running. There are a lot of new files, including very long speeches / talks / conference videos which take forever to transcode. This is backing up the queue right now by over an hour.

However, the queues being full ought not to result in unreasonable load. If it does, we may have to tune how many processes run and how many servers are available.

We also recently added two new low resolution webm transcode targets, but there are no automatic jobs re-running them on old files so that should not cause a major backup itself.

Can we close out the connections and reopen them later?

No "can", must. For several reasons:

  • A long idle open connection takes resources (mostly memory) away from other connections that may be doing something useful. It is particularly harmful with the thread pool we have deployed on our servers
  • In certain cases, it can lead to max connection exhaustion
  • The server will automatically close long running connections anyway
  • The application must be prepared for connections to fail at any time (e.g. on failover)
  • The application must reread its database configuration regularly (e.g. if a server is depooled)

Not closing idle connections would mean committing to a server being up for hours. A server being available is not a guarantee; only the service is guaranteed.

In the MySQL model, opening new connections is not costly. Do not do it on every single query, but unless very fast queries are going to be sent one after another, connections should be closed as soon as MySQL is no longer needed (even before the request is served). That increases the amount of work MySQL can do in parallel.

If at some point connections become expensive (e.g. with a TLS deployment), that should be made transparent to the application by putting a higher-level proxy with persistent connections in front of the database. The application should not try to optimize for that at all.
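
To make that concrete, here is a minimal sketch of the close-when-idle pattern in plain PHP/mysqli (illustrative only: the host, credentials and WHERE clause are placeholders, and MediaWiki code would go through its load balancer rather than raw mysqli):

// Do the quick database work, release the connection before the long
// task, and reconnect only when database work resumes.
$db = new mysqli( 'db-host.example', 'wikiuser', 'secret', 'commonswiki' );
$db->query( "UPDATE transcode SET transcode_time_startwork = NOW()
             WHERE transcode_image_name = 'Example.webm' AND transcode_key = '160p.ogv'" );
$db->close(); // frees the server thread instead of leaving it sleeping

// ... the transcode runs here for minutes or hours, holding no connection ...

// Opening a new connection is cheap; do it only when it is needed again.
$db = new mysqli( 'db-host.example', 'wikiuser', 'secret', 'commonswiki' );
$db->query( "UPDATE transcode SET transcode_error = ''
             WHERE transcode_image_name = 'Example.webm' AND transcode_key = '160p.ogv'" );
$db->close();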

As far as I can see, no ongoing transactions (open transactions), only open connections:

| 19982549 | wikiuser        | 10.64.16.146:46469 | commonswiki        | Sleep       |    1166 |
...
History list length 256
...
---TRANSACTION 6084148123, not started
MySQL thread id 19982549, OS thread handle 0x7faddbc83700, query id 1263078851 10.64.16.146 wikiuser cleaning up

So the bug was partially fixed (transactions seem to finish, but the connections do not disconnect).

OK, so follow-up: "in MediaWiki, how do we close out all DB connections in a reliable way that allows them to open again later?" I've never had this particular use case, so I will probably have to dive into the lb classes. :)

Found LoadBalancer::closeAll() which looks handy. When I'm back on the ground I'll test to confirm whether the connections reopen on demand.
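
For reference, a rough sketch of how that could be wired around the shell-out (this is not the actual patch, just the idea; it assumes the wfGetLBFactory()/wfGetLB() accessors available at the time):

// Commit any pending writes, then drop every connection on the local
// load balancer before the long transcode shell-out.
$lbFactory = wfGetLBFactory();
$lbFactory->commitMasterChanges( __METHOD__ );
wfGetLB()->closeAll();

// ... avconv / ffmpeg2theora runs here, possibly for hours, with no
// database connection held open ...

// Connections are reopened lazily: the next getConnection() call simply
// establishes a fresh one.
$dbw = wfGetLB()->getConnection( DB_MASTER );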

Change 286381 had a related patch set uploaded (by Brion VIBBER):
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

So looking at the scalers, here is what I found:

  • there are several jobs running in hhvm, more than the number that should be enqueued by the jobrunner process (which would be 5)
  • several jobs fail to be transcoded within the 6-hour limit we impose

Some more data:

So it seems like POSTs get responded to before the jobs are done, and the number of running jobs can thus greatly exceed the configured number.

maxtime=60 seems like a problem. What enforces that?

The application must reread its database configuration regularly (e.g. if a server is depooled)

So for that:

Shut down database connections during transcode shell-outs
https://gerrit.wikimedia.org/r/286381

to be able to reconnect afterwards, would it need to make a new "request"?

jcrespo renamed this task from "transcode transaction blocking for minutes on s4 master" to "transcode connections open for minutes on s4 master". May 3 2016, 6:59 AM
jcrespo updated the task description. (Show Details)

Change 290478 had a related patch set uploaded (by Brion VIBBER):
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478

Change 290478 merged by Aaron Schulz:
Add LBFactory::closeAll() method

https://gerrit.wikimedia.org/r/290478

So do we know if this is any better now? Can this be closed?

As far as I can tell, @TheDJ, only support for closeAll() has been added; now it has to be used by the client code, I assume?

This should reduce the number of long-running connections, mainly to the s4 master; see the purge lag and processlist graphs:
https://grafana.wikimedia.org/dashboard/db/mysql?var-dc=eqiad prometheus/ops&var-server=db1040&from=now-24h&to=now

Sorry, I can only say thank you for resurrecting this! @brion

No need for sorries! The patch resurrection is upon us, let us rejoice and merge. :D

[Note I had to make a slight update for backend changes that made it different how one gets to the LBFactory singleton. Confirmed that it works in the Vagrant setup, at least in isolation.]

Change 286381 merged by jenkins-bot:
Shut down database connections during transcode shell-outs

https://gerrit.wikimedia.org/r/286381

I do not know if it is fixed; right now we just kill sleeping connections after 60 seconds.

jcrespo claimed this task.
jcrespo reassigned this task from jcrespo to brooke.
jcrespo lowered the priority of this task from High to Low.

I checked and this still happens (connections being idle for a long time). Not 100% sure, but I would suspect it is for the same initial reason. It is not impacting production because we kill the connections, as I said above.

Krinkle renamed this task from "transcode connections open for minutes on s4 master" to "WebVideoTranscodeJob is keeping database connections open for several minutes on s4 master". Jul 29 2018, 1:42 AM
Krinkle moved this task from Untriaged to Usage problem on the MediaWiki-libs-Rdbms board.
Krinkle removed a project: MW-1.28-release-notes.

@jcrespo and @Ladsgroup, does any of this still occur? There have been several changes to the job since that 2017 state, but I have no idea if this is still a thing.

There are now sections like this before the transcoding and before saving to the filestore, which are the two major long-running phases of the job.

			// Avoid contention and "server has gone away" errors as
			// the transcode will take a very long time in some cases
			$this->lbFactory->commitPrimaryChanges( __METHOD__ );
			$this->lbFactory->flushPrimarySessions( __METHOD__ );
			$this->lbFactory->flushReplicaSnapshots( __METHOD__ );
			// We can't just leave the connection open either or it will
			// eat up resources and block new connections, so make sure
			// everything is dead and gone.
			$this->lbFactory->closeAll();
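
The reopen side needs no special handling: once the shell-out returns, the next connection request re-establishes a connection lazily, since closeAll() only drops the sockets, not the configuration. A hypothetical illustration (not the actual job code; transcode_time_success is assumed from the TimedMediaHandler schema, and $fileName / $transcodeKey are placeholders):

			// After the shell-out, asking the load balancer for a primary
			// connection again transparently reconnects.
			$dbw = $this->lbFactory->getMainLB()->getConnection( DB_PRIMARY );
			$dbw->update(
				'transcode',
				[ 'transcode_time_success' => $dbw->timestamp() ],
				[
					'transcode_image_name' => $fileName,
					'transcode_key' => $transcodeKey,
				],
				__METHOD__
			);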