
Database upgrade MariaDB 10: 600 seconds timeout
Closed, Resolved (Public)

Description

I'm one of the maintainers of the Lists tool: http://tools.wmflabs.org/lists/

This tool executes a series of queries every day. After each query, it runs a second query to record statistics about the run.

If the primary query runs for more than 600 seconds, the secondary one fails with the error "General error: 2006 MySQL server has gone away".

This issue began after the migration to MariaDB 10.


Version: unspecified
Severity: normal

Details

Reference
bz69110

Event Timeline

bzimport raised the priority of this task to Needs Triage. Nov 22 2014, 3:34 AM
bzimport added a project: Cloud-VPS.
bzimport set Reference to bz69110.

I've found that the problem is related to the primary queries: some of them are now slower than before and exceed the 600-second limit.

For example, the query [1] used to run in 170 seconds and now takes more than 1000 seconds.

[1] http://tools.wmflabs.org/lists/itwiki/Voci/Voci_senza_uscita

Please post examples of both queries.

The query is at the bottom of the page linked above. The query that fails is not important; the problem is that this query takes too long to run.

I realize you think the speed is the problem, which I agree is an issue. However there is no "over 600s" type kill mechanism, so I'm interested in establishing two things:

  1. Why the first query is slow. Thanks, I see the example now.
  2. Why the second query dies and whether it is, in fact, related to the speed of the first query, or to something else unexpected. Hence I asked to see it too...

Incola: "not important" doesn't really exist when trying to find steps to reproduce. :)

The second query is something like:

insert into executions (query_id, time, duration, results) values (23, '2014-08-05 14:01:05', 1879, 5290)
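
A minimal sketch of the same insert issued with bound parameters, so that values such as the timestamp do not need manual quoting. It uses Python's built-in `sqlite3` module as a stand-in for the MariaDB connection. The column types and the `record_execution` helper are assumptions for illustration; only the table and column names come from the statement above.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the tool's MariaDB database.
conn = sqlite3.connect(":memory:")

# Column types are assumed; only the names come from the thread.
conn.execute(
    "CREATE TABLE executions ("
    "query_id INTEGER, time TEXT, duration INTEGER, results INTEGER)"
)


def record_execution(conn, query_id, started_at, duration, results):
    """Store one statistics row using bound parameters (hypothetical helper)."""
    conn.execute(
        "INSERT INTO executions (query_id, time, duration, results) "
        "VALUES (?, ?, ?, ?)",
        (query_id, started_at, duration, results),
    )
    conn.commit()


record_execution(conn, 23, "2014-08-05 14:01:05", 1879, 5290)
rows = conn.execute(
    "SELECT query_id, time, duration, results FROM executions"
).fetchall()
```

Common MySQL/MariaDB drivers for Python use `%s` placeholders instead of `?`, so the statement text would need adjusting for a real connection.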

The first query is heavily dependent on disk IO. It runs in ~1000s on both MariaDB 10 and 5.5 if the data is cold, or if any other concurrent query is also bottlenecked on disk. This should be reviewed once the switch back to SSD is done (to be scheduled very shortly after labsdb1003 migrates).

Regarding the second query dying or losing connection, which still seems odd, it would be useful to know:

  • If the first query always completes regardless of slow runtime, or sometimes fails/is-killed itself.
  • If there is any delay between issuing the two queries on the same DB connection (seconds, minutes, etc ..).
  • If there is any transaction in use, either via explicit BEGIN or AUTO_COMMIT=0.
  • What client connector or library is used, and whether it could have any custom timeout settings.

  • The first query always runs correctly.
  • They are on different connections.
  • I don't know, because I'm not the original author of the code and I don't know how the framework it uses works.
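
Since the two queries use different connections, one possibility (not confirmed in this thread) is that the statistics connection sits idle while the primary query runs and is closed by an idle timeout before the insert is issued. A minimal sketch of one way to avoid depending on an idle connection: open the statistics connection only after the primary query has finished, and retry once on a fresh connection if the write fails. The names `run_and_record`, `run_primary`, `connect_stats`, and `record_stats` are hypothetical, and the generic `ConnectionError` stands in for whatever error class a real driver raises for code 2006.

```python
def run_and_record(run_primary, connect_stats, record_stats, retries=1):
    """Run the primary query, then record statistics on a fresh connection.

    run_primary   -- hypothetical callable returning the primary query's result
    connect_stats -- hypothetical callable returning a new statistics connection
    record_stats  -- hypothetical callable (conn, result) performing the insert
    """
    result = run_primary()  # may run longer than any idle timeout
    for attempt in range(retries + 1):
        conn = connect_stats()  # opened only now, so it has not been sitting idle
        try:
            record_stats(conn, result)
            return result
        except ConnectionError:
            # Stand-in for the driver's "server has gone away" error.
            if attempt == retries:
                raise
        finally:
            conn.close()
```

Whether this helps depends on where the 600-second limit actually comes from, which the thread does not establish.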

After switching back to SSD, no errors are reported and the queries run with their previous timings.