
QueryService updater blocked
Closed, Resolved · Public

Description

After seeing a message from a PM we investigated and saw a load of errors in the update log.

It all seems to be related to QS batches from the wiki framenet-akkadian257.

Event Timeline

We plan to manually mark all batches from this wiki as failed so they no longer block the other wikis from updating.

Manually failed the batches using:

> Wiki::where(['id' => 1023])->first()
= App\Wiki {#60508
    id: 1023,
    domain: "framenet-akkadian257.wikibase.cloud",
    sitename: "FrameNet Akkadian 257",
    deleted_at: null,
    created_at: "2024-03-26 04:19:43",
    updated_at: "2024-03-26 04:19:43",
    description: null,
    is_featured: 0,
    domain_decoded: "framenet-akkadian257.wikibase.cloud",
  }

> QsBatch::where(['wiki_id' => 1023, 'done' => 0])->update(['failed' => 1])
= 22339
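
As a quick sanity check (just a sketch, using the same fields as the commands above), the number of batches still eligible for processing after the bulk fail can be counted with:

> QsBatch::where(['done' => 0, 'failed' => 0])->count()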

It could also be that the queryservice itself is straining under load; it seems to be out of memory.

image.png (33 KB)
We got muddled with staging

Log entries there look like:

Queryservice updater was scaled down to 0 at 11:08 Berlin time

We saw a lot of uptime checks in the logs (obviously), and we also noticed that we started being charged (!) for these around the time the incident happened, so we wondered whether this was part of the issue. We manually changed the path of the uptime jobs in the Google Cloud Dashboard to target something that will 404, in the hope that this would solve our issue.

It did not

We killed the one QS pod and saw its CPU return to 0. We let it soak for 5 minutes and then scaled the QS updater back up. Things seem to be happy and we'll allow the batches to catch up.

Once this is done, our next step is to re-enable all the batches we marked as failed for framenet-akkadian257.
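
A minimal sketch of how that re-enable could be done in chunks rather than all at once, assuming we want to avoid flooding the query service again (chunk size and pause are arbitrary; field names are those used in the commands above):

while (QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->exists()) {
    // Wait until the previously released chunk has drained before releasing more.
    if (QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->exists()) {
        sleep(60);
        continue;
    }
    QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])
        ->limit(500)
        ->update(['failed' => 0, 'processing_attempts' => 0]);
}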

It seems some of the pending batches belonged to wikis that have since been deleted, so I manually failed these to unclog the queue:

QsBatch::doesntHave('wiki')->where(['done' => 0, 'failed' => 0])->update(['failed' => 1])

Next, I partially undid the manual failing for the potentially offending wiki:

> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->limit(500)->update(['failed' => 0, 'processing_attempts' => 0])
= 500

Seeing that processing these items made the Queryservice CPU max out again, I decided to "re-fail" the offending items in order to restore the service:

> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->update(['failed' => 1])
= 499

The service is not working perfectly: qualifiers and references aren't loaded in the query service, and statements aren't getting best rank either.

As a check: running https://hypotheseis.wikibase.cloud/query/#PREFIX hd: DESCRIBE hd:Q298 returns 21 results, among which there are no references or qualifiers; however, https://hypotheseis.wikibase.cloud/wiki/Item:Q112 contains 1 qualifier and 2 references. Moreover, the item contains 1 property with two values, one with preferred rank and one with normal rank, but the query only shows the preferred-rank value; the other is completely ignored.
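
For a quick way to tell whether any qualifiers made it into the triple store at all, something like the following could be run against the wiki's SPARQL endpoint (a sketch only: the endpoint path and the prop/qualifier prefix are assumptions based on the default Wikibase RDF layout):

$endpoint = 'https://hypotheseis.wikibase.cloud/query/sparql';
$query = <<<'SPARQL'
ASK {
  ?statement ?pq ?qualifierValue .
  FILTER(STRSTARTS(STR(?pq), "https://hypotheseis.wikibase.cloud/prop/qualifier/"))
}
SPARQL;
$result = json_decode(file_get_contents($endpoint . '?format=json&query=' . urlencode($query)), true);
var_dump($result['boolean'] ?? null); // false would mean no qualifier triples were loaded at all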

07:38:27.936 [main] INFO  o.w.query.rdf.tool.WbStackUpdate - Starting Updater 0.3.84 (3bcfddc0f526c9a227773e68fe29419343d33993) --wikibaseHost asdfasdf.wbaas.localhost --ids Q1,Q2 --entityNamespaces 120,122,146 --sparqlUrl http://queryservice.default.svc.cluster.local:9999/bigdata/namespace/qsns_b61b483051/sparql --wikibaseScheme http --conceptUri https://asdfasdf.wbaas.localhost
07:38:59.157 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
07:38:59.158 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
07:39:04.159 [main] INFO  o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the dump
07:39:05.648 [main] INFO  o.w.q.r.t.change.ChangeSourceContext - Defaulting start time to 30 days ago: 2024-03-05T07:39:05.648Z
07:39:23.157 [main] INFO  o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.io.EOFException: HttpConnectionOverHTTP@529a95d::SocketChannelEndPoint@45110cb4{queryservice.default.svc.cluster.local/10.102.88.47:9999<->/10.244.0.123:37970,ISHUT,fill=-,flush=-,to=7639/0}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@529a95d(l:/10.244.0.123:37970 <-> r:queryservice.default.svc.cluster.local/10.102.88.47:9999,closed=false)=>HttpChannelOverHTTP@77241e03(exchange=HttpExchange@5d1f6df8 req=TERMINATED/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@221f326(req=QUEUED,snd=COMPLETED,failure=null)[HttpGenerator@4f983b7a{s=START}],recv=HttpReceiverOverHTTP@189be08c(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]], attempt 1, will retry
07:39:25.159 [main] INFO  o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 2, will retry
07:39:29.161 [main] INFO  o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 3, will retry
07:39:37.166 [main] INFO  o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 4, will retry
07:39:47.167 [main] INFO  o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 5, will retry
07:40:01.558 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 1 at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) ids per second

Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above:

{"type":"lexeme","lemmas":{"mis":{"language":"mis","value":"narạ̄ṭu"}},"lexicalCategory":"Q1","language":"Q2","claims":{},"forms":[],"senses":[]}

where Q1 and Q2 are items I manually made for this purpose.
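
For reference, the same upload could be reproduced outside the API sandbox with a plain wbeditentity call (a sketch, assuming the local wiki's API lives under the host from the log above and that you have a valid CSRF token; the anonymous token shown only works if the wiki allows it):

$api = 'http://asdfasdf.wbaas.localhost/w/api.php';
$lexeme = '{"type":"lexeme","lemmas":{"mis":{"language":"mis","value":"narạ̄ṭu"}},"lexicalCategory":"Q1","language":"Q2","claims":{},"forms":[],"senses":[]}';
$response = file_get_contents($api, false, stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query([
            'action' => 'wbeditentity',
            'new'    => 'lexeme',
            'data'   => $lexeme,
            'format' => 'json',
            'token'  => '+\\', // anonymous CSRF token; replace with a real one from a logged-in session
        ]),
    ],
]));
echo $response;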

> Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above

Did this also cause a CPU spike in the query service?

This is now done, since we seem to have a functional QS for most wikis. Work on investigating the unusual behaviour of this specific Wikibase will continue in T361436.

FWIW, in Telegram we have had a report that federated queries to Wikidata are no longer working.

I suspect it is a case similar to Suite, as mentioned in T355893#9660154: specifically, whitelist.txt is being copied to the WDQS container, but the appropriate settings have not been added to use it rather than the new default name, allowlist.txt.
{F47006977}

@GreenReaper Thanks for the actionable report. I did indeed miss this change when updating the base image. A PR with a fix can be found here: https://github.com/wbstack/queryservice/pull/111

An update of the Queryservice using allowlist.txt is now running in production.

Thanks for the quick fix for federation; one of the users reporting the issue says it is all good now. If anyone else still has issues, I'll suggest they file a separate issue. 👍