After seeing a message from a PM, we investigated and saw a large number of errors in the update log.
They all seem to be related to QS batches from the wiki framenet-akkadian257.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Tarrow | T361551 QueryService updater blocked
Resolved | BUG REPORT | Fring | T361571 Platform API returns batches where parent wiki has been deleted
We plan to manually mark all batches from this wiki as failed so they no longer block the other wikis from updating.
Manually failed the batches using:
> Wiki::where(['id' => 1023])->first()
= App\Wiki {#60508
    id: 1023,
    domain: "framenet-akkadian257.wikibase.cloud",
    sitename: "FrameNet Akkadian 257",
    deleted_at: null,
    created_at: "2024-03-26 04:19:43",
    updated_at: "2024-03-26 04:19:43",
    description: null,
    is_featured: 0,
    domain_decoded: "framenet-akkadian257.wikibase.cloud",
  }

> QsBatch::where(['wiki_id' => 1023, 'done' => 0])->update(['failed' => 1])
= 22339
It could also be that the queryservice itself is straining under load; it seems to be out of memory. (We got muddled with staging.)
Log entries there look like:
We saw a lot of uptime checks in the logs (obviously), and we also noticed that we started being charged (!) for these around the time the incident happened, so we wondered if this was part of the issue. We manually changed the path of the uptime jobs in the Google Cloud dashboard to target something that will 404, in the hope that this would solve our issue.
It did not.
We killed the one QS pod and saw its CPU return to 0. We let it soak for 5 minutes and then scaled the QS updater back up. Things seem to be happy and we'll allow the batches to catch up.
Once this is done, our next step is to re-enable all the batches we marked as failed from framenet-akkadian257.
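Roughly, a tinker sketch of that re-enable (same QsBatch columns as the commands above; resetting processing_attempts is an assumption about what the updater needs to pick the batches up again):

// Sketch only: flip the force-failed framenet-akkadian257 batches back to pending
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->update(['failed' => 0, 'processing_attempts' => 0])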
It seems some of the pending batches belonged to wikis that have since been deleted, so I manually failed these to unclog the queue:
QsBatch::doesntHave('wiki')->where(['done' => 0, 'failed' => 0])->update(['failed' => 1])
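If it helps later, a tinker sketch (same model; column names assumed from the queries above) for seeing which deleted wikis those orphaned batches belonged to:

// Sketch: group the orphaned, not-yet-done batches by their (now deleted) wiki_id
QsBatch::doesntHave('wiki')->where('done', 0)->selectRaw('wiki_id, count(*) as batches')->groupBy('wiki_id')->get()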
Next, I partially undid the manual failing for the potentially offending wiki:
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->limit(500)->update(['failed' => 0, 'processing_attempts' => 0])
= 500
Seeing that processing these items made the queryservice CPU max out again, I decided to "re-fail" the offending items in order to restore the service:
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->update(['failed' => 1])
= 499
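To confirm the state after re-failing, a quick tinker check (a sketch, same columns as above) could be:

// Sketch: pending vs force-failed batch counts for wiki 1023
QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->count()
QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->count()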
Service should now be restored for all wikis but https://framenet-akkadian257.wikibase.cloud/
The service is not working perfectly: qualifiers and references aren't loaded into the query service, and statements don't have best rank set.
As a check: using https://hypotheseis.wikibase.cloud/query/#PREFIX hd:
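Another way to spot-check the best-rank part from the command line is a small PHP sketch like the one below; it assumes the SPARQL endpoint lives at /query/sparql on wikibase.cloud domains, which may differ.

<?php
// Sketch: ask the wiki's query service whether ANY statement carries
// wikibase:BestRank. If this returns false while the wiki has ranked
// statements, best-rank triples are indeed missing.
$endpoint = 'https://hypotheseis.wikibase.cloud/query/sparql'; // assumed path
$query = 'PREFIX wikibase: <http://wikiba.se/ontology#> ASK { ?s wikibase:rank wikibase:BestRank . }';

$ch = curl_init($endpoint . '?query=' . urlencode($query));
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => ['Accept: application/sparql-results+json'],
]);
$result = json_decode((string) curl_exec($ch), true);
var_dump($result['boolean'] ?? null);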
07:38:27.936 [main] INFO o.w.query.rdf.tool.WbStackUpdate - Starting Updater 0.3.84 (3bcfddc0f526c9a227773e68fe29419343d33993) --wikibaseHost asdfasdf.wbaas.localhost --ids Q1,Q2 --entityNamespaces 120,122,146 --sparqlUrl http://queryservice.default.svc.cluster.local:9999/bigdata/namespace/qsns_b61b483051/sparql --wikibaseScheme http --conceptUri https://asdfasdf.wbaas.localhost
07:38:59.157 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
07:38:59.158 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
07:39:04.159 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the dump
07:39:05.648 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Defaulting start time to 30 days ago: 2024-03-05T07:39:05.648Z
07:39:23.157 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.io.EOFException: HttpConnectionOverHTTP@529a95d::SocketChannelEndPoint@45110cb4{queryservice.default.svc.cluster.local/10.102.88.47:9999<->/10.244.0.123:37970,ISHUT,fill=-,flush=-,to=7639/0}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@529a95d(l:/10.244.0.123:37970 <-> r:queryservice.default.svc.cluster.local/10.102.88.47:9999,closed=false)=>HttpChannelOverHTTP@77241e03(exchange=HttpExchange@5d1f6df8 req=TERMINATED/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@221f326(req=QUEUED,snd=COMPLETED,failure=null)[HttpGenerator@4f983b7a{s=START}],recv=HttpReceiverOverHTTP@189be08c(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]], attempt 1, will retry
07:39:25.159 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 2, will retry
07:39:29.161 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 3, will retry
07:39:37.166 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 4, will retry
07:39:47.167 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 5, will retry
07:40:01.558 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 1 at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) ids per second
Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above:
{"type":"lexeme","lemmas":{"mis":{"language":"mis","value":"narạ̄ṭu"}},"lexicalCategory":"Q1","language":"Q2","claims":{},"forms":[],"senses":[]}
where Q1 and Q2 are items I manually made for this purpose.
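For anyone replaying this outside the API sandbox, a rough PHP sketch of the same call (it assumes wbeditentity with new=lexeme is available on the target wiki, that $apiUrl points at its api.php, and that you already have a CSRF token and are handling login cookies separately; those values are placeholders):

<?php
// Sketch: recreate the lexeme via the Wikibase API.
// $apiUrl and $csrfToken are placeholders; authentication is left out.
$apiUrl = 'http://localhost/w/api.php';
$csrfToken = 'REPLACE_WITH_CSRF_TOKEN';

$lexeme = [
    'type' => 'lexeme',
    'lemmas' => ['mis' => ['language' => 'mis', 'value' => 'narạ̄ṭu']],
    'lexicalCategory' => 'Q1',
    'language' => 'Q2',
    'claims' => new stdClass(), // encode as {} to match the payload above
    'forms' => [],
    'senses' => [],
];

$ch = curl_init($apiUrl);
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS => http_build_query([
        'action' => 'wbeditentity',
        'new'    => 'lexeme',
        'data'   => json_encode($lexeme),
        'token'  => $csrfToken,
        'format' => 'json',
    ]),
]);
echo curl_exec($ch);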
> Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above
Did this also cause a CPU spike in the query service?
This is now done, since we seem to have a functional QS for most wikis. T361436 is where work will continue on investigating the unusual behaviour of this specific wikibase.
FWIW, in Telegram we have had a report that federated queries to Wikidata are no longer working.
I suspect it is a case similar to Suite, as mentioned in T355893#9660154: specifically, whitelist.txt is being copied to the WDQS container, but the appropriate settings have not been added to use it rather than the new default name, allowlist.txt.
{F47006977}
@GreenReaper Thanks for the actionable report. I did indeed miss this change when updating the base image. A PR with a fix is here: https://github.com/wbstack/queryservice/pull/111
Thanks for the quick fix for federation; one of the users reporting the issue says it is all good now. If anyone else still has issues I'll suggest they file a separate issue. 👍