After seeing a message from a PM, we investigated and saw a large number of errors in the update log.
They all seem to be related to QS batches from the wiki framenet-akkadian257.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Tarrow | T361551 QueryService updater blocked
Resolved | BUG REPORT | Fring | T361571 Platform API returns batches where parent wiki has been deleted
We plan to manually mark all batches from this wiki as failed so they no longer block the other wikis from updating.
Manually failed the batches using:
> Wiki::where(['id' => 1023])->first()
= App\Wiki {#60508
    id: 1023,
    domain: "framenet-akkadian257.wikibase.cloud",
    sitename: "FrameNet Akkadian 257",
    deleted_at: null,
    created_at: "2024-03-26 04:19:43",
    updated_at: "2024-03-26 04:19:43",
    description: null,
    is_featured: 0,
    domain_decoded: "framenet-akkadian257.wikibase.cloud",
  }

> QsBatch::where(['wiki_id' => 1023, 'done' => 0])->update(['failed' => 1])
= 22339
It could also be that the queryservice itself is straining under load; it seems to be out of memory. (We got muddled with staging.)
Log entries there look like:
We saw a lot of uptime checks in the logs (obviously), and we also noticed that we started being charged (!) for these around the time the incident happened, so we wondered if this was part of the issue. We manually changed the path of the uptime jobs in the Google Cloud dashboard to target something that will 404, in the hope that this would solve our issue.
It did not.
We killed the one QS pod and saw its CPU return to 0. We let it soak for 5 minutes and then scaled the QS updater back up. Things seem to be happy and we'll allow the batches to catch up.
Once this is done, our next step is to re-enable all the batches we marked as failed from framenet-akkadian257.
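Roughly, a tinker sketch of that re-enable (same QsBatch columns as the commands above; resetting processing_attempts is an assumption about what the updater needs to pick the batches up again):

// Sketch only: flip the force-failed framenet-akkadian257 batches back to pending
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->update(['failed' => 0, 'processing_attempts' => 0])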
It seems some of the pending batches belonged to wikis that have since been deleted, so I manually failed these to unclog the queue:
QsBatch::doesntHave('wiki')->where(['done' => 0, 'failed' => 0])->update(['failed' => 1])
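If it helps later, a tinker sketch (same model; column names assumed from the queries above) for seeing which deleted wikis those orphaned batches belonged to:

// Sketch: group the orphaned, not-yet-done batches by their (now deleted) wiki_id
QsBatch::doesntHave('wiki')->where('done', 0)->selectRaw('wiki_id, count(*) as batches')->groupBy('wiki_id')->get()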
Next, I partially undid the manual failing for the potentially offending wiki:
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->limit(500)->update(['failed' => 0, 'processing_attempts' => 0])
= 500
Seeing that processing these items made the queryservice CPU max out again, I decided to "re-fail" the offending items in order to restore the service:
> QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->update(['failed' => 1])
= 499
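To confirm the state after re-failing, a quick tinker check (a sketch, same columns as above) could be:

// Sketch: pending vs force-failed batch counts for wiki 1023
QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 0])->count()
QsBatch::where(['wiki_id' => 1023, 'done' => 0, 'failed' => 1])->count()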
Service should now be restored for all wikis but https://framenet-akkadian257.wikibase.cloud/
The service is not working perfectly: qualifiers and references aren't loaded into the query service, and statements don't have best rank set.
As a check: using https://hypotheseis.wikibase.cloud/query/#PREFIX hd:
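Another way to spot-check the best-rank part from the command line is a small PHP sketch like the one below; it assumes the SPARQL endpoint lives at /query/sparql on wikibase.cloud domains, which may differ.

<?php
// Sketch: ask the wiki's query service whether ANY statement carries
// wikibase:BestRank. If this returns false while the wiki has ranked
// statements, best-rank triples are indeed missing.
$endpoint = 'https://hypotheseis.wikibase.cloud/query/sparql'; // assumed path
$query = 'PREFIX wikibase: <http://wikiba.se/ontology#> ASK { ?s wikibase:rank wikibase:BestRank . }';

$ch = curl_init($endpoint . '?query=' . urlencode($query));
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => ['Accept: application/sparql-results+json'],
]);
$result = json_decode((string) curl_exec($ch), true);
var_dump($result['boolean'] ?? null);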
07:38:27.936 [main] INFO o.w.query.rdf.tool.WbStackUpdate - Starting Updater 0.3.84 (3bcfddc0f526c9a227773e68fe29419343d33993) --wikibaseHost asdfasdf.wbaas.localhost --ids Q1,Q2 --entityNamespaces 120,122,146 --sparqlUrl http://queryservice.default.svc.cluster.local:9999/bigdata/namespace/qsns_b61b483051/sparql --wikibaseScheme http --conceptUri https://asdfasdf.wbaas.localhost
07:38:59.157 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Checking where we left off
07:38:59.158 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the updater
07:39:04.159 [main] INFO o.w.query.rdf.tool.rdf.RdfRepository - Checking for left off time from the dump
07:39:05.648 [main] INFO o.w.q.r.t.change.ChangeSourceContext - Defaulting start time to 30 days ago: 2024-03-05T07:39:05.648Z
07:39:23.157 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.io.EOFException: HttpConnectionOverHTTP@529a95d::SocketChannelEndPoint@45110cb4{queryservice.default.svc.cluster.local/10.102.88.47:9999<->/10.244.0.123:37970,ISHUT,fill=-,flush=-,to=7639/0}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@529a95d(l:/10.244.0.123:37970 <-> r:queryservice.default.svc.cluster.local/10.102.88.47:9999,closed=false)=>HttpChannelOverHTTP@77241e03(exchange=HttpExchange@5d1f6df8 req=TERMINATED/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@221f326(req=QUEUED,snd=COMPLETED,failure=null)[HttpGenerator@4f983b7a{s=START}],recv=HttpReceiverOverHTTP@189be08c(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]], attempt 1, will retry
07:39:25.159 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 2, will retry
07:39:29.161 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 3, will retry
07:39:37.166 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 4, will retry
07:39:47.167 [main] INFO o.w.query.rdf.tool.HttpClientUtils - HTTP request failed: java.util.concurrent.ExecutionException: java.net.ConnectException: Connection refused, attempt 5, will retry
07:40:01.558 [main] INFO org.wikidata.query.rdf.tool.Updater - Polled up to 1 at (0.0, 0.0, 0.0) updates per second and (0.0, 0.0, 0.0) ids per second
Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above:
{"type":"lexeme","lemmas":{"mis":{"language":"mis","value":"narạ̄ṭu"}},"lexicalCategory":"Q1","language":"Q2","claims":{},"forms":[],"senses":[]}
where Q1 and Q2 are items I manually made for this purpose.
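For anyone replaying this outside the API sandbox, a rough PHP sketch of the same call (it assumes wbeditentity with new=lexeme is available on the target wiki, that $apiUrl points at its api.php, and that you already have a CSRF token and are handling login cookies separately; those values are placeholders):

<?php
// Sketch: recreate the lexeme via the Wikibase API.
// $apiUrl and $csrfToken are placeholders; authentication is left out.
$apiUrl = 'http://localhost/w/api.php';
$csrfToken = 'REPLACE_WITH_CSRF_TOKEN';

$lexeme = [
    'type' => 'lexeme',
    'lemmas' => ['mis' => ['language' => 'mis', 'value' => 'narạ̄ṭu']],
    'lexicalCategory' => 'Q1',
    'language' => 'Q2',
    'claims' => new stdClass(), // encode as {} to match the payload above
    'forms' => [],
    'senses' => [],
];

$ch = curl_init($apiUrl);
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS => http_build_query([
        'action' => 'wbeditentity',
        'new'    => 'lexeme',
        'data'   => json_encode($lexeme),
        'token'  => $csrfToken,
        'format' => 'json',
    ]),
]);
echo curl_exec($ch);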
> Trying to upload the following lexeme to my local wiki (using the API sandbox) resulted in the one-off log shown above
Did this also cause a CPU spike in the query service?
This is now done, since we seem to have a functional QS for most wikis. T361436 is where work will continue on investigating the unusual behaviour of this specific wikibase.
FWIW, in Telegram we have had a report that federated queries to Wikidata are no longer working.
I suspect it is a case similar to Suite, as mentioned in T355893#9660154: specifically, whitelist.txt is being copied to the WDQS container, but the appropriate settings have not been added to use it rather than the new default name, allowlist.txt.
{F47006977}
@GreenReaper Thanks for the actionable report. I did indeed miss this change when updating the base image. A PR with a fix is here: https://github.com/wbstack/queryservice/pull/111
Thanks for the quick fix for federation; one of the users reporting the issue says it is all good now. If anyone else still has issues I'll suggest they file a separate issue. 👍