Page MenuHomePhabricator

Investigate & Fix holes for aliases in new term tables (take 3)
Closed, ResolvedPublic

Description

Take 3 as this is the third investigation of holes.

Holes identified in T219301#5844113

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_ lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
  IS NULL;
 ------ ------- ---------- ------ 
| id   | type  | language | text |
 ------ ------- ---------- ------ 
| 7850 | alias | NULL     | NULL |
| 7786 | alias | NULL     | NULL |
| 7884 | alias | NULL     | NULL |
| 7885 | alias | NULL     | NULL |
| 1416 | alias | NULL     | NULL |
|  932 | alias | NULL     | NULL |
|  846 | alias | NULL     | NULL |
| 2326 | alias | NULL     | NULL |
|  590 | alias | NULL     | NULL |
| 1566 | alias | NULL     | NULL |
|  356 | alias | NULL     | NULL |
 ------ ------- ---------- ------ 
11 rows in set (6.33 sec)

Checking this again today it appears there might still be one situation that is causing issue that we should investigate.

Acceptance criteria

  • Holes are less likely to happen (transactions)
  • Details of the split between transactions and new job etc are documented
  • "Restore" code after insertion is confirmed to do everything it needs to do
  • Deadlock situation is observed after deployment (hopefully it decreases) see https://logstash.wikimedia.org/goto/956ac31f9e47b1a4a579e4f39c300b2f (should comment on this ticket with what happens)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

So, this is a race condition following by edits such as these in quick succession:

OR

In words the pattern is:

  • Edit1 changing a label text A ->B
  • Edit1 main transaction commits
  • Edit1 acquires IDs for writing to new term store
  • Edit2 adding an alias A
  • Edit2 main transaction commits
  • Edit2 acquires IDs for writing to new term store
  • Edit1 write to term store commits
  • Edit2 write to term store commits

Edit 2 does not lock the rows when acquiring the ids that will be used for it insert.
Thus the previous cleanup can still get rid of them once the ids are known to the code that will perform the insert.
This results in Edit to believing they exist, and entering a row that ends up pointing to nowhere..

Things to think about:

  • Lock when selecting from master for id acquisition
  • Delay the cleanup a bit
Addshore triaged this task as High priority.Feb 3 2020, 1:33 PM

Interestingly this edit pattern would likely occur less if we were no longer using the "old" termbox, as the new termbox uses a single edit to perform this action.

I have a copy of some of the transactions that caused one if the issues on mwdebug1002 now:

addshore@mwmaint1002:~$ cat wbpt_property_id_356.sql

Will continue to look at this tomorrow.

Decoding what happens in the transactions...

For edits:
https://www.wikidata.org/w/index.php?diff=1108479589&oldid=1103620837&title=Property:P356
https://www.wikidata.org/w/index.php?diff=1108479601&oldid=1108479589&title=Property:P356

TRANSACTION 1 (as expected)

INSERT term_in_lang row for (label sr DOI), using pre existing text_in_lang
INSERT wbt_property_terms for term_in_lang (label sr DOI)
DELETE wbt_property_terms for (label sr ДОИ)
cleanTermInLangIds DELETE remove term_in_lang for (label sr ДОИ) 40383
cleanTermInLangIds DELETE remove text_in_lang for (sr ДОИ) 40159 <<<<<<<<<
cleanTextIds DELETE remove (ДОИ) 32573

TRANSACTION 2

INSERT IGNORE into text table (ДОИ идентификатор) new text entry (doesnt matter)
INSERT IGNORE 2x into text_in_lang ....
INSERT IGNORE 3x into term_in_lang for the 3 things being added as an alias
 - One of these is text_in_lang id 40159 (removed by above transaction) <<<<<<<<<
INSERT wbt_property_terms for term_in_lang x3
DELETE wbt_property_terms for x1 (not used removed)
No clean up, and no restoration <<<<<<<<<

Cleanup would select the term in lang ids that are inserted in this transaction (412045689, 412045690, 412045691)
That means that nothing will be re inserted.
text_in_lang tables and ids are never checked or attempted to be reinserted in this case.

Addshore renamed this task from Investigate holes for aliases in new property terms table (take 3) to Investigate & Fix holes for aliases in new property terms table (take 3).Feb 6 2020, 7:34 PM

Change 570391 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] wbterms: Split old & new term writing into different transactions

https://gerrit.wikimedia.org/r/570391

Change 570996 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] WIP DNM Seperate new term storage cleanup

https://gerrit.wikimedia.org/r/570996

mysql:[email protected] [wikidatawiki]> SELECT   * FROM wbt_property_terms LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text IS NULL;
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
| wbpt_id | wbpt_property_id | wbpt_term_in_lang_id | wbtl_id   | wbtl_type_id | wbtl_text_in_lang_id | wby_id | wby_name | wbxl_id | wbxl_language | wbxl_text_id | wbx_id | wbx_text |
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
|  332730 |             7850 |            411848038 | 411848038 |            3 |            407983555 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332732 |             7786 |            411848054 | 411848054 |            3 |            406931806 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332734 |             7884 |            411848132 | 411848132 |            3 |            409421152 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332736 |             7885 |            411848158 | 411848158 |            3 |            409421166 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332926 |              932 |            412045499 | 412045499 |            3 |                88258 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332933 |              846 |            412045592 | 412045592 |            3 |                82909 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332935 |             2326 |            412045606 | 412045606 |            3 |               168274 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332937 |              590 |            412045617 | 412045617 |            3 |                63613 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334566 |             6656 |            412937757 | 412937757 |            3 |               279086 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334720 |             1440 |            413008330 | 413008330 |            3 |               123055 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334722 |             1665 |            413010089 | 413010089 |            3 |               135377 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334883 |             3417 |            413099852 | 413099852 |            3 |               212096 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335048 |             4474 |            413348559 | 413348559 |            3 |               240023 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335149 |             1017 |            413511952 | 413511952 |            3 |            411294102 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335196 |             6365 |            413582295 | 413582295 |            3 |               274837 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
15 rows in set (19.96 sec)

So 4 more small holes in properties.

Properties have a couple more holes, but it looks like the exact same race condition that should be fixed with the split transactions, which is essentially acceptable as it is for now anyway.

I can also confirm that this same edit pattern causes holes in aliases for items.

While making sure Q0 to Q1million was totally populated I noticed that Q408335 appeared in the list of items to fill, when it did not appear in a previous list.
Edits happened just after running in the main bulk of holes:

https://www.wikidata.org/w/index.php?title=Q408335&diff=1117800458&oldid=1117792446
https://www.wikidata.org/w/index.php?diff=1117800465&oldid=1117800458&title=Q408335

The list of holes during this time can be seen raw at https://phabricator.wikimedia.org/P10446

Change 570391 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] wbterms: Split old & new term writing into different transactions

https://gerrit.wikimedia.org/r/570391

Addshore renamed this task from Investigate & Fix holes for aliases in new property terms table (take 3) to Investigate & Fix holes for aliases in new term tables (take 3).Feb 19 2020, 10:18 AM

When the transaction splitting and job gets deployed we will likely and should see a drop in deadlock issues.

We should be able to observe this @ https://logstash.wikimedia.org/goto/956ac31f9e47b1a4a579e4f39c300b2f

So, an update from T244115#5889790 which listed 15 item holes for aliases on the 17th Feb.

Running the query again now shows 21 holes, again all for aliases.

mysql:[email protected] [wikidatawiki]> SELECT   * FROM wbt_property_terms LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JO IN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text IS NULL;
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
| wbpt_id | wbpt_property_id | wbpt_term_in_lang_id | wbtl_id   | wbtl_type_id | wbtl_text_in_lang_id | wby_id | wby_name | wbxl_id | wbxl_language | wbxl_text_id | wbx_id | wbx_text |
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
|  332730 |             7850 |            411848038 | 411848038 |            3 |            407983555 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332732 |             7786 |            411848054 | 411848054 |            3 |            406931806 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332734 |             7884 |            411848132 | 411848132 |            3 |            409421152 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332736 |             7885 |            411848158 | 411848158 |            3 |            409421166 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332926 |              932 |            412045499 | 412045499 |            3 |                88258 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332933 |              846 |            412045592 | 412045592 |            3 |                82909 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  332937 |              590 |            412045617 | 412045617 |            3 |                63613 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334720 |             1440 |            413008330 | 413008330 |            3 |               123055 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334722 |             1665 |            413010089 | 413010089 |            3 |               135377 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  334883 |             3417 |            413099852 | 413099852 |            3 |               212096 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335048 |             4474 |            413348559 | 413348559 |            3 |               240023 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335149 |             1017 |            413511952 | 413511952 |            3 |            411294102 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335196 |             6365 |            413582295 | 413582295 |            3 |               274837 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335576 |              949 |            415036560 | 415036560 |            3 |               288784 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335589 |             2033 |            415066824 | 415066824 |            3 |               152756 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  335854 |              667 |            415923987 | 415923987 |            3 |                69305 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  336949 |              477 |            417006412 | 417006412 |            3 |                51848 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  337230 |             1755 |            417689972 | 417689972 |            3 |            415460553 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  337269 |              190 |            417808950 | 417808950 |            3 |                26783 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  337367 |             3916 |            418041210 | 418041210 |            3 |               225590 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
|  337371 |              508 |            418041355 | 418041355 |            3 |                54642 |      3 | alias    |    NULL | NULL          |         NULL |   NULL | NULL     |
 --------- ------------------ ---------------------- ----------- -------------- ---------------------- -------- ---------- --------- --------------- -------------- -------- ---------- 
21 rows in set (10.27 sec)

All patches in this ticket are still to be deployed or merged.

Change 575672 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Job queue integration tests for term store writers

https://gerrit.wikimedia.org/r/575672

Change 570996 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] wbterms: Split new term storage cleaning into own transaction

https://gerrit.wikimedia.org/r/570996

Change 575672 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Job queue integration tests for term store writers

https://gerrit.wikimedia.org/r/575672

Job queue still will be deployed with the train this week

Mentioned in SAL (#wikimedia-operations) [2020-03-04T16:02:04Z] <addshore> addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildPropertyTerms.php --wiki=wikidatawiki --sleep 1 --batch-size=50 # T244115

The jobs can be seen running @ https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&from=now-3h&to=now&var-site=eqiad&var-type=CleanTermsIfUnused

image.png (635×1 px, 111 KB)

In order to see if we hit this race condition less I'll rebuild the properties one more time...

State before rebuild:

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
 IS NULL;
 ------ ------- ---------- ------ 
| id   | type  | language | text |
 ------ ------- ---------- ------ 
| 7850 | alias | NULL     | NULL |
| 7786 | alias | NULL     | NULL |
| 7884 | alias | NULL     | NULL |
| 7885 | alias | NULL     | NULL |
|  590 | alias | NULL     | NULL |
| 1440 | alias | NULL     | NULL |
| 1665 | alias | NULL     | NULL |
| 3417 | alias | NULL     | NULL |
| 4474 | alias | NULL     | NULL |
| 6365 | alias | NULL     | NULL |
| 2033 | alias | NULL     | NULL |
|  667 | alias | NULL     | NULL |
|  477 | alias | NULL     | NULL |
| 1755 | alias | NULL     | NULL |
|  190 | alias | NULL     | NULL |
| 3916 | alias | NULL     | NULL |
|  508 | alias | NULL     | NULL |
| 2427 | alias | NULL     | NULL |
| 1225 | alias | NULL     | NULL |
| 4223 | alias | NULL     | NULL |
 ------ ------- ---------- ------ 
20 rows in set (5.93 sec)

State after:

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
 IS NULL;
Empty set (1.09 sec)

Now to see if and how many of these holes re appear.

assigning to myself for the verification

Change 576883 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576883

Change 576963 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.22] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576963

Change 576883 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576883

After 1 day

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
    ->  IS NULL;
Empty set (7.52 sec)

Change 576963 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.35.0-wmf.22] Schedule 1 CleanTermsIfUnusedJob per ID to clean

https://gerrit.wikimedia.org/r/576963

Mentioned in SAL (#wikimedia-operations) [2020-03-05T11:29:29Z] <ladsgroup@deploy1001> Synchronized php-1.35.0-wmf.22/extensions/Wikibase: [[gerrit:576963|Schedule 1 CleanTermsIfUnusedJob per ID to clean (T244115 T246898)]] (duration: 01m 08s)

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_ lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
   IS NULL;
Empty set (6.98 sec)

Despite reducing this race condition, it looks like it still occurs in day to day operation.

mysql:[email protected] [wikidatawiki]> SELECT   wbpt_property_id as id,   wby_name as type,   wbxl_language as language,   wbx_text as text FROM wbt_property_terms LEFT JOIN wbt_term_in_ lang ON wbpt_term_in_lang_id = wbtl_id LEFT JOIN wbt_type ON wbtl_type_id = wby_id LEFT JOIN wbt_text_in_lang ON wbtl_text_in_lang_id = wbxl_id LEFT JOIN wbt_text ON wbxl_text_id = wbx_id WHERE wbx_text
   IS NULL;
 ------ ------- ---------- ------ 
| id   | type  | language | text |
 ------ ------- ---------- ------ 
| 2163 | alias | NULL     | NULL |
| 3142 | alias | NULL     | NULL |
| 3204 | alias | NULL     | NULL |
| 1630 | alias | NULL     | NULL |
 ------ ------- ---------- ------ 
4 rows in set (6.78 sec)

Caused by the same edit pattern, example:
https://www.wikidata.org/w/index.php?diff=1132754150&oldid=1125355485&title=Property:P2163
https://www.wikidata.org/w/index.php?diff=1132754151&oldid=1132754150&title=Property:P2163

This could now actually be very related to what we are seeing in T246898

! In T246898#8638287, @Addshore wrote:
This is probably the same deadlock case as we had for T151789 with the old storage.
data updates getting in the way of each other when multiple edits happen to the same item in quick succession

Addshore claimed this task.

Going to mark this as resolved for now as the investigation and initial fixing of holes was done.
The issue is now much smaller and followups already exist that would totally eliminate it.
Until then the store will slowly keep itself in check