- Training models
- Chavacano de Zamboanga Wikipedia cbk-zam
- Min Dong Chinese Wikipedia cdo
- Chechen Wikipedia ce
- Cebuano Wikipedia ceb
[x] Chamorro Wikipedia ch
- Cherokee Wikipedia chr
- Cheyenne Wikipedia chy
- Central Kurdish Wikipedia ckb
- Corsican Wikipedia co
[x] Cree Wikipedia cr
- Crimean Tatar Wikipedia crh
- Kashubian Wikipedia csb
- Church Slavic Wikipedia cu
- Chuvash Wikipedia cv
- Welsh Wikipedia cy
- Italian Wikipedia it
- Models verification
- Publish Datasets
- Populate the excluded section titles (see instructions)
- Deploy back-end
- Check how the model works on the wikis
- In Search, use hasrecommendation:link to find articles
- Test them on https://api.wikimedia.org/service/linkrecommendation/apidocs/#/default/get_v1_linkrecommendations__project___domain___page_title_
- Inform communities
- Deploy front-end - March 15
Description
Details
- Due Date
- Mar 15 2023, 5:00 PM
Status | Assigned | Task |
---|---|---|
Open | lbowmaker | T307881 Scaling of link suggestions service |
Open | Trizek-WMF | T304110 [EPIC] Deploy "add a link" to all Wikipedias |
Resolved | Sgs | T304550 Deploy "add a link" to 6th round of wikis |
Event Timeline
Generating datasets and training models for the 16 wikis in this round went well.
Now working on model evaluation.
Model evaluation has been completed and below are the backtesting results:
Wiki | Precision | Recall |
---|---|---|
cbk_zamwiki | 0.90 | 0.65 |
cdowiki | 0.93 | 0.47 |
cewiki | 0.99 | 0.45 |
cebwiki | 1.00 | 0.96 |
chwiki | 0.98 | 0.90 |
chrwiki | 0.86 | 0.66 |
chywiki | 0.97 | 0.57 |
ckbwiki | 0.79 | 0.36 |
cowiki | 0.81 | 0.52 |
crwiki | 1.00 | 0.65 |
crhwiki | 0.72 | 0.27 |
csbwiki | 0.94 | 0.79 |
cuwiki | 0.72 | 0.48 |
cvwiki | 0.84 | 0.41 |
cywiki | 0.93 | 0.70 |
itwiki | 0.84 | 0.46 |
CCing @MGerlach, in case he would like to add comments on the backtesting evaluation.
The conclusion on the backtesting results is that most of the languages look fine besides:
- crhwiki and cuwiki, whose precision (0.72) is 0.03 below the recommended threshold of 0.75.
Talked to @MGerlach about crhwiki and cuwiki and he said:
The crhwiki and cuwiki are just below the 0.75 threshold. Thus, I think it could still be fine to deploy. In fact, I think we did deploy for some wikis with a similar threshold (I think it was bnwiki). I would call it out explicitly that it is at the edge but wouldn't necessarily recommend against it.
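The threshold check discussed above can be reproduced with a few lines of Python. This is only a sketch: the numbers are copied from the backtesting table in this task, the 0.75 threshold is the one quoted in the comment above, and the names `BACKTESTING` and `below_threshold` are made up for illustration and are not part of mwaddlink.

```python
# Backtesting results copied from the table in this task: wiki -> (precision, recall).
BACKTESTING = {
    "cbk_zamwiki": (0.90, 0.65),
    "cdowiki": (0.93, 0.47),
    "cewiki": (0.99, 0.45),
    "cebwiki": (1.00, 0.96),
    "chwiki": (0.98, 0.90),
    "chrwiki": (0.86, 0.66),
    "chywiki": (0.97, 0.57),
    "ckbwiki": (0.79, 0.36),
    "cowiki": (0.81, 0.52),
    "crwiki": (1.00, 0.65),
    "crhwiki": (0.72, 0.27),
    "csbwiki": (0.94, 0.79),
    "cuwiki": (0.72, 0.48),
    "cvwiki": (0.84, 0.41),
    "cywiki": (0.93, 0.70),
    "itwiki": (0.84, 0.46),
}

# Recommended precision threshold quoted in the discussion above.
PRECISION_THRESHOLD = 0.75


def below_threshold(results, threshold=PRECISION_THRESHOLD):
    """Return {wiki: shortfall} for wikis whose precision is under the threshold."""
    return {
        wiki: round(threshold - precision, 2)
        for wiki, (precision, _recall) in results.items()
        if precision < threshold
    }


if __name__ == "__main__":
    for wiki, gap in sorted(below_threshold(BACKTESTING).items()):
        print(f"{wiki}: precision is {gap:.2f} below {PRECISION_THRESHOLD}")
```

With the table as given, only crhwiki and cuwiki are flagged, which matches the conclusion above.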
@kostajh, we completed training models for the sixth round of wikis (listed in the task description) and shared the models' evaluation above. All wikis in this round passed the model evaluation, so we are now ready to publish their datasets. Should we proceed?
@kostajh, thank you for the confirmation. We have published the datasets for all 16 wikis.
@kevinbazira it looks like Chechen wiki (cewiki) didn't get into the list of wikis in wikis.txt for some reason, and so the data was not imported by the link recommendation application. Could you please re-run the pipeline for that wiki? All the other ones appear to be loaded correctly.
Thank you for letting me know @kostajh. I have started re-running the pipeline for Chechen Wikipedia - cewiki.
Will let you know when it has completed.
@kostajh, the cewiki pipeline has completed running successfully and I have published the datasets.
I ran this script for adding the link-recommendation task type and populating the excluded sections:
    PHAB=T304550
    for WIKI in cbk_zamwiki cdowiki cewiki cebwiki chwiki chrwiki chywiki ckbwiki cowiki crwiki crhwiki csbwiki cuwiki cvwiki cywiki itwiki; do
        ORIGIN=`mwscript getConfiguration.php $WIKI --settings 'wgCanonicalServer' --format json | jq --raw-output '.wgCanonicalServer'`
        mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI \
            --page MediaWiki:NewcomerTasks.json \
            --create-only \
            --json \
            --summary "Growth features configuration boilerplate ([[phab:$PHAB]])" \
            link-recommendation \
            '{ "type": "link-recommendation", "group": "easy" }'
        jq "select(.wiki==\"$WIKI\" and .probability > 0.25) | .section" wiki_sections.jsonl \
            | jq --slurp --compact-output "unique" \
            | mwscript extensions/GrowthExperiments/maintenance/changeWikiConfig.php $WIKI \
                --page MediaWiki:NewcomerTasks.json \
                --json \
                --summary "machine-generated configuration for excluding sections from link recommendations ([[phab:$PHAB]]), feel free to improve" \
                link-recommendation.excludedSections \
                "`cat`"
        echo "$ORIGIN/wiki/MediaWiki:NewcomerTasks.json"
        echo "$ORIGIN/w/index.php?title=MediaWiki:NewcomerTasks.json&diff=next"
        echo "Press <Enter> to continue"
        read # give time for manual verification
    done
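For readers less familiar with jq, the section-selection step of the script (keep sections whose probability exceeds 0.25 for one wiki, then de-duplicate and sort them) can be sketched in Python. This assumes `wiki_sections.jsonl` holds one JSON object per line with `wiki`, `section` and `probability` fields, as the jq filter implies; the function name and the sample records are hypothetical.

```python
import json


def excluded_sections(jsonl_lines, wiki, min_probability=0.25):
    """Mirror the jq pipeline: select matching records, then return unique sorted sections."""
    sections = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        record = json.loads(line)
        if record["wiki"] == wiki and record["probability"] > min_probability:
            sections.add(record["section"])
    # jq's `unique` both de-duplicates and sorts, so do the same here.
    return sorted(sections)


if __name__ == "__main__":
    # Hypothetical sample records, not real data from wiki_sections.jsonl.
    sample = [
        '{"wiki": "chrwiki", "section": "references", "probability": 0.9}',
        '{"wiki": "chrwiki", "section": "history", "probability": 0.1}',
        '{"wiki": "itwiki", "section": "note", "probability": 0.8}',
    ]
    # Compact JSON, comparable to `jq --compact-output`.
    print(json.dumps(excluded_sections(sample, "chrwiki"), separators=(",", ":")))
```

A wiki with no record above the cutoff yields an empty list, which would be consistent with a wiki ending up without any excluded sections.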
I checked the configuration and it seemed to be correctly updated in all wikis. The only ones worth mentioning are csb and chy, which didn't get any excluded sections. Others, like chr, only got "references" as an excluded section.
Change 888664 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):
[operations/mediawiki-config@master] GrowthExperiments: Enable link recommendation for 6th round wikis
Change 888664 merged by jenkins-bot:
[operations/mediawiki-config@master] GrowthExperiments: Enable link recommendation for 6th round wikis
Mentioned in SAL (#wikimedia-operations) [2023-02-16T14:16:01Z] <taavi@deploy1002> Started scap: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]]
Mentioned in SAL (#wikimedia-operations) [2023-02-16T14:17:50Z] <taavi@deploy1002> taavi and sgimeno: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
Mentioned in SAL (#wikimedia-operations) [2023-02-16T14:25:24Z] <taavi@deploy1002> Finished scap: Backport for [[gerrit:888664|GrowthExperiments: Enable link recommendation for 6th round wikis (T304550)]] (duration: 09m 23s)
All models work fine except:
- cbk-zam: search returns add a link results, but the API returns "Unable to process request for wikipedia/cbk-zam"
- ce search returns "There were no results matching the query."
- I tested all articles on ch, as only two returned.
- cr search returns "There were no results matching the query."
@Sgs, how possible is it to fix these languages?
The idea is to deploy the front-end next week. If we can't fix some languages by then, we will move them to the next round.
I think cbk-zam is working fine, it's just the API expects the domain to be cbk_zam.
- ce search returns "There were no results matching the query."
I can't tell why yet, I will investigate further. Any ideas? cc @kevinbazira
- I tested all articles on ch, as only two returned.
ch has only ~540 pages, could this be a reason? cc @kevinbazira
- cr search returns "There were no results matching the query."
cr has only ~160 pages, could this be a reason? cc @kevinbazira
That sounds fine to me; ce, cr, ch still need further analysis
Ha, thank you for catching it! Inconsistency regarding naming norms got me.
Tested: it works.
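The naming mismatch above (the wiki is `cbk-zam`, the API expects `cbk_zam`) can be handled when building the test URL. The sketch below is an illustration only: the path layout is inferred from the apidocs link in the task description, replacing every hyphen with an underscore is an assumption based on this single case, and the helper names and the example page title are hypothetical.

```python
from urllib.parse import quote

# Base URL taken from the apidocs link in the task description.
API_BASE = "https://api.wikimedia.org/service/linkrecommendation"


def api_domain(language_code):
    """Map a wiki language code such as 'cbk-zam' to the form the API accepted ('cbk_zam')."""
    # Assumption: hyphens become underscores, as observed for cbk-zam in this task.
    return language_code.replace("-", "_")


def recommendation_url(language_code, page_title, project="wikipedia"):
    """Build a link-recommendation URL; the path layout is inferred, not confirmed."""
    title = quote(page_title.replace(" ", "_"), safe="")
    return f"{API_BASE}/v1/linkrecommendations/{project}/{api_domain(language_code)}/{title}"


if __name__ == "__main__":
    # Hypothetical page title, used only to show the resulting URL shape.
    print(recommendation_url("cbk-zam", "Ciudad de Zamboanga"))
```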
That sounds fine to me; ce, cr, ch still need further analysis
OK, I'm adding all languages except ce, cr, ch to Tech News.
Note: we can add the three excluded wikis to the list if we have a fix before Monday early afternoon UTC.
@Sgs, in case the issue is caused by the model, yes, the number of links in a wiki affects how the model performs.
Visiting https://ch.wikipedia.org/wiki/Special:Random and https://cr.wikipedia.org/wiki/Special:Random, my unscientific sample of 10 visits pulled up 10 one-sentence articles, all of which had a link. mwaddlink will struggle to provide links for one-sentence articles.
I see a different error: Unable to process request for wikipedia/ce which would indicate that the dataset is not loaded into the production database.
@kevinbazira it looks like cewiki never made it to the wikis.txt file at https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/wikis.txt, so the service didn't load its datasets. I'm not sure how that would have happened, because it should be part of the training pipeline.
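A quick way to catch this kind of omission is to compare the wikis in a round against the contents of wikis.txt. The following is a minimal sketch, assuming the file lists one wiki database name per line (the actual format is not shown in this task); the function name and the sample file contents are hypothetical.

```python
# Wikis of this round, as listed in the configuration script in this task.
ROUND_6_WIKIS = [
    "cbk_zamwiki", "cdowiki", "cewiki", "cebwiki", "chwiki", "chrwiki",
    "chywiki", "ckbwiki", "cowiki", "crwiki", "crhwiki", "csbwiki",
    "cuwiki", "cvwiki", "cywiki", "itwiki",
]


def missing_from_wikis_txt(expected_wikis, wikis_txt):
    """Return the expected wikis that do not appear in the text of wikis.txt."""
    # Assumption: one wiki database name per line; blank lines are ignored.
    listed = {line.strip() for line in wikis_txt.splitlines() if line.strip()}
    return [wiki for wiki in expected_wikis if wiki not in listed]


if __name__ == "__main__":
    # Hypothetical file contents that reproduce the situation described above.
    sample_txt = "\n".join(w for w in ROUND_6_WIKIS if w != "cewiki")
    print(missing_from_wikis_txt(ROUND_6_WIKIS, sample_txt))
```

Running a check like this right after publishing datasets would flag a missing wiki before the service attempts to load it.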
I guess we should skip the front-end deployment to these two wikis but we could leave the back-end enabled and check again in some time. Or would this be a waste of resources?
Thanks, per discussion in Slack I have updated wikis.txt and added cewiki cc @kevinbazira. The service no longer complains about the pair wikipedia/ce. I'll wait 24-48h and check whether any recommendations have been created. If there are, we can add cewiki back to the 6th round of deployment cc @Trizek-WMF
@kostajh and @Sgs, thank you for letting me know about cewiki missing in the wikis.txt. When we trained this wiki the process completed successfully and we published the datasets here:
https://analytics.wikimedia.org/published/datasets/one-off/research-mwaddlink/cewiki/
I am not sure why it would be the only one missing from wikis.txt, given that all wikis in this round followed the same pipeline. I am going to investigate this. If there are any further issues, please let us know and we will rerun the pipeline for this wiki.
Checked ce.wp, and it works. I added it to Tech News. We are now ready to proceed with the deployment on the scheduled date.
Change 899673 had a related patch set uploaded (by Sergio Gimeno; author: Sergio Gimeno):
[operations/mediawiki-config@master] GrowthExperiments: enable frontend of link recommendation for 6th round wikis
Change 899673 merged by jenkins-bot:
[operations/mediawiki-config@master] GrowthExperiments: enable frontend of link recommendation for 6th round wikis
Mentioned in SAL (#wikimedia-operations) [2023-03-15T20:13:23Z] <samtar@deploy2002> Started scap: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]]
Mentioned in SAL (#wikimedia-operations) [2023-03-15T20:14:55Z] <samtar@deploy2002> sgimeno and samtar: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet
Mentioned in SAL (#wikimedia-operations) [2023-03-15T20:23:36Z] <samtar@deploy2002> Finished scap: Backport for [[gerrit:899673|GrowthExperiments: enable frontend of link recommendation for 6th round wikis (T304550)]], [[gerrit:892363|GrowthExperiments: Enable backend of link recommendation for 7, 8, 9th round wikis (T304551 T308133 T308134)]] (duration: 10m 12s)
Checked some wikis from the list - ce.wp, it.wp, cy.wp, chy.wp. Generally works as expected - no new issues; logstash has recorded only one error for it.wp (Link suggestion not found...). No logstash errors for other wikis.