
Run the maintenance script cleanupTitles.php on all wikis to rescue currently-inaccessible pages
Closed, Resolved · Public

Description

MediaWiki 1.32.0-wmf.5 includes two changes which disallow some new characters in page titles: soft hyphens and directional formatting characters.

Existing pages with titles containing them will become inaccessible (except by page ID, e.g. https://en.wikipedia.org/?curid=8091). I don't know how many pages with directional formatting characters are out there, but for soft hyphens there were 2322 when I last checked (T121979#3923914).

2024 update: Those changes that motivated this task were reverted, but there are still a few hundred inaccessible pages in various projects, caused by other old bug fixes, software updates and configuration changes.

Someone needs to run cleanupTitles.php to fix them. But please do a dry run first (php maintenance/cleanupTitles.php --dry-run) and post the results here. As far as I know the script is not run regularly, so there may already be other broken titles in the database (I think it's likely some broke due to Unicode normalization changes after the recent ICU upgrade). It would be good to know what we're dealing with first.
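For reference, a dry run across all wikis could look something like this (a sketch only, assuming the foreachwiki wrapper available on the WMF maintenance hosts; the log path is just a placeholder):

foreachwiki maintenance/cleanupTitles.php --dry-run 2>&1 | tee ~/cleanupTitles-dry-run.log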

The script moves inaccessible pages to titles with the invalid characters removed, or, if that is impossible (such a page already exists, or the title is still invalid), to page titles beginning with "Broken/". [It doesn't properly move them (with log entries etc.); it just updates some database fields, so there will be no on-wiki record of this having been done.] We should advise users to move them to proper titles afterwards (maybe in Tech News or a separate announcement).

Related Objects

Event Timeline


Hello @Pppery, this is still not ready to be announced, right?

Indeed, it's still not ready.

Indeed, it's still not ready.

Okay, thanks!

Based on the dry run we did in T196088, the following numbers of pages are affected on each wiki. It shouldn't be a great burden on the users to correct them, even if all of them required manual fixes (except for hewikisource, which is a unique situation: T314733). I suppose I could schedule it for some time next week, unless anyone has objections.

wiki | number of pages
acewiki | 1
afwiki | 5
alswiki | 1
angwiki | 7
arbcom_ruwiki | 1
arwiki | 32
arwikiversity | 5
arywiki | 3
arzwiki | 1
avwiki | 3
awawiki | 1
azwikibooks | 46
azwikisource | 35
bdwikimedia | 4
be_x_oldwiki | 1
betawikiversity | 1
bewiki | 1
bgwiki | 6
bswiki | 9
cawiki | 2
chrwiki | 1
collabwiki | 1
commonswiki | 3
csbwiktionary | 2
cswiki | 9
cywiki | 3
dewiki | 1
diqwiki | 4
dvwiki | 1
elwiki | 11
elwikisource | 3
enwiki | 15
enwikinews | 1
eswiki | 3
eswikisource | 3
euwiki | 15
fowiki | 1
frrwiki | 1
frwiki | 49
frwikiversity | 1
fywiki | 22
fywiktionary | 1
guwikiquote | 4
hawiki | 1
hewiki | 1
hewikinews | 1
hewikisource | 42960
hiwiki | 5
hiwikibooks | 1
hrwiki | 3
huwiktionary | 2
idwiki | 4
inhwiki | 14
itwikinews | 2
jawiki | 31
jawikibooks | 1
jawikisource | 4
jbowiki | 1
jvwiki | 4
kabwiki | 3
kawiki | 2
knwikiquote | 1
knwiktionary | 1
kowiki | 1
kowiktionary | 2
kuwiki | 1
kywiki | 2
labswiki | 24
labtestwiki | 94
metawiki | 24
mswiki | 2
mywiki | 5
newiki | 5
newiktionary | 1
niawiki | 5
nlwiki | 5
nrmwiki | 1
pnbwiki | 1
pswiki | 8
ptwiki | 13
ptwikibooks | 1
ptwikinews | 3
ptwikiversity | 19
quwiki | 49
rmwiki | 1
roa_tarawiki | 2
rowiki | 2
ruwiki | 2
ruwikiversity | 23
ruwiktionary | 1
satwiki | 1
sawikisource | 22
sdwiki | 48
shiwiki | 1
shwiki | 11
siwiki | 3
specieswiki | 1
sqwikibooks | 1
srwiki | 2
srwikinews | 1
svwiki | 1
svwiktionary | 5
tawiki | 1
tawikisource | 2
test2wiki | 42
testcommonswiki | 3
testwiki | 4
tewiki | 8
tewikibooks | 1
tgwiki | 1
tpiwiki | 1
trwiki | 3
trwikisource | 4
trwiktionary | 1
ukwiki | 1
urwiktionary | 1
uzwiki | 2
viwiki | 52
viwiktionary | 12
warwiki | 4
wawiktionary | 3
yowiki | 1
zghwiki | 3
zh_classicalwiki | 6
zh_min_nanwiki | 1
zhwiki | 2
zhwikiversity | 2
zhwiktionary | 5
zuwiki | 3

(Maybe better to do it in the week after next, to give Tech News translators time. It's not urgent.)

Yeah, fine with that. That move action was just a signal to the Tech News team that this should now be ready.

On timing, I think it's best for the tech news entry to go out just after the script is run, so wikis that see it can immediately look at their list of prefixed titles.

I added the note to next week's Tech News: https://meta.wikimedia.org/wiki/Tech/News/2024/34 (please edit) and scheduled the maintenance script for Monday afternoon: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240819T1300, which should be about the same time as the newsletter goes out (the script will take a few hours to run across all wikis).

We'll handle hewikisource separately this week, see T314733.

Re: Tech News - thanks for the addition!
In the entry-text, I wonder if we can clarify what "deal with them appropriately" means?

E.g. If it's simple, we could clarify in-line.
If it's more complicated, then perhaps we should link that string of text to this Phab-task, and add some details and/or examples to the top of the Description here, to provide guidance for any confused editors?
[I'd guess some details will become clear once we can see the actual page-content within those pages?!?]

I.e. I've looked at the old test-run .log output above (T195546#4234515), and it does not seem immediately obvious to me how to fix/resolve the entries that I expect will appear in the onwiki listings...

For example, one of the entries is:

commonswiki: DRY RUN: would rename 998731 (3,'­Hephaestos') to (3,'Hephaestos')

and if I run those 2 strings through this tool - https://www.babelstone.co.uk/Unicode/whatisit.html - I learn that the first '­Hephaestos' has a hidden "SOFT HYPHEN" character before the H.
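For anyone who prefers to check locally rather than via that website, a hex dump shows the same thing (a sketch assuming a bash whose printf understands \u escapes; the soft hyphen is U+00AD, i.e. the UTF-8 bytes c2 ad):

printf '\u00adHephaestos' | hexdump -C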
Checking those titles directly, I see

But as an editor who sees T195546/Hephaestos in the prefix-listing, what am I expected to do with that page to "deal with it"? Is it usually going to be... just move the page to the appropriate Namespace:Title after checking the content is suitable/not-just-vandalism?

Yes, or often delete it.

The Commons Hephaestos won't be touched because the soft-hyphen-stripping code was removed (this is now only addressing older cleanups).

Here's what I will do with all of the cases on enwiki when this is done:

enwiki:  DRY RUN: would rename 3208957 (0,'Ϊ́') to (0,'Broken/Ϊ́')

Delete entirely. These are two different redirects from different forms of the same Greek character; I guess Unicode normalization rules changed at some point. The redirect https://en.wikipedia.org/w/index.php?title=Ϊ́&redirect=no should be retargeted to the section, I guess, but that's not important.

enwiki:  DRY RUN: would rename 11429643 (3,'195.175.037.008') to (0,'Broken/User_talk:195.175.37.8')

Copy-paste the {{Blocked proxy}} template from 2007 that's hiding in the history into the history of the new page so it's there.

enwiki:  DRY RUN: would rename 11429792 (3,'195.175.037.8') to (0,'Broken/User_talk:195.175.37.8')

Ditto (this will actually appear as "T195546/id:11429792" since the previous page claimed the base broken title).

enwiki:  DRY RUN: would rename 11429818 (3,'195.175.037.08') to (0,'Broken/User_talk:195.175.37.8')

Ditto (this will actually appear as "T195546/id:11429818")

enwiki:  DRY RUN: would rename 11429822 (3,'195.175.037.6') to (0,'Broken/User_talk:195.175.37.6')

Ditto

enwiki:  DRY RUN: would rename 11927597 (3,'203.160.001.146') to (0,'Broken/User_talk:203.160.1.146')

Ditto

enwiki:  DRY RUN: would rename 11927623 (3,'61.220.150.002') to (0,'Broken/User_talk:61.220.150.2')

Ditto

enwiki:  DRY RUN: would rename 21375485 (0,'Ϋ́') to (0,'Broken/Ϋ́')

This is effectively the same as the first instance, and it looks like the broken page can be deleted.


That Commons page, which is from an old log so won't be affected, is a double mess. There are two distinct user accounts: one is "<soft hyphen>Hephaestos" and the other is "Hephaestos". If T121979 happens, one account will need to be renamed, which will move the page, so the community won't need to care.

There are a few user talk pages for named users in the cleanupTitles output. I'm not sure if something needs to be done with them (are there invalid user accounts backing them? If so, the users need to be renamed too).

I wonder if we should, as a separate task, run checkUsernames.php and clean those up. There doesn't seem to be a maintenance script to normalize usernames, but there might be few enough that it could be done manually.
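For what it's worth, a minimal sketch of that check, using the mwscript wrapper seen elsewhere in this task (checkUsernames.php only reports names that fail validation; it does not rename or fix anything):

mwscript maintenance/checkUsernames.php --wiki=arwiki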

Perhaps we could clarify the Tech News entry by changing the current:

A maintenance script was run to clean up unreachable pages, moving them to Special:PrefixIndex/T195546/. Your community should check if any pages exist there, and deal with them appropriately. [1]

to something like…

A maintenance script was run to clean up unreachable pages (due to Unicode issues), moving them to Special:PrefixIndex/T195546/. Your community should check if any pages exist there, and deal with them appropriately, such as by moving them to the appropriate Title, or by deleting them. [1]

Confirmation, or alternative/clearer suggestions, would be appreciated!
And/or edits to the Description here, if that is likely to be helpful. Thanks again.

User accounts with invalid user/talk pages:

jawiki:  page 2803577 (利用者:ϳοτ) doesn't match self.
jawiki:  DRY RUN: would rename 2803577 (2,'ϳοτ') to (0,'Broken/User:Ϳοτ')
jawiki:  page 2803580 (利用者‐会話:ϳοτ) doesn't match self.
jawiki:  DRY RUN: would rename 2803580 (3,'ϳοτ') to (3,'Ϳοτ')

(two separate accounts that have been blocked for ages, I don't care)

arwiki:  page 2163884 (نقاش_المستخدم:إياد_عبد_الصمد_ﻣﻬﺪﻱ) doesn't match self.
arwiki:  DRY RUN: would rename 2163884 (3,'إياد_عبد_الصمد_ﻣﻬﺪﻱ') to (3,'إياد_عبد_الصمد_مهدي')
arwiki:  page 2334883 (نقاش_المستخدم:ﻫﺎﻧﻰ_خيرى) doesn't match self.
arwiki:  DRY RUN: would rename 2334883 (3,'ﻫﺎﻧﻰ_خيرى') to (3,'هانى_خيرى')
...

and a bunch of other arwiki usernames. None of these pages contain anything other than a form-letter welcome message, so I'm inclined to not care what happens to them.

There are also a lot of invalid user talk pages with no corresponding account. These can just go with the flow.

There's also this scary-looking but ultimately harmless ptwikiversity cluster:

ptwikiversity:  page 12186 (Usuário(a):Sumone10154/common.js) doesn't match self.
ptwikiversity:  DRY RUN: would rename 12186 (0,'Usuário(a):Sumone10154/common.js') to (2,'Sumone10154/common.js')
ptwikiversity:  page 12187 (Usuário(a):Meisam/common.js) doesn't match self.
ptwikiversity:  DRY RUN: would rename 12187 (0,'Usuário(a):Meisam/common.js') to (2,'Meisam/common.js')
ptwikiversity:  page 12194 (Usuário(a):Techman224/common.js) doesn't match self.
ptwikiversity:  DRY RUN: would rename 12194 (0,'Usuário(a):Techman224/common.js') to (2,'Techman224/common.js')
....

None of the pages are in the JavaScript content model, so they won't be executed, and all they do is load the user's global.js, which would load anyway.

Question: which prefix should we search for? From the Tech News announcement it should be T195546, but the logs above suggest Broken.

Nothing yet, since no scripts have been run. When it actually happens, it will go out in Tech News with the correct prefix (which will probably be T195546).

Do we have a fresh full dry run result? (I see excerpts and statistics, but no full result except for the 2018 one.) That would allow people to rename pages in advance of the script run (without leaving redirects!) if they want, making the page history easier to understand and saving unnecessary renames.

But that's not quite fresh, because it's missing any changes caused by https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1058215

[...] That would allow people to rename pages in advance of the script run (without leaving redirects!) if they want, making the page history easier to understand and saving unnecessary renames.

But can we even rename them on-wiki? Won't MediaWiki attempt to turn the invalid name into an internal link in the history, producing at best a red link if we are not leaving redirects?

I think you can use the API to rename the page on-wiki because it checks only the page ID. I don't get why you would do that rather than just running the script, though.
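Purely as an illustration of that (not something that was actually done here): action=move accepts a fromid parameter, so the broken source title never has to be typed. A sketch using the enwiki page ID from the dry-run output above; $CSRF_TOKEN is a placeholder for a token obtained separately:

curl -s 'https://en.wikipedia.org/w/api.php' \
  --data-urlencode 'action=move' \
  --data-urlencode 'fromid=11429643' \
  --data-urlencode 'to=User talk:195.175.37.8' \
  --data-urlencode 'reason=Fixing invalid title (T195546)' \
  --data-urlencode 'noredirect=1' \
  --data-urlencode 'format=json' \
  --data-urlencode "token=$CSRF_TOKEN"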

Alright. So it looks like the last problem is the faulty display of these titles in a page's revision history. It's not a big problem, but perhaps the script could also correct the comments on the revisions?

Edit: never mind, the script updates the database tables directly, so there is no "real" move in the history.

Thanks!

I think you can use the API to rename the page on-wiki because it checks only the page ID.

That’s exactly what I had in mind.

I don't get why you would do that rather than just running the script, though.

Because it results in an easier-to-understand history: instead of seeing that the page was renamed from Broken/Talk:WP:FOO to Wikipedia talk:FOO (this example assumes that 48e90e8ec547 hadn't happened, only because it's easier to understand), the user looking at the page history sees that the page was renamed from Talk:WP:FOO to Wikipedia talk:FOO, which is closer to reality.

However, in cases where the script can find a sensible new name for the page, it’s probably best to let the script do its job.

Because it results in an easier-to-understand history: instead of seeing that the page was renamed from Broken/Talk:WP:FOO to Wikipedia talk:FOO (this example assumes that 48e90e8ec547 hadn't happened, only because it's easier to understand), the user looking at the page history sees that the page was renamed from Talk:WP:FOO to Wikipedia talk:FOO, which is closer to reality.

In other cases, the history you get might be less easy to understand – e.g. if it showed that [[WP:FOO]] was moved to [[Wikipedia:FOO]], and these are both links to the same page, I would definitely find that confusing and have nothing to go on to explain the change. If it instead shows that [[T195546/WP:FOO]] was moved (we'll use T195546/ as a prefix rather than Broken/), that's confusing at first, but I can look up what T195546 is using my favorite search engine and eventually learn what happened.

Also, I'm not sure how much I trust the page move API to work well when moving pages by ID from an invalid title… I wouldn't want the invalid titles to be replicated in log entries or something, and necessitate another maintenance script to clean up those.

In other cases, the history you get might be less easy to understand – e.g. if it showed that [[WP:FOO]] was moved to [[Wikipedia:FOO]], and these are both links to the same page, I would definitely find that confusing and have nothing to go on to explain the change.

You would have something to go on: the explanation and links in the log entry comment. (Of course, assuming that the comment is properly filled in, but I hope we can assume this.) And that comment can link directly here, sparing the user a trip to DuckDuckGo (DDG specifically, since Google doesn't find it, at least not for me).

Also, I'm not sure how much I trust the page move API to work well when moving pages by ID from an invalid title… I wouldn't want the invalid titles to be replicated in log entries or something, and necessitate another maintenance script to clean up those.

I think it should be fine for titles that are simply no longer normalized (like WP:FOO or ΐ). I don’t know how badly broken it is for titles that are actually invalid (Talk:WP:FOO, containing soft hyphen etc.), but if it’s very badly broken, that’s a bug, since those titles may already exist in the database, in years-old log entries.


Given that the script moves don’t have any on-wiki traces (especially in the fortunate cases that don’t end up in Special:PrefixIndex/T195546/), I promised in Tech News that the log will have been posted here. Could you please make me not lie? Thanks in advance!

Mentioned in SAL (#wikimedia-operations) [2024-08-19T13:02:07Z] <Lucas_WMDE> START lucaswerkmeister-wmde@mwmaint1002:~$ foreachwiki maintenance/cleanupTitles.php --prefix=T195546 --reporting-interval=1000000000 2>&1 | tee ~/T195546.log

Mentioned in SAL (#wikimedia-operations) [2024-08-19T13:34:18Z] <Lucas_WMDE> UTC afternoon backport config window done (except for the T195546 maintenance script which is expected to keep running for a few more hours, currently at commonswiki)

Mentioned in SAL (#wikimedia-operations) [2024-08-19T18:12:13Z] <Lucas_WMDE> FINISHED lucaswerkmeister-wmde@mwmaint1002:~$ foreachwiki maintenance/cleanupTitles.php --prefix=T195546 --reporting-interval=1000000000 2>&1 | tee ~/T195546.log

And for convenience, just the interesting parts of it, i.e. only output from wikis where pages were actually updated (3k lines instead of 8k): P67388

Generated with:

# Build an alternation of wiki names whose "Finished page" line reports a nonzero
# update count, then keep only the log lines from those wikis:
grep -E "$(grep -F 'Finished page' T195546.log | grep -v ' 0 of ' | cut -d: -f1 | tr '\n' '|' | sed 's/|$//')" T195546.log

There seems to be a bug in the script which makes it log moving the page to the wrong namespace in some circumstances. It still moves the page to the correct namespace. This wasn't caught in the dry run because the dry run uses different logging code.

Change #1063862 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/core@master] CleanupTitles: log the correct namespace when not in dry run mode

https://gerrit.wikimedia.org/r/1063862

Change #1063862 merged by jenkins-bot:

[mediawiki/core@master] CleanupTitles: log the correct namespace when not in dry run mode

https://gerrit.wikimedia.org/r/1063862

So, is there a way to get a correct log?

No.

But see Tasciaspi's comment on the paste:

However, you don’t need to know the namespace to access the page: you can use the curid index.php parameter instead, passing the very first number in the given line to it. For example, if the line you’re interested in is

angwiki:  renaming 8321 (1,'Ƿicipǣdia:Ȝemǣnscipe_Ingang') to (1,'Ȝemǣnscipe_Ingang')

you can access the concerned page at https://ang.wikipedia.org/w/index.php?curid=8321 – MediaWiki will figure out what the correct namespace is.
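A hypothetical one-liner (not from the task) to turn one wiki's "renaming" lines from the posted log into such curid links, assuming the log format shown above:

grep '^angwiki: *renaming' T195546.log | awk '{print "https://ang.wikipedia.org/w/index.php?curid=" $3}'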

I spotted that there are a handful of svwiktionary entries which were moved because they start with "f:", which is now an interwiki prefix. Is there a workaround I could suggest to the community involving e.g. DISPLAYTITLE? Being a Wiktionary, it doesn't really make sense to have the title be anything other than the lemma (which in these cases really starts with "f:").

Is there a workaround I could suggest to the community involving e.g. DISPLAYTITLE?

DISPLAYTITLE won't work: it expects a value that, when put in the search box, a link etc., brings you to the current page – however, f:s brings you to Wikifunctions, so it won't be allowed, whatever the actual title is. On enwiktionary, they work around this by prefixing the title with Unsupported titles/: for example, f:s would be at https://en.wiktionary.org/wiki/Unsupported_titles/f:s. Another option would be using a different colon character than the plain :, but I don't know whether that's typographically acceptable.

For the future, I think it probably would have been smarter to move these pages to [[Project:Task/formerly broken title]] rather than just moving them into the mainspace, given that almost no one likes things in the mainspace with obviously bad names. :) (Ouch at hewikisource's count.)

On the other hand, putting them in the mainspace forces that "almost everyone" to take action rather than leaving the pages to languish forever.

"Almost everyone" is usually not a realistic expectation for small wikis.

My "almost everyone" was the inverse of your "almost no one", but I guess for small wikis both groups are empty so you kind of have a point there. Anyway I've done https://meta.wikimedia.org/w/index.php?title=Global_sysops/Requests&diff=prev&oldid=27318439 to tell the global sysops to clean up on wikis with no active administration.

No.

But see Tasciaspi's comment on the paste:

However, you don’t need to know the namespace to access the page: you can use the curid index.php parameter instead, passing the very first number in the given line to it. For example, if the line you’re interested in is

angwiki:  renaming 8321 (1,'Ƿicipǣdia:Ȝemǣnscipe_Ingang') to (1,'Ȝemǣnscipe_Ingang')

you can access the concerned page at https://ang.wikipedia.org/w/index.php?curid=8321 – MediaWiki will figure out what the correct namespace is.

Tried this, does not work.

I see. I still don't know how I could see that.

Post-closure note: there are a lot of cases where the malformed title was a redirect pointing to the same place as the title it conflicts with. It might have made sense for the script to delete these, instead of forcing the community to delete them themselves.

I just noticed when looking through the log file that it crashed on ptwiki, which thus still has some invalid titles. Anyone want to rerun the maintenance script on that wiki?

Change #1069697 had a related patch set uploaded (by Pppery; author: Pppery):

[mediawiki/core@master] CleanupTitles: Check if title exists from primary database

https://gerrit.wikimedia.org/r/1069697

Mentioned in SAL (#wikimedia-operations) [2024-09-02T13:06:28Z] <TheresNoTime> [samtar@mwmaint1002 ~]$ mwscript maintenance/cleanupTitles.php --wiki=ptwiki --prefix=T195546 2>&1 | tee ~/T195546-ptwiki.log for T195546

[snipped]

Checking and fixing bad titles...
Processing page...
page 1726347 (Usuário(a):195.175.037.08) doesn't match self.
renaming 1726347 (2,'195.175.037.08') to (2,'195.175.37.8')
page 1726349 (Usuário(a):195.175.037.6) doesn't match self.
renaming 1726349 (2,'195.175.037.6') to (2,'195.175.37.6')
page 1726360 (Usuário(a):195.175.037.8) doesn't match self.
renaming 1726360 (2,'195.175.037.8') to (0,'T195546/User:195.175.37.8')
page 5121018 (F:_A_Todo_o_Gás) doesn't match self.
renaming 5121018 (0,'F:_A_Todo_o_Gás') to (0,'T195546/f:A_Todo_o_Gás')
Finished page... 4 of 5752906 rows updated

Change #1069697 merged by jenkins-bot:

[mediawiki/core@master] CleanupTitles: Check if title exists from primary database

https://gerrit.wikimedia.org/r/1069697

And it turns out (unrelated to this task) that there are a lot of wikis with "Broken/" titles from a prior run of cleanupTitles years ago (hundreds to thousands of pages in total).

Why am I even trying if everything else gets left unkempt for so long?