Media links with underscores in the URL are dirty-diffed (with no underscores)
Open, Medium, Public

Event Timeline

ssastry triaged this task as High priority. Nov 25 2019, 9:46 PM
ssastry created this task.

officewiki is running 1.35.0-wmf.5 (rMWa473ba08a21b) and VisualEditor 0.1.1 (26ebdd0) 14:26, 5 November 2019.

The revert for T237040 was cherry picked to branch wmf/1.35.0-wmf.2 as commit 87ef3e53e533c9565d226e6a48ed70e673d636d1: https://gerrit.wikimedia.org/r/543956

So this issue shouldn't be caused by T237040, AFAICT.

ssastry renamed this task from Media links dirty diffed on officewiki to Media links with underscores in the URL are dirty-diffed (with no underscoes). Nov 25 2019, 10:04 PM
ssastry renamed this task from Media links with underscores in the URL are dirty-diffed (with no underscoes) to Media links with underscores in the URL are dirty-diffed (with no underscores).

Seems to be present on both enwiki and officewiki, so not a Parsoid/PHP issue.

Cf: https://en.wikipedia.org/w/index.php?title=User:Cscott/T237040&type=revision&diff=927959666&oldid=927959637&diffmode=source

But doesn't occur in straight wt2wt:

$ echo '[[Media:CBQ_RPO_1938.jpg|caption]]' | bin/parse.js --wt2wt
[[Media:CBQ_RPO_1938.jpg|caption]]

VE sends HTML like this back to Parsoid:

<body id=\"mwAA\" class=\"mw-content-ltr sitedir-ltr ltr mw-body-content parsoid-body mediawiki mw-parser-output\" dir=\"ltr\" lang=\"en\"><p id=\"mwAg\"><a href=\"./Media:CBQ_RPO_1938.jpg\" rel=\"mw:WikiLink\" resource=\"./Media:CBQ_RPO_1938.jpg\" title=\"CBQ RPO 1938.jpg\" id=\"mwAw\">This is a caption</a></p>
<p id=\"mwBA\">xyz</p></body>

while Parsoid's wt2wt has HTML like this at the midpoint:

<p data-parsoid='{"dsr":[0,34,0,0]}'><a rel="mw:MediaLink" href="//upload.wikimedia.org/wikipedia/en/f/fb/CBQ_RPO_1938.jpg" resource="./Media:CBQ_RPO_1938.jpg" title="CBQ RPO 1938.jpg" data-parsoid='{"a":{"resource":"./Media:CBQ_RPO_1938.jpg"},"sa":{"resource":"Media:CBQ_RPO_1938.jpg"},"dsr":[0,34,null,null]}'>caption</a></p>

It's not clear where the spaces in the serialized title are coming from; they aren't present in the href, resource or data-parsoid. The only place we have spaces is the title attribute.
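
To make the comparison between the two <a> elements easier to see, here is a minimal Python sketch that diffs their attributes. It is not Parsoid or VE code: the helper names (AnchorAttrs, anchor_attrs, diff_attrs) are made up for illustration, and the data-parsoid attribute is left out of the Parsoid snippet to keep the string literal short.

```python
from html.parser import HTMLParser


class AnchorAttrs(HTMLParser):
    """Collect the attributes of the first <a> element seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.attrs is None:
            self.attrs = dict(attrs)


def anchor_attrs(html):
    """Return the attribute dict of the first <a> in the given HTML string."""
    parser = AnchorAttrs()
    parser.feed(html)
    parser.close()
    return parser.attrs


def diff_attrs(a, b):
    """Map each attribute whose value differs to its (a, b) pair."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# The <a> element VE sends back, with the JSON escaping removed.
VE_HTML = (
    '<a href="./Media:CBQ_RPO_1938.jpg" rel="mw:WikiLink" '
    'resource="./Media:CBQ_RPO_1938.jpg" title="CBQ RPO 1938.jpg" '
    'id="mwAw">This is a caption</a>'
)

# The <a> element at the wt2wt midpoint; data-parsoid omitted for brevity.
PARSOID_HTML = (
    '<a rel="mw:MediaLink" '
    'href="//upload.wikimedia.org/wikipedia/en/f/fb/CBQ_RPO_1938.jpg" '
    'resource="./Media:CBQ_RPO_1938.jpg" title="CBQ RPO 1938.jpg">caption</a>'
)

ve = anchor_attrs(VE_HTML)
midpoint = anchor_attrs(PARSOID_HTML)

# Attributes that differ between what VE sends and what Parsoid produced.
differences = diff_attrs(ve, midpoint)
for name, (ve_value, parsoid_value) in differences.items():
    print(f"{name}: VE={ve_value!r} Parsoid={parsoid_value!r}")

# Attributes in the VE HTML whose value contains a space.
with_spaces = [name for name, value in ve.items() if " " in value]
print("attributes with spaces:", with_spaces)
```

On these two snippets, resource and title are identical; the elements differ in rel (mw:WikiLink vs. mw:MediaLink), href and id. Whether the rel/href difference is what pushes the serializer toward the title attribute is only a guess at this point.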

I don't know whether it makes any difference, but I'd like to point out that this also happens for media links inside <gallery> tags, see example.

They generally do share serialization code but the dirtying there is coming from T214649, since the gallery presumably wasn't edited in that case. There's also T211895 / T151367 to deal with other normalizations in galleries.