
RFC: Implement rendering of redlinks in Parsoid HTML as post-processor
Closed, Resolved · Public

Description

Requirements

  • Identify links to pages that do not exist.
  • Support different red link behavior on desktop (class, link to edit view) and apps (remove links).
  • Support re-styling of red links for skins & accessibility.
  • Perform well, both server & client side.

Non-requirements

  • Stub thresholds: These can be implemented client side with a separate API request, for users who have this option enabled.

Solutions

1) Separate metadata

Provide a JSON array containing hrefs of pages that do not exist. On most pages, this array will be very short or empty.

Clients can then use this array to find links to non-existing pages, and apply the transformations they would like.
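To make the client-side transformation concrete, here is a minimal sketch, assuming a hypothetical metadata payload with a "redlinks" array of hrefs (the payload shape and endpoint are illustrative assumptions, not an existing API):

// Hypothetical metadata payload: { "redlinks": ["./Missing_page_1", "./Missing_page_2"] }
async function applyRedLinks(metadataUrl) {
    const meta = await (await fetch(metadataUrl)).json();
    const missing = new Set(meta.redlinks);
    // Parsoid marks wikilinks with rel="mw:WikiLink" and a "./Title" href.
    for (const a of document.querySelectorAll('a[rel~="mw:WikiLink"]')) {
        if (missing.has(a.getAttribute('href'))) {
            // Desktop: add the standard "new" class; apps could instead unwrap the anchor.
            a.classList.add('new');
        }
    }
}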

Advantages:

  • Easy to implement server side.
  • Avoids latency overheads of server-side DOM parsing.
  • Can be consistently applied to old content. However, a general limitation of red link updates for old revisions is that current link tables are only available for the latest revision, which means that links in old revisions might not be updated.
  • Consistent matching interface helps clients render red links for specific use cases.

Disadvantages:

  • Client requirements:
    • Process JSON and run JS.
    • Extra API request.
  • API complexity: Need to expose metadata in separate API.

2) Parsoid DOM pass

Retrieve list of red links in page (as in separate metadata solution), and rewrite href / add classes in DOM processing pass.
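As a rough sketch (not the actual Parsoid code), such a pass could look like the following, assuming a server-side DOM implementation with querySelectorAll / classList support and a precomputed set of missing hrefs:

// Add the standard "new" class to links whose target does not exist, and
// optionally rewrite the href to the edit view, mirroring current desktop output.
function markRedLinks(doc, missingHrefs, { rewriteHref = false } = {}) {
    for (const a of doc.querySelectorAll('a[rel~="mw:WikiLink"]')) {
        const href = a.getAttribute('href');
        if (!missingHrefs.has(href)) {
            continue;
        }
        a.classList.add('new');
        if (rewriteHref) {
            const title = href.replace(/^\.\//, '');
            a.setAttribute('href',
                '/w/index.php?title=' + encodeURIComponent(title) + '&action=edit&redlink=1');
        }
    }
}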

Advantages:

  • Still relatively easy to implement server side.
  • Low extra cost.
  • No need for JS on the client.
  • Compatible with existing skins.
  • No need for extra API.
  • Records an accurate 'at the time' snapshot of how a page looked.

Disadvantages:

  • Implements a specific kind of handling / rewriting, which then needs to be undone / changed for other use cases.
  • Changes in href (ex: linking to edit views) complicate matching for further post-processing. Not touching the href would mean that clients would need to modify the href dynamically, negating the advantage of not needing to execute JS.
  • Does not automatically update for old revisions.
  • Increases fraction of re-renders that change the content.

3) Parsoid DOM pass + html2html update end point

As in Parsoid DOM pass, but also update red link markup in old revisions through a Parsoid HTML2HTML end point.

Advantages:

  • As for Parsoid DOM pass.
  • More up-to-date red link information in old revisions.

Disadvantages:

  • Extra complexity in Parsoid and RESTBase.
  • Increased latency and cost for accesses to old revisions, especially if the update is applied on each request.

Event Timeline

Soln 3: Provide a Parsoid API endpoint that transforms stored HTML with red link info added in. But this will require some caching / temporary storage (and associated invalidation) to alleviate the performance impact of a secondary API request.
Soln 2 will cause old revisions to render with stale / broken red links.
Soln 1 is the ideal solution.

That said, this question will surface again for "bad images". For "bad images", we do not have the luxury of breaking old revisions. Bad images are probably expected to be blacklisted in all revisions. Plus, a client-side solution for that will probably not be acceptable in the no-JS case, since those images would render in full glory.

It seems like we should pick a solution that works in both scenarios => solution 3?

@ssastry, I have added option 3) as an extension in the task description.

One question we should make up our mind on is whether we would rewrite href attributes to point to edit views, to match what is currently done on the web. Doing so would complicate further processing:

  • VisualEditor would need to strip ?action=edit from its link widget. The same will likely be true for other clients and use cases, such as link graph mining.
  • Matching could not rely on the plain href, but would need to either match on a mangled href, or use other attributes like a class instead.

If we decide not to mangle the href, then web clients would need to do so after the fact, negating the advantage of not needing to run JS.

Another consideration is that link updates in old revisions are currently only efficiently possible using link information from the latest revision. This means that links that were only present in the older revision won't be checked for existence in queries based on the latest revision's links.

In the context of the API-driven front-end effort, it seems likely that we'll serve metadata about a page separately. This metadata would contain red links, language links (from Wikidata), categories, table of contents hints and more. If we assume that we'll need this anyway for other bits of metadata, then the extra cost of a general page metadata request becomes less of an issue in this discussion.

So, overall I'm still torn, but am starting to lean more towards 1) than 2) / 3).

cscott: the image blocklist is actually very, very small; far fewer than 1000 images are on it.
cscott: i'm not worried about caching the image as blocked in the archive. it's the same as any other template dependency; the page "as it looks right now" depends on the current state of the image blocklist.
cscott: because it's small and a corner case, i don't think it should drive redlink discussions/decisions.

redlinks are orders of magnitude more common. let's not try to make these two cases the same thing.

To provide a rough idea of the cost of server-side link processing, matching / processing all 1852 links in [[Barack Obama]] using https://github.com/wikimedia/elematch takes about 32ms of CPU time. LibXML offers similar performance for a full DOM parse plus XPath match, but incurs some extra cost for DOM serialization.

The cost is directly proportional to the number of elements matched. For comparison, matching just the <figure> elements in Obama takes only 2ms.

Edit: Updated benchmark results at https://github.com/wikimedia/elematch#performance

Edit 2: Assuming XML spec compatible serialization, it turns out that we should actually be able to implement relatively efficient matches on attribute values as well. There are several gotchas around optional entity encoding, but they look solvable so far. I'll try to explore this approach in elematch.

A patch for element & attribute matching is now up at https://github.com/wikimedia/elematch/pull/5. Performance results for Obama (1.5 MB):

  • match / replace all figures: 1.95ms
  • match / replace all links: 13.9ms
  • match / replace a *specific* link (a[href="./Riverdale,_Chicago"]): 1.9ms
  • match / replace references section (ol[typeof="mw:Extension/references"]): 3.7ms

Other options:

  • libxml XML DOM round-trip (no matching): 51.1ms
    • plus xpath match: ~15ms
  • domino round-trip (no matching): 255.3ms

This means that we can dynamically update red links in large documents using less than ~2ms of CPU time. This is over an order of magnitude less than libxml in XML mode, the next best option.

RobLa-WMF mentioned this in Unknown Object (Event). Apr 13 2016, 6:54 PM
Krinkle renamed this task from RFC: Implement rendering of redlinks (in a post-processor?) to RFC: Refactor rendering of redlinks as post-processor. Apr 13 2016, 8:40 PM
GWicke renamed this task from RFC: Refactor rendering of redlinks as post-processor to RFC: Implement rendering of redlinks in Parsoid HTML as post-processor. Apr 13 2016, 10:23 PM

Based on the performance information we have from elematch, I think it is clear that updating red link information server-side is doable at very moderate cost. Here is a proposal:

Implement red link updates in change propagation

The recently deployed change propagation service will follow page creation / deletion events, and will initiate red link updates for all pages linking to the created / deleted pages. (Services)

Parsoid to mark red links with an attribute by default

Regular parse: Parsoid will request a list of red links from the API, and annotate <a> elements with appropriate attributes in a DOM pass.

Update: When getting a 'red link update' request from the change propagation service, RESTBase will pass the current HTML and the update mode ("redlinks") to Parsoid, the same way it already does right now for template or image updates. Parsoid will then update red links using any technique of its choosing. For example, it could do this using elematch, avoiding the need for expensive DOM parsing. (Parsing-Team--ARCHIVED)

Issue: API for red links

As far as I am aware, there is currently no API that lets us efficiently retrieve the list of red links in the current version of a given article. https://en.wikipedia.org/w/api.php?action=help&modules=query+links provides a full list of links, but does not mark red links specially. Nor does it support only retrieving red links.

This end point already supports restricting links to a specific namespace using the plnamespace parameter. Along the same lines, we could add a plbrokenlinks flag parameter, which would restrict the returned list of titles to broken / red links only.

@Anomie @ssastry @Pchelolo @mobrovac @Arlolra, does this sound sane to you?

(Services or Parsing-Team--ARCHIVED)

As far as I am aware, there is currently no API that lets us efficiently retrieve the list of red links in the current version of a given article. https://en.wikipedia.org/w/api.php?action=help&modules=query+links provides a full list of links, but does not mark red links specially.

You could use something like https://en.wikipedia.org/w/api.php?action=query&generator=links&titles=User:GWicke/LinkTest.
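For illustration, a minimal sketch of consuming that query (with formatversion=2, where missing pages carry a "missing" flag) and keeping only the titles that do not exist; a complete client would also follow continuation for pages with many links:

async function getRedLinkTitles(apiUrl, title) {
    const params = new URLSearchParams({
        action: 'query',
        generator: 'links',
        titles: title,
        gpllimit: 'max',
        format: 'json',
        formatversion: '2',
    });
    const res = await (await fetch(apiUrl + '?' + params)).json();
    return ((res.query && res.query.pages) || [])
        .filter((page) => page.missing)   // pages that do not exist
        .map((page) => page.title);
}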

This end point already supports restricting links to a specific namespace using the plnamespace parameter. Along the same lines, we could add a plbrokenlinks flag parameter, which would restrict the returned list of titles to broken / red links only.

We might have to disable it (or reduce the functionality) in miser mode though, like we do with various other parameters, to avoid bugs like T97797.

I'll try to remember to look at it next week when I'm over the flu.

@Anomie, thanks for the input! And sorry to hear about the flu; hope you are better soon!

This end point already supports restricting links to a specific namespace using the plnamespace parameter. Along the same lines, we could add a plbrokenlinks flag parameter, which would restrict the returned list of titles to broken / red links only.

We might have to disable it (or reduce the functionality) in miser mode though, like we do with various other parameters, to avoid bugs like T97797.

The current query for this module is reasonably simple, basically

-- Generic
SELECT pl_from, pl_namespace, pl_title FROM pagelinks WHERE pl_from IN (...) ORDER BY pl_from ASC, pl_namespace ASC, pl_title ASC LIMIT 501;

-- With plnamespace
SELECT pl_from, pl_namespace, pl_title FROM pagelinks WHERE pl_from IN (...) AND pl_namespace IN (...) ORDER BY pl_from ASC, pl_namespace ASC, pl_title ASC LIMIT 501;

Adding a plbrokenlinks parameter would require joining with page, though:

SELECT pl_from, pl_namespace, pl_title FROM pagelinks LEFT JOIN page ON (pl_namespace = page_namespace AND pl_title = page_title) WHERE pl_from IN (...) AND page_id IS NULL ORDER BY pl_from ASC, pl_namespace ASC, pl_title ASC LIMIT 501;

Just a join there is probably not too bad (it's an eq_ref which isn't a bad join type, although it's still a join on a 261-byte key so it's not the greatest), but using the join as a filter where the vast majority of the time the filter is going to fail would certainly be a T97797 issue waiting to be reported.

So what you'd end up with on WMF wikis is that using your plbrokenlinks parameter would do a query more like

SELECT pl_from, pl_namespace, pl_title, page_id FROM pagelinks LEFT JOIN page ON (pl_namespace = page_namespace AND pl_title = page_title) WHERE pl_from IN (...) ORDER BY pl_from ASC, pl_namespace ASC, pl_title ASC LIMIT 501;

then on the PHP side it would filter out the rows with $row->page_id !== null, typically leaving zero to actually be returned and a continuation for it to do the same check on the next batch of links.

RobLa-WMF mentioned this in Unknown Object (Event). Apr 20 2016, 6:43 AM

We are currently working on implementing the update logic in the change propagation service: T133221. This will trigger HTML re-renders for all pages linking to a created / deleted page. As part of the re-render request (see T114413), we will supply

  • an update mode ('redlinks', 'brokenlinks' or the like),
  • whether this is a deletion or creation, and
  • the created / deleted title (so that updates can be targeted).

The number of update requests from this will be fairly small, so optimizations are not critical. However, for this to actually result in red links being inserted / updated, Parsoid will need to mark up red links as part of the normal render process.
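For illustration, the re-render request could carry something like the following; the field names here are assumptions based on the list above, not a finalized schema:

// Illustrative shape of a red link re-render request from change propagation.
const rerenderRequest = {
    mode: 'redlinks',        // update mode ('redlinks', 'brokenlinks' or the like)
    change: 'created',       // or 'deleted'
    title: 'Some_new_page',  // the created / deleted title, so updates can be targeted
};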

Here's my understanding of the chronology so far:

  1. Filed by @Jdforrester-WMF on 2012-06-24
    • Title: "VisualEditor: Parsoid needs to hint on each 'internal' link as to the status of its target (redlink; stub; interwiki)"
    • Description: "For the arbitrary stub threshold preference, perhaps just hint with the character length of the link and leave it to the UI to decide how to render based on user preferences? This would mean that we wouldn't fragment the parser cache..."
    • Assignee: @GWicke
  2. @GWicke drops the priority to lowest on 2012-06-29
  3. Jan 2013: new title "Implement rendering of redlinks and stubs" (moved out of Parsoid and into Bugzilla's "General" project) T39902#413848
  4. Discussion in 2013, suggesting Action API changes, and hinting that "bug 11" (T2011) could be fixed by this
  5. Feb 2014: James reports that @Catrope will attempt to implement this, "at least as a first run", and raises the priority to high (T39902#413952).
  6. March 2014: James fixes related issue T39901
  7. Early 2015: James drops Roan as assignee, @ssastry drops priority to normal
  8. Sept 2015: @Jdlrobson expands the task description (T39902#1690785)
  9. March 2016: @GWicke flags this as a TechCom-RFC
  10. April 2016:
    • TechCom very briefly discusses this in E161 as part of our planning/triage
    • @GWicke posts an implementation strategy, but remarks "As far as I am aware, there is currently no API that lets us efficiently retrieve the list of red links in the current version of a given article." (T39902#2208376)
    • @Anomie responds, advising about how to do this, but cautions: "We might have to disable it (or reduce the functionality) in miser mode though, like we do with various other parameters, to avoid bugs like T97797." (T39902#2209625)
    • @GWicke ties this to the change propagation service (T133221) and suggests Parsoid would need to provide redlink markup as part of the re-render request (T114413). (T39902#2240375)
  11. August 2016: @Jdlrobson cross-posts "Redlinks in Android app" on mobile-l and wikitech-l: "I've cced wikitech in case anyone has an update."

Next steps? I suppose TechCom can discuss this in E258, and possibly make this a candidate for one of our office hours in the not-too-distant future. We can discuss the TechCom-RFC triage process in Z425.

@Jdlrobson and I just had a really good conversation about this. The thing TechCom (or someone) needs to provide is guidance on what code someone (possibly Jon?) can contribute in order to accelerate progress on the redlink issue. @Anomie provided advice on how to use the Action API to get redlink information. @GWicke, is there any guidance you're waiting on the rest of TechCom for? @ssastry, do you agree with @GWicke that Parsoid needs modification to provide redlink information in the HTML?

@Anomie provided advice on how to use the Action API to get redlink information.

Can you put that here?

@Anomie provided advice on how to use the Action API to get redlink information.

Can you put that here?

See his April 2016 comments, starting at T39902#2209625

Change 303827 had a related patch set uploaded (by Arlolra):
[WIP] Add redlinks to a doc

https://gerrit.wikimedia.org/r/303827

Comment from Arlo's Gerrit change 303827:

[WIP] Add redlinks to a doc

  • Makes use of the suggestion here: T39902#2209625
  • Since this is still a WIP, provides an internal query string param (_redlinks) to add them to wt2html requests.

This probably belongs in html2html, as described here: T39902#2240375
However, it's not clear that there's a consensus on doing that yet.
See the discussion in T114413.

Thanks @RobLa-WMF for the summary and the poke. :) Looks like @Arlolra submitted a wip patch already, but a bunch of details still need to be ironed out. So, I am moving that discussion here.

So, it appears that the original idea of storing HTML (without redlinks) and doing a post-process on that HTML (in an html2html pass) has morphed into something different:

  • Render red links in Parsoid as part of the normal render. Internally, the code can continue to be structured as a DOM transformation on the parsed HTML.
  • Rely on Change Propagation to provide Parsoid the information about what redlinks to update, which can be done efficiently as an html2html transform (without reparsing the page from scratch). T39902#2199600 has some options.

I am not opposed to this, but, can someone summarize what led to this change? @GWicke ?

But, in any case, no matter how / when the redlink transform happens, in order to get the redlinks info we would have to query the MediaWiki API. Here are a few different options for doing so:

  • @Anomie provided a link to the links API. That can work, but to get the full list of links the continuation would have to be followed, which means a bunch of API calls (in series, not in parallel) in the general case; this is inefficient and adds to the parse time.
  • It would be better to have a redlinks API call that only returns missing links (if that can be done efficiently). But I understand that this is Parsoid-specific, which brings me to the next alternative.
  • This can be done as part of the ParsoidBatchAPI extension (which has some custom Parsoid functionality already).
  • However, as far as I understand, all three solutions above assume / require that the core parser has already processed the page, which is an unnecessary dependency. Instead, a better alternative would be to have Parsoid send a bunch of links to the API and request the list of links that don't have a corresponding page entry in the database (see the sketch after this list). This functionality is probably best done in the ParsoidBatchAPI and also eliminates the dependency on the core parser.
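As a sketch of that last option, the batched existence check can be expressed against the existing query API (a ParsoidBatchAPI variant would presumably do the equivalent page-table lookup internally). The Action API caps titles per request, so a real implementation would chunk and normalize the list:

async function findMissingTitles(apiUrl, titles) {
    const params = new URLSearchParams({
        action: 'query',
        titles: titles.join('|'),   // batch of link targets to check
        format: 'json',
        formatversion: '2',
    });
    const res = await (await fetch(apiUrl + '?' + params)).json();
    return ((res.query && res.query.pages) || [])
        .filter((page) => page.missing)
        .map((page) => page.title);
}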

Thoughts?

I am not opposed to this, but, can someone summarize what led to this change?

Red links change rarely, and we do have information & infrastructure to propagate those changes. This means that it is significantly cheaper (esp. for the PHP API) to only process red links on change, rather than on each view.

Rely on Change Propagation to provide Parsoid the information about what redlinks to update

This is a possible optimization, but if that precise information is not available right away we could perhaps start by indicating "update all red links" only. See T114413: Support various conversions in Parsoid's pb2pb endpoint for more on the html2html end point.

Rely on Change Propagation to provide Parsoid the information about what redlinks to update

This is a possible optimization, but if that precise information is not available right away we could perhaps start by indicating "update all red links" only. See T114413: Support various conversions in Parsoid's pb2pb endpoint for more on the html2html end point.

Basically, we already have (partial) code in CP to propagate redlink updates, but it's not enabled. What it would do is, on each page creation / deletion, go over the links table to see which pages were linking to this one and send rerender requests to each of those pages.

So at this point ChangeProp knows which particular redlink was updated and can give this information to RESTBase and Parsoid. So if Parsoid implements this as an html2html endpoint with the semantics of 'add/remove redlink information for this one particular title', it wouldn't need to call the MW API at all.

We could even not involve parsoid and do it with a regex in RESTBase (I'm joking)

Unfortunately, relying on the links API or the current database state isn't enough. We also have to be aware of Title::isKnown(), so things like special pages are blue if they exist, foreign files/global user pages (and future shadow namespaces) are looked up remotely.

So at this point ChangeProp knows which particular redlink was updated and can give this information to RESTBase and Parsoid. So if Parsoid implements this as an html2html endpoint with the semantics of 'add/remove redlink information for this one particular title', it wouldn't need to call the MW API at all.

The base case is not solved ... you are talking about steady state behavior after a revision has been rendered and stored. But, on page edit, when the page is reparsed, you need to set up red links, which requires getting information from the API.

Unfortunately, relying on the links API or the current database state isn't enough. We also have to be aware of Title::isKnown(), so things like special pages are blue if they exist, foreign files/global user pages (and future shadow namespaces) are looked up remotely.

This is not a problem per se. Whatever code in MediaWiki provides this functionality can be implemented in the ParsoidBatchAPI extension / the API call providing it, unless I am missing something.

So at this point ChangeProp knows which particular redlink was updated and can give this information to RESTBase and Parsoid. So if Parsoid implements this as an html2html endpoint with the semantics of 'add/remove redlink information for this one particular title', it wouldn't need to call the MW API at all.

It should watch out for races. Two pages are created in close succession. The creation of Page_1 triggers a job to update that link in the DOM to blue / existing. While that is running, Page_2 is created, triggering another job. It needs to not clobber the first.

Unfortunately, relying on the links API or the current database state isn't enough. We also have to be aware of Title::isKnown(), so things like special pages are blue if they exist, foreign files/global user pages (and future shadow namespaces) are looked up remotely.

I see T141963: Expose Title::isKnown() status via the API already exists.

Also, BTW, the suggested API query already handles Special- and Media-namespace links. That does leave MediaWiki-namespace links and anything using the 'TitleIsAlwaysKnown' hook.

@ssastry, I put this on the list for the next services / parsoid sync meeting. Red links are frequently requested, and we now have the tools to implement this efficiently.

Change 303827 abandoned by Arlolra:
[WIP] Add redlinks to a doc

Reason:
Per offsite

https://gerrit.wikimedia.org/r/303827

To highlight one of the blocked tasks: T119266 makes for a pretty bad user experience in the Android app.

ssastry raised the priority of this task from Medium to High. Apr 14 2017, 6:24 PM

Change 351537 had a related patch set uploaded (by Arlolra; owner: Arlolra):
[mediawiki/services/parsoid@master] WIP: Port VE's LinkCache::styleElement

https://gerrit.wikimedia.org/r/351537

Change 351537 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Implement rendering of redlinks, etc. as a post-processor

https://gerrit.wikimedia.org/r/351537

Mentioned in SAL (#wikimedia-operations) [2017-06-26T18:59:17Z] <mobrovac@tin> Started deploy [restbase/deploy@3975ab2]: Update Parsoid HTML version to 1.5.0 - T39902

Mentioned in SAL (#wikimedia-operations) [2017-06-26T19:05:33Z] <mobrovac@tin> Finished deploy [restbase/deploy@3975ab2]: Update Parsoid HTML version to 1.5.0 - T39902 (duration: 06m 16s)