Wikipedia:Wikipedia Signpost/2014-10-15/Technology report

Technology report

<big>Attempting<ref>{{citation needed</ref>}} to parse <code>wikitext</code></big>

This week we sat down with The Earwig to learn about his wikitext parser, mwparserfromhell.

What is mwparserfromhell, and how did it get its name?

mwparserfromhell (which I will abbreviate as mwpfh) is a Python parser for wikicode. In short, it allows bot developers (like those using pywikibot) to systematically analyze and manipulate wikitext, even in cases where it is complex or ambiguous.

For example, let's say we want to see if a page transcludes a particular template, check whether it has a particular parameter, and if not, add it. A classic application would be a bot that dates {{citation needed}} tags. This isn't as simple as it sounds! A naive solution might use regexes, but then we need to check whether the parameter exists between the template's opening and closing brackets, but not get confused if it's inside of a template contained within the template (for example, if you had {{citation needed|reason=This fact is important.{{citation needed|date=October 2014}}}}), whether the template is between <nowiki> tags, and so on...
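To make the pitfall concrete, here is a minimal sketch of the naive regex approach using only Python's standard `re` module. The pattern and the helper name `naive_has_date` are illustrative assumptions, not code from any actual bot.

```python
import re

# Naive pattern: everything from "{{citation needed" up to the first "}}".
NAIVE = re.compile(r"\{\{citation needed(.*?)\}\}", re.DOTALL)


def naive_has_date(wikitext):
    """Return True if the first matched tag appears to contain a date parameter."""
    match = NAIVE.search(wikitext)
    return bool(match) and "date=" in match.group(1)


nested = ("{{citation needed|reason=This fact is important."
          "{{citation needed|date=October 2014}}}}")

# The outer tag has no date, but the match runs into the inner tag and
# picks up that tag's date parameter, so the check wrongly reports True.
print(naive_has_date(nested))

# The pattern also matches text inside <nowiki>, which is not a real transclusion.
print(NAIVE.search("<nowiki>{{citation needed}}</nowiki>") is not None)
```

Both checks print `True`, and both answers are wrong for the bot's purposes: the outer tag is undated, and the tag inside `<nowiki>` is not a template at all.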

mwparserfromhell makes this easy by creating a tree representation of the wikicode (loosely described as a parse tree) that can be converted back to wikicode after any modifications are made. It focuses on being as accurate as possible, both in terms of the tree representation being accurate, and the outputted wikicode being as similar to the original as possible.

Its name comes courtesy of Σ, reflecting the somewhat insane nature of the project, and as an excuse for its frightening codebase.

What led you to develop it in the first place?

I’ve been writing bots and tools/scripts for many years – situations like the one above come up a lot. Sure, ad hoc solutions using regexes work sometimes, but I wanted something that would work in more general cases. mwparserfromhell seemed like a project that would be useful to most bot developers, and for which there was no existing equivalent.

What were some of the challenges you faced or things that didn't go according to plan while developing the parser? How did you manage them?

Oh, boy. It turns out that wikicode is a horrible, horrible language, for people and computers alike. It lacks a clear definition of how certain edge cases should be handled, and since mwparserfromhell’s goal is to be accurate, a lot of time was spent just trying to figure out how MediaWiki works. Many language parsers are designed to give up once they see a syntax error, like a missing bracket somewhere, but MediaWiki considers all possible wikitext to be valid, so a lot of mwpfh’s code involves making sense of some very questionable things (like templates nested inside of HTML tag attributes nested inside of external links, or the difference between {{{{{foo}}bar}}} and {{{{{foo}}}bar}}) and handling them as closely as possible to the way MediaWiki does. Sometimes this is hard, but other times it is outright impossible and we have to make guesses. For example, if we imagine that the template {{close ref}} transcludes </ref> and the parser encounters the wikicode <ref>{{cite web|…}}{{close ref}}, it will appear as if the <ref> tag does not end, even though it does. This is a limitation inherent in the nature of parsing wikicode: we have no knowledge of the contents of the template, so we can't figure out every situation. mwparserfromhell compromises as best as it can, by treating the <ref> tag as ordinary text and fully parsing the two templates.

How does mwparserfromhell compare to other re-implementations of the MediaWiki parser, like Parsoid?

Most projects like Parsoid (or MediaWiki’s own PHP parser) are designed to convert wikicode to HTML so that it can be viewed or edited by users. mwparserfromhell converts wikicode into a tree structure for bots, and that structure must contain enough information (such as HTML comments, whitespace, and malformed syntax that other parsers would outright ignore or try to correct) for it to be manipulated and converted back to wikitext with no unintentional modifications. Furthermore, it has less awareness of context than other parsers: because it is designed to deal with wikicode on a fairly abstract level, it doesn't know the contents of a template and can't make any substitutions. As noted above, this causes problems sometimes, but it's necessary for the parser to be useful to bots that are manipulating the templates themselves.

What is the most significant challenge that mwparserfromhell currently faces, and why?

It’s a difficult, exhausting project that would ideally have multiple people working on it. Development has stalled recently as I've been busy with college, and additional eyes would be useful to point out potential issues or help out with open problems.

What's next for mwparserfromhell? Do you have any other cool projects you'd like to tell us about?

Some wikitext constructs (primarily tables, but also parser functions and #REDIRECTs) aren’t understood by mwparserfromhell, so I would like to implement those. There’s actually an open request to review some code for table support that I've been procrastinating on for a couple of months now. Other than that, I have some plans to make it more efficient; mwpfh has some speed issues with ambiguous syntax on large pages.

My copyvio detection tool on Wikimedia Labs (which uses mwparserfromhell, by the way!) has seen a lot of improvements lately, including more accurate detection, more detailed search results, and a fresh new API. If you don't know about it or have only used it in the past, I invite you to give it a spin.

Discuss this story

First of all, it's great to see progress on making it easier to edit our content using a variety of tools.

That said, I think that it's worth looking more closely at Parsoid. It also provides a well-defined tree structure, but covers basically every aspect of wikitext. It even marks up multi-template content in a way that makes it easy to replace the entire block of templated content.

The DOM structure it provides can be edited by bots, gadgets or external services like content translation (see a list of current users). There is no limitation to manual editing; any method of manipulating HTML will work. A combination of several algorithms (video) is used to avoid dirty diffs (unintended changes in the wikitext).

We are very interested in improving Parsoid further for bots and other uses. Let us know about your needs. You can find us on IRC in #mediawiki-parsoid. -- GWicke (talk) 14:36, 17 October 2014 (UTC)

Great to see mwpfh getting some attention, it's been tremendously useful to the code behind SuggestBot. Keep up the great work! Cheers, Nettrom (talk) 14:57, 17 October 2014 (UTC)

To add to what GWicke said, Parsoid's express goal is to be a bidirectional converter (wikitext -> html; html -> wikitext), be a clean wikitext roundtripper (wikitext -> html -> wikitext without introducing dirty diffs in unedited portions of wikitext), and also be a semantically identical HTML roundtripper (html -> wikitext -> html -- we don't handle arbitrary HTML yet). But yes, Parsoid and mwpfh provide different representations -- Parsoid's representation is HTML5 with some RDF annotations, whereas mwpfh's representation is probably more along the lines of an Abstract Syntax Tree? However, both are well-structured tree representations which can be manipulated fairly easily depending on what representation is found suitable / useful for the application at hand. SSastry (WMF) (talk) 15:32, 17 October 2014 (UTC)

@GWicke: Thanks for your comments. For what it's worth, Parsoid wasn't as usable or stable when work on mwpfh started, but I digress. I'm still a bit unclear on how Parsoid handles multi-level template nesting. I tried to parse "{{foo|{{bar|{{baz|abc=123}}}}}}" and got <span about="#mwt1" typeof="mw:Transclusion" data-parsoid='{"dsr":[0,31,null,null],"pi":[[{"k":"1","spc":["","","",""]}]]}' data-mw='{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},"params":{"1":{"wt":"{{bar|{{baz|abc=123}}}}"}},"i":0}}]}'></span>, and I'm not sure how I could, say, use this to read the value of the "abc" parameter in {{baz}}. Would I need to use Parsoid again on the value of that "wt" key or am I missing something? Part of mwpfh's usefulness for bots is that the trees it generates have methods for common wikicode manipulation – there are simple functions for adding template parameters and the like, modifying and traversing the tree, etc. As far as I know, Parsoid is focused solely on the parsing aspect and doesn't support this kind of stuff directly, but it raises the question of whether it could be useful as an alternate backend for mwpfh. Would be annoying to have to deal with outsourcing queries from Python to a node.js subprocess, but it could be an interesting experiment. — Earwig ^talk 17:41, 17 October 2014 (UTC)
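The point about the "wt" key can be seen by decoding the `data-mw` attribute quoted above with Python's standard `json` module. This is only a sketch over that one quoted string; it makes no claim about Parsoid's output format in general.

```python
import json

# The data-mw attribute value exactly as quoted in the comment above.
data_mw = '{"parts":[{"template":{"target":{"wt":"foo","href":"./Template:Foo"},"params":{"1":{"wt":"{{bar|{{baz|abc=123}}}}"}},"i":0}}]}'

parsed = json.loads(data_mw)
template = parsed["parts"][0]["template"]

# The outer template's name is exposed as structured data...
print(template["target"]["wt"])

# ...but its first parameter is still an unparsed wikitext string, so the
# nested {{bar}} and {{baz}} templates would need another parsing pass.
print(template["params"]["1"]["wt"])
```

The first line printed is `foo` and the second is `{{bar|{{baz|abc=123}}}}`, so reading `abc` from this output alone requires parsing that string again.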

@The Earwig: Thank you for your response as well! The Parsoid and mwpfh projects did indeed start at around the same time. Back then there were no good parsing options for editing, and it wasn't even clear whether full editing of typical wiki content would be technically feasible. Both projects have independently done pioneering work.

You bring up a good use case where the usability of the Parsoid DOM is not optimal for bots interested in nested parameters. Parsoid actually supports exposing the templated parameters as HTML (using the 'html' key instead of 'wt'), but this is not currently enabled in production. We should be ready to switch this on in a month or two.

Generally the idea with Parsoid is to do all the parsing on the server, so that bots don't have to deal with it. The workflow is basically retrieve HTML from the API, edit it, and send the modified HTML back to the API for saving. Convenient APIs for this workflow (especially the saving part) are being worked on right now. I agree that having a more specialized client-side interface / library for specific tasks like template editing is very useful. Your idea of using Parsoid as a backend for mwpfh sounds very promising to me, and could even expand beyond templates into other content. -- GWicke (talk) 21:35, 17 October 2014 (UTC)

I tried mwpfh once, and I must say I was disappointed. It does not distinguish parser extension tags (like <ref> or <source>) from tags like <b> and <table> which are (mostly) passed through to HTML. If I recall correctly, there is no support for noinclude/includeonly/onlyinclude. Sufficiently tricky markup can get mwpfh really confused. It seems the authors developed it by trial and error instead of actually looking up how MediaWiki parses markup (which is not that hard, really: just read the source). By the way, I just tried the <ref>foo{{close ref}} thing… it does not work as described here (as I expected, because I know how the parser works). You will be better off using mw:API:Expandtemplates with the generatexml option instead. (I would avoid Parsoid too, it has similar warts.) — Keφr 21:33, 20 October 2014 (UTC)
