Summary: we get a Server HTTP error 500 instantly with
{{#language:code1|code2}}
if code2 contains a single or double quote, or an ampersand.
So,
{{#language:en|'}} or {{#language:en|"}} or {{#language:en|&}}
DO crash. As these three characters are not valid in BCP47 language/locale codes (or the few legacy non-standard codes used in Wikimedia sites and remaining in various historic pages), the "codes" in parameter are returned verbatim without mapping them to a native language name.
But,
{{#language:'}} or {{#language:'|en}} or {{#language:"}} or {{#language:"|en}} or {{#language:&}} or {{#language:&|en}}
DO NOT crash: they are returned verbatim (in fact, only as decimal numeric character entities).
Details follow.
No language code should ever contain these three characters. Some local extensions may want to use other characters such as spaces/underscores, colons, slashes, at signs, dots... but these don't crash the #language function, not even if we attempt to feed it non-ASCII characters. Any occurrence of such characters in parameter 1 makes #language return the input string verbatim without translating it, so:
"{{#language:français}}" returns "français"
"{{#language:Slovopedia}}" returns "Slovopedia"
Now let's use a valid language code in parameter 1 and also feed the second parameter (to indicate that we want the language name translated into another target language, if possible):
"{{#language:fr|en}}" returns "French"
"{{#language:fr|fr}}" returns "français"
"{{#language:fr|de}}" returns "Französisch"
OK now with missing translations (and no fallback):
"{{#language:pdc|ckb}}" returns "Pennsylvany German":
both codes are valid, there's no other fallback than English
"{{#language:pdc|ckb-brai}}" returns "Pennsylvany German":
both codes are valid BCP47 codes, but the Braille script variant of language code "ckb" is still undefined (this would require implementing the transliteration scheme to Braille for this language); the server may retry using BCP47 rules, looking for a translation in "ckb" only. It does not find one and, after looking through the defined fallbacks of "ckb", finally selects the default, which gives the name of "pdc" in English.
Now with invalid codes:
"{{#language:pdc|ckb brai(1)}}" returns "Plattdütsch":
the second code is invalid under all rules, so it is ignored. No fallback chain can be determined, so the server will try to find the native name (all supported languages in MediaWiki have a native name, or "autonym").
Now with invalid codes including the apostrophe-quote:
"{{#language:pdc|ckb it's failing}}": the server crashes with HTTP 500.
This is a serious issue which could allow a DoS attack on the server if the following very simple code:
"{{#language:en|'}}"
is inserted in a widely used template: it would block navigation over lots of pages (and many server errors 500 may drain a lot of resources, if this error 500 comes from a PHP instance crash that must be restarted).
This code could be generated by feeding the second parameter with a subpagename (coming from {{SUBPAGENAME}} where it is HTML-encoded, or from {{SUBPAGENAMEE}} where it is URL-encoded with the legacy "WIKI" style).
To correct this:
The 2nd parameter of #language must be checked like the 1st one: if the string is longer than allowed language codes (you could accept up to the max length of a page name), or if it contains any of the characters ['"&], treat this parameter as an invalid language code and ignore it (you can still use the 1st code to return the autonym mapped to it).
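The proposed check can be sketched as follows. This is only an illustration in Python, not MediaWiki's actual PHP code: the function names are hypothetical, and the 255 limit is an assumption standing in for "the max length of a page name", which this report does not quantify.

```python
import re

# Assumption: 255 stands in for "the max length of a page name".
MAX_CODE_LENGTH = 255

# The three characters that trigger the HTTP 500 in parameter 2.
FORBIDDEN_CHARS = re.compile(r"""['"&]""")


def is_acceptable_code(code):
    """Return True if the string may be looked up as a language code."""
    if len(code) > MAX_CODE_LENGTH:
        return False
    return FORBIDDEN_CHARS.search(code) is None


def effective_target(target):
    """Return the target code if usable, else None.

    None means: ignore parameter 2 and fall back to the autonym
    of the language given in parameter 1.
    """
    if target is None or not is_acceptable_code(target):
        return None
    return target
```

With such a check, a target like "ckb it's failing" would be ignored in the same way as "ckb brai(1)" is today, instead of reaching the code path that crashes.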
For now, on MediaWiki-Wiki, I completed the following article about the issues and tricky details (and other related bugs/inconsistencies I discovered):
[[mw:Manual:PAGENAMEE encoding]]
Look at the table in this page showing the effects of the various encodings used in pagenames or for the three styles of urlencodings and anchorencode.
But the real issue in this bug report is in #language.
To avoid this bug in pages that attempt to detect whether a page is a translation or the source page of translations by checking the content of their last subpagename, I also performed many tests to make sure that
[[m:Template:Pagelang]] on Meta-Wiki and on MediaWiki-Wiki will now NEVER return any subpage name that:
- matches the full page (this is not a subpage of another base page, so it is not a translation produced by the Translate extension).
- is not idempotent through {{lc:{{PAGENAME|...}}}} (this excludes subpagenames containing capital letters and any characters forbidden or transformed in pagenames)
- contains any character that remains HTML-encoded after calling {{titleparts}} (these are the three characters ['"&])
- contains any other characters than [a-z0-9-.], i.e. the only characters that are idempotent in all encodings, including URL-encoding in its most restrictive style ("QUERY" style since MediaWiki 1.17).
- does not start with a letter (this can be tested by comparing "lc:" to "ucfirst:lc:", as they MUST be different, given that only ASCII letters are allowed)
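Taken together, the exclusion rules above amount to a small filter on the last subpage name. The sketch below is a Python illustration only (the real implementation is a wikitext template using parser functions); the function name is hypothetical, and a single regular expression stands in for the separate lc:, titleparts and ucfirst: tests.

```python
import re

# A lowercase ASCII letter followed by characters from [a-z0-9-.] only.
# This one pattern rejects capital letters, the characters ['"&],
# any other character, and names that do not start with a letter.
CODE_LIKE = re.compile(r"[a-z][a-z0-9.-]*")


def candidate_language_subpage(full_page_name):
    """Return the last subpage name if it may be a language code, else None."""
    base, slash, last = full_page_name.rpartition("/")
    if not slash:
        # Not a subpage of another base page, so not a translation.
        return None
    if CODE_LIKE.fullmatch(last) is None:
        return None
    return last
```

A name that passes this filter is only a candidate: it still has to be recognized as a language code afterwards.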
We could add other filters against some subpage names passing this test, such as "doc", "layout", "testcases" or "sandbox", which are used in templates. They are not valid BCP47 language codes, except "doc"; unfortunately, documentation subpages of templates on English or multilingual wikis use "/doc", but for now we have never encountered the need to translate to the language identified by this code.
We could also apply stricter rules (to make sure that they are also valid domain name labels, i.e. at most 63 ASCII characters, no double hyphens, no trailing hyphens, if we exclude IDNA labels from interlanguage prefixes).
This means that all codes will be lowercase only. Even if BCP47 codes are case-insensitive, this gives fewer false positives with accidental subpages whose names start with a capital ASCII letter, such as:
"User:Kennedy/Bob"
But the following page name will accidentally match Indonesian when "id" is a subtemplate returning a numeric id and not a translation of "Template:Page":
"Template:Page/id"
We can hope that users trying to use common templates on their user subpages will avoid naming them using sequences that could match valid language codes. These few pages could be moved/redirected if needed; here it could be renamed to:
"Template:Page/Id"
so that it will no longer match a language code detected by the rules above.
Also, independently of the language codes supported in MediaWiki and in the new Translate extension, there are still lots of legacy codes used in subpages that denote specific variants of languages. They don't always match the BCP47 rules, but at least they should only use ASCII lowercase letters, hyphens, and digits, and no spaces/underscores or quotation marks; the few existing pages depending on these codes could be reworked to change their codes to private codes conforming to BCP47 rules.
Version: 1.23.0
Severity: normal