make Parser::getTargetLanguage aware of multilingual wikis
Open, High, Public

Assigned To

None

Authored By

	daniel
	Oct 5 2015, 10:53 AM

Description

Context:
Some wikis need the ability to localize page output to the user's preferred language. A good example would be Wikimedia Commons, which uses the Translate extension to localize help and policy pages, and the {{int:...}} parser function to localize file description pages. These mechanisms also cause the parser cache to be split by user language.

Currently, core lacks a mechanism that would allow extensions to know in what language a page is currently being rendered. Parser::getTargetLanguage and ParserOptions::getTargetLanguage currently return the wiki's content language in nearly all cases, give no indication of whether the content is actually being localized, and by themselves have no impact on whether the parser cache gets split.

Proposal:

It should be possible for a wiki to specify that pages should be shown in the user language (could be done globally for all pages, or by namespace, or triggered by a magic word on the page). When rendering a page that is multilingual, ParserOptions::getTargetLanguage is set to the user language.
The parser cache, the {{int:...}} function, and other functionality that may depend on the user language, like formatting code for wikidata statements, would rely on ParserOptions::getTargetLanguage to tell them what language to use.
Parser::getTargetLanguage should return ParserOptions::getTargetLanguage unchanged. The logic currently in Parser::getTargetLanguage should be migrated to whatever code sets the target language in the options.
Possibly drop Title::getDisplayLanguage and ContentHandler::getDisplayLanguage completely, or move the functionality elsewhere (maybe into the Language object).
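
A minimal sketch of the proposal above, in Python rather than MediaWiki's PHP, just to make the data flow concrete. All names here (`ParserOptions`, `make_parser_options`, `parser_cache_key`, the `is_multilingual` flag) are hypothetical stand-ins and not the core API; how a page gets flagged as multilingual (global setting, namespace, magic word) is left open, as in the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParserOptions:
    # Hypothetical stand-in for the real ParserOptions class.
    target_language: str


def make_parser_options(content_lang: str, user_lang: str, is_multilingual: bool) -> ParserOptions:
    """Decide the target language once, where the options are built."""
    return ParserOptions(user_lang if is_multilingual else content_lang)


def parser_cache_key(page_id: int, options: ParserOptions) -> str:
    # The cache is keyed on the target language only. For pages that are not
    # multilingual this is always the content language, so there is no split.
    return f"{page_id}:lang={options.target_language}"
```

Under these assumptions the parser itself would simply return `options.target_language`, and anything that varies by language ({{int:...}}, statement formatting) would read the same value that the cache key is built from.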

Details

	Subject	Repo	Branch	Lines +/-
	[WIP] improve semantics of Parser::getTargetLanguage.	mediawiki/core	master	+203 -79


Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T308487 Article content (in the "content language") often has user-interface elements ("in the UX language") mixed in
		Open		None	T114640 make Parser::getTargetLanguage aware of multilingual wikis

Event Timeline


@daniel sounds right.

In T114640#1702325, @daniel wrote:

For pages that have sections in different languages, the page content language and the display language could differ in theory by section. This would probably be modeled best by having separate Content objects and separate Parser and ParserOptions objects for each such section, i.e. this would need some kind of composite content model.

I'm just suggesting that when we document this API we be explicit that it's the (display) language for the current section. Even if we don't actually support this yet in practice, let's define this API to Do The Right Thing when that happens.

daniel moved this task from P1: Define to Request IRC meeting on the TechCom-RFC board.Oct 14 2015, 8:32 PM

Does parsoid have a similar concept?

Restricted Application added a subscriber: StudiesWorld.Nov 4 2015, 9:57 PM

This has been scheduled for discussion on IRC #wikimedia-office on November 11, 22:00 UTC (2pm PST), see E89: RFC Meeting: Parser::getTargetLanguage / PageRecord (2015-11-11)

Nemo_bis subscribed.Nov 10 2015, 7:59 PM

@Spage no, Parsoid does not (yet). It will need to add such support when we implement <translate> and/or LanguageConverter support.

Copying some discussion from https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

I believe the title language support is for the LanguageConverter extension. They used to (ab)use the {{DISPLAYTITLE:title}} magic word in order to use the proper language variant, something like:

{{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

Then support was added to avoid the need for this hack, and just Do The Right Thing. I don't know the details, but presumably Title::getDisplayLanguage is part of it.

Then Brian Wolff wrote:

(As an aside, TOC really shouldn't split parser cache imo, and that's something I'd like to fix at some point [...])

Then you'll be interested in taking a look at T114057: Refactor table of contents.

daniel updated the task description. (Show Details)Nov 11 2015, 8:09 PM

Nikerabbit subscribed.Nov 11 2015, 8:34 PM

Some more points from the discussion at https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html:

Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.
Anomie notes that it's unclear how links should be tracked for renderings of different languages.
CScott notes that Title::getDisplayLanguage is probably used by the LanguageConverter extensions to avoid hacks like {{DISPLAYTITLE:-{en-us:Color; en-gb:Colour}-}}

There are three other ways that variant information can be specified, which shouldn't be broken:

Via Accept-Language header (I believe)
Via explicit URL parameter: https://zh.wikipedia.org/w/index.php?title=科学&variant=zh-tw&uselang=fr should result in the effective target language being zh-tw (for content language zh). It should not become fr, and not default to zh.
Also (for some wikis) explicitly in the URL, eg: https://zh.wikipedia.org/zh-cn/科学 sets the variant to zh-cn.
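
To make the expected outcome concrete, here is a rough Python sketch of how the three sources could be combined. The function name, the precedence order (query parameter, then URL path, then Accept-Language) and the variant set are assumptions made for illustration; the thread only states what the result should be, not the order. Note that `uselang` is deliberately not an input.

```python
# Illustrative subset of variants for content language "zh" (assumption).
ZH_VARIANTS = {"zh-cn", "zh-tw", "zh-hant"}


def effective_variant(content_lang, variants, query_variant=None,
                      path_variant=None, accept_language=()):
    """Return the effective target language for a page request.

    The first candidate that is a known variant of the content language
    wins; anything else (for example "fr") is skipped.
    """
    candidates = [query_variant, path_variant, *accept_language]
    for candidate in candidates:
        if candidate and candidate.lower() in variants:
            return candidate.lower()
    # No usable variant was requested: fall back to the content language.
    return content_lang
```

With this sketch, a request carrying `variant=zh-tw&uselang=fr` resolves to `zh-tw`, and a request for `/zh-cn/科学` would pass `path_variant="zh-cn"` and resolve to `zh-cn`.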

Could you describe how you would avoid cache and storage fragmentation

in RESTBase HTML storage,
in our CDN infrastructure?

daniel moved this task from Request IRC meeting to Old on the TechCom-RFC board.Nov 18 2015, 9:34 PM

daniel mentioned this in T119593: Define the list of "must have" sessions for WikiDev '16.Nov 25 2015, 9:42 PM

Extension:SIL can use the PageContentLanguage hook to override the language used to render the page.

Content language can also be altered by Special:PageLanguage in core and by the Translate extension, probably others.

Nemo_bis added a project: MediaWiki-Internationalization.Nov 26 2015, 3:03 PM

This was discussed at the RFC meeting on IRC on November 11. Minutes from https://tools.wmflabs.org/meetbot/wikimedia-office/2015/wikimedia-office.2015-11-11-21.59.html:

$wgContLang and ParserOptions::getUserLangObj() would be pretty much unused (DanielK_WMDE, 22:13:27)
for variants Ex:Translate Ex:Wikibase: instead of getting the desired output language from global state, it should be possible to get it from Parser and/or ParserOptions (DanielK_WMDE, 22:15:52)
<aude> afaik, there is inconsistency when {{int}} is used in how the cache is split vs. getTargetLanguage (DanielK_WMDE, 22:15:53)
use page language for localized parser function names, etc (DanielK_WMDE, 22:17:12)
IDEA: need to decide whether to deprecate $wgContLang and ParserOptions::getUserLangObj() (robla, 22:18:37)
https://lists.wikimedia.org/pipermail/wikitech-l/2015-November/083932.html (DanielK_WMDE, 22:18:38)
<TimStarling> link trails and prefixes fall into the same category (DanielK_WMDE, 22:18:57)
transcluding pages into pages with a different page language could cause confusion wrt parser function names, etc (DanielK_WMDE, 22:24:07)
IDEA: distinguish between requested and effective target language. use *effective* target language when calling parser functions. (DanielK_WMDE, 22:25:56)
for Ex:Translate, Foo would be user language, but Foo/de would be page language, which would not be content language, but overwritten by the suffix. (DanielK_WMDE, 22:35:14)
A request like https://zh.wikipedia.org/w/index.php?title=科学&variant=zh-tw&uselang=fr should result in the effective target language being zh-tw (for content language zh). It should not become fr, and not default to zh. (DanielK_WMDE, 22:39:48)
https://zh.wikipedia.org/zh-cn/科学 is another way of writing an explicit 'variant=zh-cn' parameter, and should also be supported on zhwiki (cscott, 22:48:13)

Personal take-away from the discussion:

Agreement that we should not use global state to determine the output language when generating HTML. So it has to come from Content(Handler) and ParserOptions somehow.
Agreement that this is a good idea in general, but we should be careful not to break things. Nobody is sure what might break. Translate, ContentTranslation, and Variants are prime candidates for breakage, but no immediate issue was identified.
Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.
- Variant selection is currently not reflected by $wgLang/RequestContext::getLanguage(), but probably should be.
We must keep apart the following:
- the wiki's content language (site-wide default)
- the user's interface language (possibly overwritten via uselang)
- a page's content language (the language a page is written in - typically, but not always, the content language)
- the desired display language for a page request (typically, but not always, the user language)
- the effective display language of the page
ParserOptions::getUserLangObj() should probably go away

From this follows:

ParserOptions::getTargetLanguage() should return the desired output language. The desired language will usually be the wiki's content language, unless a page is considered "multilingual".
A page is considered "multilingual" either by virtue of its content model, or by per-namespace configuration (e.g. the File namespace on Commons)
Parser::getTargetLanguage() should return the effective output language.
ContentHandler::getPageLanguage() should return the page's content language.
ContentHandler::getPageViewLanguage() should be repurposed to determine the effective target language based on the page's content language and the desired target language.
How page content language and desired target language are interpolated to form the effective target language depends on the content model:
- for Wikibase entity pages, the content language is irrelevant, so the desired language would be the effective target language.
- for regular wikitext pages, the content language is dominant, but automatic translation/transliteration can be applied to get closer to the desired output language, if such a translation is supported.
- for multilingual wikitext pages (e.g. on commons), the content language is ignored, so the desired language would be the effective target language.
- for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.
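
The four cases above can be written down as a small decision function. This is only a Python sketch of the rules as listed; the model names and the `convertible` set of (page language, desired language) pairs are hypothetical, and a real implementation would live in the respective ContentHandler subclasses.

```python
def effective_language(model, page_lang, desired_lang, convertible=frozenset()):
    """Combine page content language and desired language per content model."""
    if model in ("wikibase-entity", "multilingual-wikitext"):
        # Content language is irrelevant or ignored: the desired language wins.
        return desired_lang
    if model == "system-message":
        # Language comes from the title suffix; the desired language is ignored.
        return page_lang
    if model == "wikitext":
        # Content language dominates. Move to the desired language only if an
        # automatic translation/transliteration between the two is supported.
        if (page_lang, desired_lang) in convertible:
            return desired_lang
        return page_lang
    raise ValueError(f"unknown content model: {model}")
```

For example, a zh wikitext page requested in zh-tw yields zh-tw only when `("zh", "zh-tw")` is in `convertible`; requested in fr it stays zh.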

Next steps: write code that a) sets ParserOptions::getTargetLanguage and b) makes Parser::getTargetLanguage use ContentHandler to determine the effective language.

Variant selection is done with the variant parameter, not via uselang. That should probably be consolidated.

Why? Variants are about content language, not interface language.

the desired display language for a page request (typically, but not always, the user language)

This is also handled by $wgTranslatePageTranslationULS and compact interwikis, see there for todos.

ParserOptions::getTargetLanguage() should return the desired output language.

Sounds error prone, see above.

for system messages in the MediaWiki namespace, the content language is defined by the title suffix, and the desired target language is ignored.

I don't see how this is different from the general case. It's the same in Translate, where each translation page, stored at a language code subpage, has a content language equal to said language code. Content model doesn't seem to be the matter.

The Wikidata team decided to work on this until the end of the year. We'll try to propose core patches, and discuss them at the summit.

daniel mentioned this in T119032: WikiDev 16 working area: Software engineering.Dec 16 2015, 6:04 PM

@daniel -- do you have any slides or materials you want to present for this at the session tomorrow?

Smalyshev subscribed.Jan 5 2016, 8:45 PM

Strawman proposal, based on conversations at the session and after:

Have the PHP parser track the language specified by <... lang="xxx"> tags in the source, and return the appropriate language code from getTargetLanguage. (If it is too hard to parse these, we can introduce some easier-to-parse form like {{#lang:foo}} but I'd prefer to use the markup which is already present in our content if we can.)
Define a nonstandard language code to be used for "the current user interface language". Something like "x-ui", constructed so that it will never conflict with a valid HTML5 language code. This language code would be replaced on the fly in the parser with the appropriate language.
Either change {{int:...}} to respect the current target language of the parser and localize the message returned, or if that breaks too much existing stuff, then introduce a new parser function (say, {{intx:...}}) which does so. (If we don't change {{int}}, then {{int}} should probably expand to <span lang="x-ui">{{intx:...}}</span> for consistency, and so that the output HTML is appropriately marked up.)

I *think* that is a complete solution to the problem. It may not be the best solution. It may not even be a complete solution. Help me improve/replace it.

Things I like about this: the <span lang="foo"> markup needed to make this work ends up in the resulting HTML. Various languages depend on proper language tagging of the output to (for example) select the proper shaper to use to render arabic script, or to perform word breaking correctly. So tagging the output is a good idea, and doing so makes message localization "Just Work".

@tstarling It seems to me like step (1) above is possible, but you understand the PHP parser better than I do. Can you see any showstoppers? The only thing that worries me is out-of-order parsing, which would make it impossible to use a simple stack to maintain the current target language as we parse.

@tstarling and @daniel think that it would be too hard to properly parse html-style tags in the preprocessor (in particular, finding the "matching" close tag so that the language stack is properly maintained).

So let's adjust the strawman to use {{#lang:foo|....content....}} for now, which should expand to <span lang="foo">...parsed content...</span> in addition to setting the current parser target language. We might need to tweak this a little more to allow generating <div lang="...">...</div> as well, sigh. Please help me paint that bikeshed.
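
As a rough illustration of the expansion described here, a Python sketch follows. `expand_lang` and its `block` flag are hypothetical names; the content is assumed to be already-parsed HTML, and the "x-ui" placeholder follows the strawman above. The part about setting the parser's target language while the content is parsed is not modeled.

```python
from html import escape

UI_PLACEHOLDER = "x-ui"  # nonstandard code meaning "current user interface language"


def expand_lang(code: str, parsed_content: str, ui_lang: str, block: bool = False) -> str:
    """Wrap already-parsed content in an element carrying the right lang attribute."""
    # The placeholder is replaced on the fly with the actual UI language.
    lang = ui_lang if code == UI_PLACEHOLDER else code
    # span by default; div for block-level content (the bikeshed mentioned above).
    tag = "div" if block else "span"
    return f'<{tag} lang="{escape(lang, quote=True)}">{parsed_content}</{tag}>'
```

So `expand_lang("de", "Hallo", ui_lang="fr")` gives `<span lang="de">Hallo</span>`, while passing `"x-ui"` as the code produces `lang="fr"`.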

T114432: [RFC] Heredoc arguments for templates (aka "hygienic" or "long" arguments) could eventually be used to make it easier to include ...content... without having to deal with template-argument-escaping the content.

Tobi_WMDE_SW added a project: Wikidata-Sprint-2016-01-19.Jan 19 2016, 4:26 PM

Related notes from the developer summit: T119022#1916790

Rather than {{#lang}}, what about using {{#tag:span|content|lang=de}} / {{#tag:span|content|lang=x-ui}}?

Benefit: no new parser functions added, just some extra code added to #tag processing to maintain the language stack.

Disadvantage: probably not as clear to read as a dedicated {{#lang}} tag; also content comes first and the language code comes last, which may be unexpected.

I wasn't aware of T69223… has that been considered here and also for multilingual pages?

• Purodha subscribed.Mar 6 2016, 4:10 PM

daniel mentioned this in T122942: RFC: Support language variants in the REST API.Mar 23 2016, 8:25 PM

Bianjiang subscribed.Mar 24 2016, 1:28 AM

daniel mentioned this in T130567: WIP RFC: Hygienic transclusions for WYSIWYG, incremental parsing & composition: Options and trade-offs.Apr 13 2016, 3:35 PM

I've been knee-deep in the PHP parser recently, so I might have a better handle on how to implement some of the ideas I presented above. Would it be worth prototyping the {{#lang}} or {{#tag:span}} options above, to see if they can actually work? Often I learn new things about the problem domain by trying to actually implement something.

Danny_B added a project: Proposal.May 2 2016, 10:15 PM

In my opinion it is worth trying, but only you know whether it would take time away from other important work.

• RobLa-WMF mentioned this in Unknown Object (Event).May 4 2016, 7:33 PM

• RobLa-WMF mentioned this in E187: RFC Meeting: triage meeting (2016-05-25, #wikimedia-office).May 25 2016, 7:03 AM

Scott_WUaS updated the task description. (Show Details)May 25 2016, 9:44 PM

Scott_WUaS subscribed.

Belated priority update discussed in E187: RFC Meeting: triage meeting (2016-05-25, #wikimedia-office) (see log at P3179)

• DannyH mentioned this in T121731: Investigation: Assistance with structured data on Commons.Jun 13 2016, 10:49 PM

daniel added a subscriber: siebrand.Jun 22 2016, 1:06 PM

Change 295549 had a related patch set uploaded (by Daniel Kinzler):
[WIP] improve semantics of Parser::getTargetLanguage.

https://gerrit.wikimedia.org/r/295549

gerritbot added a project: Patch-For-Review.Jun 22 2016, 4:27 PM

Recapping some old discussion in E168: RFC Meeting: Support language variants in the REST API (2016-04-27, #wikimedia-office), there's the question of whether the "target language" and "user interface language" need to be distinct and/or specified separately. My strawman example is a user on zhwiki who has a target variant set to zh-hant but has the UX language (image metadata labels, {{int}} output, page UI) set to, say, de.

cscott mentioned this in T101666: Create parser tag(s) that render OOUI PHP widgets.Jul 20 2016, 9:44 PM

daniel added a project: User-Daniel.Dec 6 2016, 6:16 PM

daniel mentioned this in E89: RFC Meeting: Parser::getTargetLanguage / PageRecord (2015-11-11).Dec 9 2016, 7:46 AM

cscott mentioned this in T153761: Incorrect parser output if -{{ appears in wikitext.Jan 3 2017, 10:05 PM

daniel moved this task from Inbox to To Do on the User-Daniel board.Jan 5 2017, 7:02 PM

Liuxinyu970226 subscribed.Jan 19 2017, 7:44 AM

Krinkle removed a project: Proposal.Dec 21 2017, 11:38 PM

daniel mentioned this in T194263: Deprecate/Remove ContentHandler::makeParserOptions().Jul 11 2018, 6:23 PM

Tgr mentioned this in T206101: Sort out how page/rendering language related properties are used.Oct 3 2018, 6:07 AM

cscott mentioned this in T202481: Parser should have a msg() helper function so people don't localize messages improperly.Oct 29 2018, 9:50 PM

For the last decade or so Wikimedia Commons has relied on the MediaWiki:Lang message to fetch the user's preferred language. This message can also be found on many other multilingual projects, like Wikidata, Meta-wiki, Wikispecies, MediaWiki, foundation, etc. The current standard interface is {{int:Lang}} in templates and frame:callParserFunction( "int", "lang" ) in Lua. From the perspective of someone writing templates and Lua code that use this mechanism, I do not see much need to change the current interface. Perhaps it would be nicer for all the wikis to use the same mechanism without a need to set it up separately on each wiki, with a subpage for each language, but I would prefer to stick to the current interface ({{int:Lang}}) as changing it would trigger a need to update a lot of templates and modules on a lot of wikis.

@Jarekt the proposal is not to remove {{int:lang}}, it's about how {{int:lang}} and similar things work internally.

Dropping from RFC board, since no RFC is needed.

Aklapper edited projects, added Patch-Needs-Improvement; removed Patch-For-Review.Aug 10 2020, 5:13 AM

cscott mentioned this in T4085: Add a {{USERLANGUAGE}} magic word.Aug 26 2020, 6:02 PM

daniel removed daniel as the assignee of this task.Oct 27 2020, 3:53 PM

Winston_Sung subscribed.Oct 28 2021, 7:35 AM

cscott mentioned this in T272943: Make InputBox extension compatible with Parsoid.Dec 13 2021, 10:30 PM

cscott added a parent task: T308487: Article content (in the "content language") often has user-interface elements ("in the UX language") mixed in.May 16 2022, 7:49 PM

cscott mentioned this in T269492: Selecting user language in the REST API.Aug 11 2022, 2:26 PM

cscott mentioned this in T318860: Deprecate and remove Parser::getFunctionLang().Sep 28 2022, 7:52 PM

daniel mentioned this in T341244: ParserOptions and Title::getPageViewLanguage may disagree on the lang/dir.Jul 7 2023, 8:41 AM

One strawdog proposal is something like:

{{#wrapLang|<new-lang-code>|<content>}}

(which improves with heredocs, T114432)
which, in addition to properly setting the lang and dir tags on a <div> or <span> wrapper around the content, would also reset the Parser::getTargetLanguage() when parsing the content.

Using the special string user for the <new-lang-code> would set the target language to the user's UX language (whatever that is).

To go a step further, a new parser function named something like {{#int2|<msg name>}} (sorry, please bikeshed the name) would return the given message in the current *target language*. Then you can redefine {{int}} to be equivalent to {{#wrapLang|user|{{#int2|<msg name>}}}}.

That provides a more-or-less consistent definition of Parser::getTargetLanguage() as "the language we are right now emitting parsed content into", and we can actively discourage content/extensions from emitting content in a different language without doing the appropriate steps to set the parser target language appropriately.
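
A small Python sketch of that definition, assuming the target language is kept on a stack that the wrapper pushes to and pops from. `SketchParser`, `MESSAGES` and the method names are made up for illustration; the HTML wrapper with lang and dir is left out so that only the language bookkeeping is shown.

```python
from contextlib import contextmanager

# Hypothetical message store: (message key, language code) -> text.
MESSAGES = {("edit", "en"): "Edit", ("edit", "de"): "Bearbeiten"}


class SketchParser:
    def __init__(self, content_lang: str, user_lang: str):
        self.user_lang = user_lang
        self._lang_stack = [content_lang]

    def get_target_language(self) -> str:
        # "The language we are right now emitting parsed content into."
        return self._lang_stack[-1]

    @contextmanager
    def wrap_lang(self, code: str):
        # The special string "user" selects the user's UX language.
        self._lang_stack.append(self.user_lang if code == "user" else code)
        try:
            yield
        finally:
            self._lang_stack.pop()

    def int2(self, key: str) -> str:
        # Message in the *current target language*.
        return MESSAGES[(key, self.get_target_language())]

    def int_(self, key: str) -> str:
        # {{int}} redefined as wrapLang("user") around int2.
        with self.wrap_lang("user"):
            return self.int2(key)
```

With `SketchParser("en", "de")`, `int2("edit")` returns "Edit" at the top level, `int_("edit")` returns "Bearbeiten", and the target language is back to "en" afterwards.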

Ladsgroup subscribed.Aug 1 2023, 3:10 PM

Ladsgroup mentioned this in T343131: Commons database is growing way too fast.Aug 9 2023, 9:30 AM

cscott mentioned this in T299369: Consider removing global $userLang from onPageContentLanguage hook.Nov 14 2023, 7:03 PM

Diana_B1ack subscribed.Nov 15 2023, 5:12 PM

RP88 subscribed.Jan 2 2024, 10:48 PM

cscott mentioned this in T373257: Pick wikitext syntax for wikifunction calls.Sep 26 2024, 7:44 PM

cscott mentioned this in T313748: Allow translatable templates to be shown in the user interface language.Oct 10 2024, 2:17 PM

In T114640#1918177, @cscott wrote:

If we don't change {{int}}, then {{int}} should probably expand to <span lang="x-ui">{{intx:...}}</span> for consistency, and so that the output HTML is appropriately marked up.

It shouldn’t: for example, {{int:lang}}, mentioned above by @Jarekt, should not emit any HTML – it’s used within HTML attributes, so extra markup would break things badly.

In T114640#9086338, @cscott wrote:
One strawdog proposal is something like:
{{#wrapLang|<new-lang-code>|<content>}}
(which improves with heredocs, T114432)
which, in addition to properly setting the lang and dir tags on a <div> or <span> wrapper around the content, would also reset the Parser::getTargetLanguage() when parsing the content.

I think there should be a version that doesn’t emit any HTML: for example, if the outermost element of a template is a table, built using wikitext syntax, you cannot just replace it with the parser function. In this case, a clean solution would be wrapping the table in a parser function that only sets the parser language (using heredoc for the parser function parameter to avoid having to escape the pipes of the table syntax), and manually setting the language on the table itself.

Even for less complex cases, being able to set the element (mostly span or div, but really anything) and attributes (class and style quite often, but also others) is important. Those could fit in the parser function syntax, but they don’t happen automatically.

Maybe it should be a parser tag rather than a parser function? While parser tags usually return content that’s transparent to templates processing the result, they can return [0 => $content, 'markerType' => 'none'] to avoid this behavior. Using a parser tag would:

Avoid the issues heredoc tries to address. As long as no </wraplang> appears in the content, it’s safe to put whatever content we want into it, including pipes and curly braces.
Allow specifying arbitrary attributes using a natural syntax.
Like parser functions and unlike regular HTML, still be processed early (templates processing the result would see the content already parsed in the right context) and not be implicitly closed (so we don’t have to worry about the content running to the end of the page because the editor forgot to close the tag).

So my proposal would be

<wraplang tag="div" lang="als" class="my-cool-class" style="float:right">
Alemannisch <!-- or toskërishtja? should `lang` be interpreted as a MediaWiki or as a BCP-47 language code? -->
</wraplang>

About the attributes,

lang should be required (a language code – it needs to be decided whether MediaWiki or BCP-47);
tag should be optional (an HTML element name allowed by MediaWiki) – if not specified, no HTML should be output (for the table example above);
dir should probably be forbidden (the parser tag handles it automatically);
all other attributes should be
- allowed, optional, and simply forwarded to the resulting HTML tag if tag is specified;
- forbidden if tag isn’t specified.
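
The attribute rules above could be checked roughly like this. It is a Python sketch only: `check_wraplang_attrs` is a hypothetical helper, and `ALLOWED_TAGS` is a tiny stand-in for whatever set of HTML element names MediaWiki actually allows.

```python
ALLOWED_TAGS = {"div", "span"}  # assumption: placeholder for MediaWiki's allowed elements


def check_wraplang_attrs(attrs: dict) -> list:
    """Return a list of rule violations for a <wraplang> tag's attributes."""
    errors = []
    if not attrs.get("lang"):
        errors.append("lang is required")
    if "dir" in attrs:
        errors.append("dir is forbidden (set automatically)")
    tag = attrs.get("tag")
    if tag is not None and tag not in ALLOWED_TAGS:
        errors.append(f"tag not allowed: {tag}")
    # Everything besides lang/tag/dir is forwarded, which needs an output tag.
    extra = sorted(set(attrs) - {"lang", "tag", "dir"})
    if tag is None and extra:
        errors.append("attributes require tag: " + ", ".join(extra))
    return errors
```

The example from the proposal (`tag="div" lang="als" class=... style=...`) passes with no errors, while `lang="als" class="x"` without `tag` is rejected because there is no element to carry `class`.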

cscott mentioned this in T377726: Link trail does not respect page content language.Oct 31 2024, 2:54 PM
