Page MenuHomePhabricator

CommonsMetadata extension causes every page on commons to be always parsed twice
Closed, ResolvedPublic

Description

With over 2 million duplicate parses a day this is by far the leading cause of duplicate parses.

Happens on a regular page view when the content was not served from parser cache.
First parse: page view parse
Second parse: CommonsMetadata::onSkinAfterBottomScripts extracts metadata from a File, which runs LocalFile::getDescriptionText which parses the file page.

Solution 1: LocalFile::getDescriptionText should use ParserOutputAccess instead of making an uncached parse.
Solution 2: CommonsMetadata extension was developed before structured data on commons and it's documentation states that it will be replaced by it. So, actually replace and undeploy CommonsMetadata extension.

Ideally, both of this things should be done.

Event Timeline

SkinAfterBottomScripts is being ran on every page view regardless if it's cached or not. It should try to access the parsed version first

SkinAfterBottomScripts is being ran on every page view regardless if it's cached or not. It should try to access the parsed version first

At least core's getExtendedMetadataFromHook caches it for a rather short time.

Change 726735 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/core@master] Use ParserCache for local file description renders

https://gerrit.wikimedia.org/r/726735

Commons might support structured data now but the number of files actually using it is a fraction of those which use a template that can be parsed by CommonsMetadata. Also, CommonsMetadata supports ForeignAPIFile/ForeignDBFile while structured data doesn't. So I doubt it can be replaced anytime soon.

In any case, as noted in the description, the parsing happens in LocalFile::getDescriptionText() so it's not really related to CommonsMetadata. (Also note that for Commons files, which are ForeignDBFile objects on most Wikimedia wikis and ForeignAPIFile objects on some, getDescriptionText() will be a HTTP call to the Commons file page instead, with action=render, with the object cache in front of it.)

FormatMetadata::fetchExtendedMetadata() (which is the only way to invoke the CommonMetadata logic which calls File::getDescriptionText()) uses a 12 hour cache for foreign files and a 30 day cache for local files, so it's not particularly short.

Back when this code was written, LocalFile::getDescriptionText() called Content::getParserOutput() and these days it uses RevisionRenderer. In theory both of those should be using the parser cache, not sure where that goes wrong.

Commons might support structured data now but the number of files actually using it is a fraction of those which use a template that can be parsed by CommonsMetadata. Also, CommonsMetadata supports ForeignAPIFile/ForeignDBFile while structured data doesn't. So I doubt it can be replaced anytime soon.

Ok, thank you for explanation.

Back when this code was written, LocalFile::getDescriptionText() called Content::getParserOutput() and these days it uses RevisionRenderer. In theory both of those should be using the parser the cache, not sure where that goes wrong.

RenderedRevision is basically a wrapper over Content::getParserOutput, but it's 'below' parser cache level. See attached patch that makes the getDescriptionText use ParserCache.

Hm. The attached patch is trying to use ParserOutputAccess to reuse ParserCache, but that won't work - we only cache the parses with canonical options, and here we want non-canonical. The ONLY difference between canonical vs non-canonical options is limit report - that little HTML comment we embed in the ParserOutput with Parser performance data. In practice, it doesn't quite matter - File::getDescriptionText is not used a lot, so switching it to canonical options and including the html comment is probably no a big deal. A better solution would be to solve T269234 first and eliminate this difference between canonical and non-canonical options.

Change 726735 merged by jenkins-bot:

[mediawiki/core@master] Use ParserCache for local file description renders

https://gerrit.wikimedia.org/r/726735

[mediawiki/core@master] Use ParserCache for local file description renders

https://gerrit.wikimedia.org/r/726735

From the diff:

LocalFile.php
/**
 * This is not used by ImagePage for local files, since (among other things)
 * it skips the parser cache.
 */
public function getDescriptionText();

Is this still true? If so, does it mean that there often still two parses? (E.g. one for whatever ImagePage does to display content, and one for the cache-miss call from CommonsMetadata to getDescriptionText)

Pchelolo claimed this task.

Screen Shot 2021-11-19 at 7.08.55 AM.png (418×1 px, 52 KB)

This worked :)