
Reduce the number of resource_change and resource_purge events emitted due to template changes
Open, Needs Triage, Public

Description

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times. These spikes are presumably caused by changes to highly used templates. We should investigate strategies to mitigate this effect.

These events are used to purge cached copies of rendered page content from various caches (Varnish, PCS/Cassandra, etc). This is needed to avoid users seeing vandalized content even after a malicious change to a template has been reverted.

Context

When a template is changed, we schedule several kinds of jobs, two of which recursively iterate over batches of pages that use the template in question: first HTMLCacheUpdateJob and RefreshLinksJob, then later CdnPurgeJob.

  • RefreshLinksJob is responsible for updating derived data in the database, in particular any entries in the link tables (pagelinks, templatelinks, etc) associated with the affected page. This is done by re-parsing the page content (using the new version of the template). Note that the rendered output is currently not cached in the ParserCache, because we want the cache to be populated based on organic page view access patterns. But that could change in the future.
  • HTMLCacheUpdateJob is responsible for invalidating the ParserCache and also for purging cached copies of the output from the CDN (Varnish) layer. It updates the page_touched field in the database and causes a CdnPurgeJob to be scheduled for any URLs affected by the change.
  • CdnPurgeJob uses the EventRelayerGroup service to notify any interested parties that URLs need purging. In WMF production, this triggers CdnPurgeEventRelayer, which sends resource_change to the changeprop service, which then emits resource_purge events to the CDN layer. It also sends no-cache requests to services that manage their own caches, like RESTbase and PCS.

Diagram: https://miro.com/app/board/uXjVKI3NmLw=/
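
To make the fan-out concrete, here is a minimal Python sketch of the cascade described above. The real jobs are PHP classes in MediaWiki core; the batch size, URL patterns, and all helpers here are illustrative assumptions, not the actual implementation:

```python
"""Illustrative sketch (not MediaWiki's actual PHP job classes) of the
fan-out described above. All helpers and constants are assumptions."""

from typing import Iterator, List

BATCH_SIZE = 300  # assumed batch size for the recursive jobs


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    for i in range(0, len(items), size):
        yield items[i:i + size]


def urls_for(page: str) -> List[str]:
    # The URL variants per page listed later in this task (illustrative).
    return [
        f"https://en.wikipedia.org/wiki/{page}",
        f"https://en.wikipedia.org/w/index.php?title={page}&action=history",
        f"https://en.m.wikipedia.org/wiki/{page}",
        f"https://en.wikipedia.org/api/rest_v1/page/html/{page}",
    ]


def fan_out(pages_using_template: List[str]) -> List[str]:
    """One template edit -> HTMLCacheUpdateJob batches -> CdnPurgeJob URLs."""
    purge_urls: List[str] = []
    for batch in chunks(pages_using_template, BATCH_SIZE):
        # RefreshLinksJob re-parses each page (derived data, link tables);
        # HTMLCacheUpdateJob bumps page_touched and schedules CdnPurgeJob,
        # which emits one resource_change per URL:
        for page in batch:
            purge_urls.extend(urls_for(page))
    return purge_urls


# A template transcluded on 1M pages yields ~4M purge events; that is
# where the 10k req/sec spikes come from.
print(len(fan_out([f"Page_{i}" for i in range(1000)])))  # -> 4000
```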

resource_change and resource_purge are very spiky:

(Two screenshots, grafik.png: Grafana graphs of the spiky resource_change and resource_purge event rates.)

See https://grafana.wikimedia.org/goto/1d9qy64HR?orgId=1 and https://grafana.wikimedia.org/goto/94rZ86VHg?orgId=1

Ideas

  • Generally rely on natural expiry of caches, which should happen after one day. Only trigger recursive purges in certain cases (a sketch of this gating logic follows the list):
    • if the template isn't used on too many pages (maybe 100 or so)
    • when an admin explicitly requests a recursive purge (could be a button on the purge page, or a popup after a revert)
    • after rollback/undo (unconditionally or optionally)
    • if the template is unprotected (protected templates are unlikely to be vandalized).
  • Avoid purges if the generated output of a given page didn't actually change as a result of the template change.
    • Leave it to RefreshLinksJob to decide if a CdnPurgeJob is needed for a given page.
    • This would delay purging quite a bit, since RefreshLinksJob is slow.
  • Avoid purges based on traffic:
    • only purge pages that were requested in the last 24h (using a join in Flink)
    • only purge pages that are in the top n-th percentile of page views.
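
A rough sketch of what the gating described above could look like. The thresholds and all field names (usage_count, is_protected, views_last_24h) are hypothetical placeholders, not proposals:

```python
"""Hypothetical gating logic combining the ideas above; thresholds and
field names are assumptions."""

from dataclasses import dataclass
from typing import List

USAGE_THRESHOLD = 100  # "isn't used on too many pages (maybe 100 or so)"
MIN_DAILY_VIEWS = 1    # "only purge pages requested in the last 24h"


@dataclass
class Template:
    usage_count: int
    is_protected: bool


@dataclass
class Page:
    title: str
    views_last_24h: int


def should_purge_recursively(template: Template, *, admin_requested: bool,
                             is_revert: bool) -> bool:
    if admin_requested or is_revert:
        return True  # explicit admin request, or after rollback/undo
    if template.usage_count <= USAGE_THRESHOLD:
        return True  # small enough that a recursive purge is cheap
    # Debated heuristic (see comments below): purge only for unprotected
    # templates, since protected ones are unlikely to be vandalized;
    # otherwise rely on natural (~1 day) cache expiry.
    return not template.is_protected


def pages_worth_purging(pages: List[Page]) -> List[Page]:
    # Traffic-based variant: only purge pages viewed in the last 24h.
    return [p for p in pages if p.views_last_24h >= MIN_DAILY_VIEWS]
```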

Event Timeline

Change #1053907 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] RefreshLinksJob: collect stats on redundant parses

https://gerrit.wikimedia.org/r/1053907

Umherirrender renamed this task from Redunce the number of resource_change and resource_purge events emitted due to template changes to Reduce the number of resource_change and resource_purge events emitted due to template changes. Jul 12 2024, 5:19 PM

"if the template is unprotected (protected templates are unlikely to be vandalized)."

But if a protected template is vandalized, the vandalism has a much broader reach.

The English Wikipedia has had LTAs making 500 edits and waiting a month just to vandalize protected templates, for example https://en.wikipedia.org/wiki/Special:Contributions/CheezDeez32. Even with the recursive purges, that tends to generate complaints about the vandalism hours after it was reverted, like https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)/Archive_212#Strange_bug_on_Flag_of_Russia_article. But I guess if the initial edit didn't purge things, then the revert won't need to either.

when an admin explicitly requests a recursive purge (could be a button on the purge page, or a popup after a revert)

I suspect most vandalism patrollers will have no idea what they are being asked to answer.


I think the idea is probably fine, but you should at least think about how to handle cleanup after cases like the one I linked.

Another option is to subdivide pages into two categories, "high traffic pages" and "long tail low traffic pages". The latter would be put effectively into a no-cache state: the cache lifetime would be very short, and we would never emit purges for them, relying on the natural expiration to deal with vandalism. We'd only emit purges for the high traffic pages.

This is similar to the last item in the "Ideas" section, except that pages are assigned to a category statically (instead of dynamically), and a no-cache (or limited-lifetime-cache) header is added for the "low traffic pages". Newly-created pages would probably default to the "high traffic"/"precise purging" category, and we could run a job "every so often" to reassign the categories, emitting a purge for any page moved from "high traffic" to "low traffic". Nothing needs to be done for a page moved from low traffic to high.
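
A minimal sketch of such a reassignment job, assuming a hypothetical weekly pageview count per page and a made-up cutoff:

```python
"""Sketch of the periodic reassignment job described above; the pageview
threshold and field names are assumptions."""

from dataclasses import dataclass
from typing import List

HIGH_TRAFFIC_MIN_VIEWS = 1000  # assumed weekly pageview cutoff


@dataclass
class PageState:
    title: str
    weekly_views: int
    high_traffic: bool = True  # new pages default to "precise purging"


def reassign(pages: List[PageState]) -> List[str]:
    """Return titles to purge: only high -> low transitions need one."""
    to_purge: List[str] = []
    for page in pages:
        now_high = page.weekly_views >= HIGH_TRAFFIC_MIN_VIEWS
        if page.high_traffic and not now_high:
            # Entering the no-cache / short-TTL tier: purge once so the
            # existing long-lived cached copy doesn't linger unpurged.
            to_purge.append(page.title)
        page.high_traffic = now_high  # low -> high needs no purge
    return to_purge
```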

The goal is to increase the amount of /useful work/ done by the purge traffic: increasing the chances that the thing being purged is (a) actually in a cache, and (b) will be viewed from the cache before cache expiration.

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times

I'm curious about the problem that this causes. Too many jobs inserted for the job queue to handle quickly enough? Too many purge requests at once?

For completeness, another option is the Varnish "x-key" system, which involves two research projects. One is that the implementation of x-key in Varnish appears to be incomplete; the second is that the assignment of appropriate x-keys to URLs is non-trivial as well. There are too many templates used on a page like [[Barack Obama]] to naively assign one x-key to every recursively-included template, so we still need to come up with a mechanism to determine which of the templates deserve an x-key assigned, likely based on purge statistics.

Change #1053907 merged by jenkins-bot:

[mediawiki/core@master] RefreshLinksJob: collect stats on redundant parses

https://gerrit.wikimedia.org/r/1053907

The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times

I'm curious about the problem that this causes. Too many jobs inserted for the job queue to handle quickly enough? Too many purge requests at once?

In the case of RESTBase (and the API server cluster): outages, or at least incidents. One such incident is documented here: https://wikitech.wikimedia.org/wiki/Incidents/2022-03-27_api. It's one of the reasons I am happy that we are removing RESTBase from the infrastructure.

Looking at metrics from LinksUpdate, it seems that we could reduce the number of purges a lot if we wait until after the re-parse to decide whether we need to purge or not.

(Screenshot grafik.png: LinksUpdate metrics.)

On average, we see:

  • 33 times per second, we find a new rendering already cached. This is probably mostly from direct edits. We'd still want to trigger purges on direct edits, immediately.
  • 16 times per second, the re-parse generates the exact same HTML as before. We could skip the purge in this case.
  • 4 times per second, we find that the HTML actually changed. So we'd have to purge.
  • 61 times per second, we don't find anything in the parser cache. There is probably nothing in the edge caches in that case, but we'd still have to notify persistent caches (RESTbase/PCS).

So, of the 114 purges per second, we could:

  • skip 16 entirely
  • send 61 only to services, not Varnish/ATS

Note that this decouples the reparsing/ParserCache invalidation done by RefreshLinksJob (which is mandatory and must be done) from the front-end cache (CDN/PCS) invalidation (which is optional and may be skipped).
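
A sketch of that post-reparse decision, assuming we compare a hash of the cached rendering against the fresh re-parse; the action names and the edge/services split are illustrative, not current job logic:

```python
"""Sketch of the post-reparse purge decision described above."""

import hashlib
from enum import Enum, auto
from typing import Optional


class PurgeAction(Enum):
    FULL = auto()           # purge CDN (Varnish/ATS) and services
    SERVICES_ONLY = auto()  # notify RESTbase/PCS, skip the edge
    SKIP = auto()


def decide_purge(cached_html: Optional[str], new_html: str) -> PurgeAction:
    # (The ~33/sec "rendering already cached" case comes from direct
    # edits and is purged immediately, before this decision is reached.)
    if cached_html is None:
        # ~61/sec: nothing in the ParserCache, so an edge copy is
        # unlikely, but persistent caches still need a notification.
        return PurgeAction.SERVICES_ONLY
    old_hash = hashlib.sha256(cached_html.encode()).digest()
    new_hash = hashlib.sha256(new_html.encode()).digest()
    if old_hash == new_hash:
        # ~16/sec: re-parse produced identical HTML; skip entirely.
        return PurgeAction.SKIP
    # ~4/sec: the HTML actually changed; purge edge and services.
    return PurgeAction.FULL
```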

One option may be to add a checkbox to allow admins to purge the CDN for templates (maybe titled something like "make effective immediately"); for non-admins the CDN purge would (always? usually?) be skipped.

This might be a reasonable pair of hypotheses for annual planning:

  1. Instrument edits to identify those with the highest rank (edit frequency * usage count * % no-op edits) and analyze them to determine the edit reason and whether website scalability could be improved by omitting CDN purges for these edits. (A toy version of this rank metric is sketched below.)

and then if this identified a potential benefit, something like:

  1. By exposing an editor option to bypass CDN purges, improve the scalability of our sites.
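
A toy version of the proposed rank metric; the parameter names stand in for instrumentation that doesn't exist yet:

```python
"""Toy version of the rank metric proposed above; parameters are
assumptions standing in for future instrumentation."""

def purge_waste_rank(edits_per_day: float, usage_count: int,
                     noop_edit_fraction: float) -> float:
    # "edit frequency * usage count * % no-op edits": frequently edited,
    # heavily transcluded templates whose edits rarely change rendered
    # output generate the most redundant purge traffic.
    return edits_per_day * usage_count * noop_edit_fraction


# e.g. a navbox edited twice a day, used on 200k pages, 90% no-op edits:
print(purge_waste_rank(2.0, 200_000, 0.9))  # -> 360000.0
```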

For completeness, another option is the Varnish "x-key" system, which involves two research projects. One is that the implementation of x-key in Varnish appears to be incomplete; the second is that the assignment of appropriate x-keys to URLs is non-trivial as well. There are too many templates used on a page like [[Barack Obama]] to naively assign one x-key to every recursively-included template, so we still need to come up with a mechanism to determine which of the templates deserve an x-key assigned, likely based on purge statistics.

I doubt that assigning x-keys to templates would work well - you might risk having to purge 100k URLs in one request, which would probably cause some perf issues on the caching layer. And keep in mind, we definitely need to first invalidate the cache in the backend for all those pages, *then* purge the edge - so keeping the logic as it is today is probably necessary.

Tagging pages in cache with just a page-specific x-key would be very useful for reducing the volume of our purges, and would allow more services to use the cache without growing the number of purges we have to send.

Right now, we send purges for each edit as follows:

  • the main on-wiki url
  • the main on-wiki url with action=history
  • the mobile on-wiki url
  • the restbase url for page/html

and maybe more. So just implementing x-key for pages would immediately reduce our purge rate by a factor of 4 or more - while unlocking the possibility of adding more URLs derived from an article to the persistent cache.
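
For illustration, a sketch of per-page surrogate-key purging, assuming a Varnish setup with the xkey vmod and a VCL hook that purges on an `xkey` request header; the endpoint, header name, and key scheme are made up, not current WMF configuration:

```python
"""Sketch of purge-by-tag, assuming Varnish + xkey vmod with a VCL hook
that calls xkey.purge() on an `xkey` request header. All names here are
hypothetical."""

import requests

CACHE_ENDPOINT = "http://cache-frontend.example"  # hypothetical


def tag_response(headers: dict, page_id: int) -> dict:
    # At response time, tag every URL variant of a page with one key.
    headers["xkey"] = f"page-{page_id}"
    return headers


def purge_page(page_id: int) -> None:
    # One request invalidates all tagged variants (desktop, mobile,
    # action=history, REST HTML) instead of one PURGE per URL.
    requests.request("PURGE", CACHE_ENDPOINT,
                     headers={"xkey": f"page-{page_id}"}, timeout=5)
```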

That's definitely something we should aim to do in the future, but as you point out, there are many unknowns in such a project, which would be independent of / orthogonal to the aim of this task.

Right now, we send purges for each edit as follows:

  • the main on-wiki url
  • the main on-wiki url with action=history
  • the mobile on-wiki url
  • the restbase url for page/html

With RESTBase sunsetting underway, the above 4 variants should drop back to 3 URLs, the state at WMF in 2014.

And if T214998: RFC: Remove .m. subdomain, serve mobile and desktop variants through the same URL gains traction, that would take us further back, to the 2012 state of WMF with only MediaWiki's canonical default of 2 URLs.

I don't see how we can stop caching and purging the rest api urls.

I don't see how we can stop caching and purging the rest api urls.

We are currently not using the CDN cache for the new page/html endpoints. This may however change when they get used more.

In general, the access pattern for these APIs is much more "flat" than organic access, since they tend to be used by crawlers. So they benefit a lot less from caching; the hit rate is likely low.

I don't see how we can stop caching and purging the rest api urls.

We are currently not using the CDN cache for the new page/html endpoints. This may however change when they get used more.

In general, the access pattern for these APIs is much more "flat" than organic access, since they tend to be used by crawlers. So they benefit a lot less from caching; the hit rate is likely low.

We are not, mostly because we don't want to add yet another URL to purge - which IMHO shows we really need to research purge-by-tag, which would allow us to cache many more non-canonical URLs.