The number of resource_change and resource_purge events can get extremely high, spiking at 10k req/sec at times. These spikes are presumably caused by changes to highly used templates. We should investigate strategies to mitigate this effect.
These events are used to purge cached copies of rendered page content from various caches (Varnish, PCS/Cassandra, etc.). This is needed so that users do not keep seeing vandalized content after a malicious change to a template has been reverted.
Context
When a template is changed, we schedule several kinds of jobs that recursively iterate over batches of the pages that use the template in question: first HTMLCacheUpdateJob and RefreshLinksJob, then later CdnPurgeJob.
- RefreshLinksJob is responsible for updating derived data in the database, in particular any entries in the link tables (pagelinks, templatelinks, etc.) associated with the affected page. This is done by re-parsing the page content (using the new version of the template). Note that the rendered output is currently not cached in the ParserCache, because we want the cache to be populated based on organic page view access patterns. That could change in the future.
- HTMLCacheUpdateJob is responsible for invalidating the ParserCache and also for purging cached copies of the output from the CDN (Varnish) layer. It updates the page_touched field in the database and causes a CdnPurgeJob to be scheduled for any URLs affected by the change.
- CdnPurgeJob uses the EventRelayerGroup service to notify any interested parties that URLs need purging. In WMF production, this triggers CdnPurgeEventRelayer, which sends resource_change events to the changeprop service, which in turn emits resource_purge events to the CDN layer. It also sends no-cache requests to services that manage their own caches, like RESTBase and PCS.
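To make the fan-out concrete, here is a minimal Python sketch of the cascade described above. It is a toy model, not MediaWiki code: the function names, the `EventLog` class, the `/wiki/` URL shape and the batch size are all assumptions for illustration. The point it shows is that one template edit produces one resource_change event per page that uses the template.

```python
from dataclasses import dataclass, field
from typing import Iterator

BATCH_SIZE = 3  # hypothetical value; the real batch size is a MediaWiki setting


@dataclass
class EventLog:
    """Stand-in for the event stream consumed by changeprop (assumption)."""
    resource_change: list[str] = field(default_factory=list)


def batches(pages: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` pages."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def cdn_purge_job(urls: list[str], log: EventLog) -> None:
    """Model of CdnPurgeJob: emit one resource_change event per URL."""
    log.resource_change.extend(urls)


def html_cache_update_job(pages: list[str], page_touched: dict[str, int],
                          now: int, log: EventLog) -> None:
    """Model of HTMLCacheUpdateJob: bump page_touched, then schedule purges."""
    for page in pages:
        page_touched[page] = now
    cdn_purge_job([f"/wiki/{page}" for page in pages], log)


def on_template_change(backlinks: list[str], page_touched: dict[str, int],
                       now: int, log: EventLog) -> int:
    """Process all pages using the template in batches; return the batch count."""
    count = 0
    for batch in batches(backlinks, BATCH_SIZE):
        html_cache_update_job(batch, page_touched, now, log)
        count += 1
    return count


if __name__ == "__main__":
    log = EventLog()
    touched: dict[str, int] = {}
    pages = [f"Page_{i}" for i in range(7)]
    n = on_template_change(pages, touched, now=1000, log=log)
    print(n, len(log.resource_change))  # 3 batches, 7 events
```

In this model the event count grows linearly with the number of pages that use the template, which is consistent with the spikes being tied to highly used templates.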
Diagram: https://miro.com/app/board/uXjVKI3NmLw=/
resource_change and resource_purge are very spiky:
See https://grafana.wikimedia.org/goto/1d9qy64HR?orgId=1 and https://grafana.wikimedia.org/goto/94rZ86VHg?orgId=1
Ideas
- Generally rely on natural expiry of caches, which should happen after one day. Only trigger recursive purges in certain cases:
  - if the template isn't used too much (maybe 100 times or so)
  - when an admin explicitly requests a recursive purge (could be a button on the purge page, or a popup after a revert)
  - after rollback/undo (unconditionally or optionally)
  - if the template is unprotected (protected templates are unlikely to be vandalized).
- Avoid purging a page if its generated output didn't actually change as a result of the change to the template.
  - Leave it to RefreshLinksJob to decide if a CdnPurgeJob is needed for a given page.
    - This would delay purging quite a bit, since RefreshLinksJob is slow.
- Avoid purges based on traffic:
  - only purge pages that were requested in the last 24h (using a join in Flink)
  - only purge pages that are in the top n-th percentile of page views.
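A rough Python sketch of how some of these ideas could fit together follows. Everything in it is hypothetical: the `TemplateChange` fields, the threshold constant, and both function names are made up for illustration, and it assumes the "certain cases" listed above are alternatives (any one of them is enough to trigger a recursive purge). The per-page filter combines the "output didn't change" idea with the "requested in the last 24h" idea; the percentile variant is not modelled.

```python
import hashlib
from dataclasses import dataclass

# Assumption: taken from "maybe 100 times or so" above; the real value is undecided.
MAX_USAGE_FOR_AUTO_PURGE = 100


@dataclass(frozen=True)
class TemplateChange:
    """Hypothetical summary of a template edit."""
    usage_count: int            # number of pages using the template
    is_protected: bool          # template is edit-protected
    is_revert: bool = False     # edit is a rollback/undo
    admin_requested: bool = False  # an admin explicitly asked for a purge


def should_purge_recursively(change: TemplateChange) -> bool:
    """Return True if any of the listed cases applies (cases are OR-ed)."""
    return (
        change.admin_requested
        or change.is_revert
        or not change.is_protected
        or change.usage_count <= MAX_USAGE_FOR_AUTO_PURGE
    )


def _digest(html: str) -> str:
    """Hash rendered output so old and new renderings can be compared cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def select_pages_to_purge(old_html: dict[str, str], new_html: dict[str, str],
                          recently_viewed: set[str]) -> list[str]:
    """Keep only pages whose output changed and that were requested recently."""
    selected = []
    for page, html in new_html.items():
        previous = old_html.get(page)
        changed = previous is None or _digest(previous) != _digest(html)
        if changed and page in recently_viewed:
            selected.append(page)
    return selected


if __name__ == "__main__":
    change = TemplateChange(usage_count=50_000, is_protected=True)
    print(should_purge_recursively(change))  # False: rely on natural expiry
```

Under these assumptions, an edit to a protected, heavily used template that is neither a revert nor an admin request would trigger no recursive purge and would rely on natural cache expiry instead.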