
Reduce DiscussionTools' usage of the parser cache
Closed, Resolved · Public

Description

This task represents two streams of work:

  1. The longer-term work associated with reducing DiscussionTools' usage of the parser cache and
  2. The near-term work of modifying the parser cache expiry to reduce its usage so the Editing Team can proceed with scaling DiscussionTools features.

Timeline

  • 12 August: New host was made primary for pc1 in codfw.
  • 16/17 August: If things continue to look as good as they do right now, promote new hosts to primary for pc2/pc3 in codfw.
  • 18/19 August: Promote new hosts to primary for pc1/2/3 in eqiad.
  • Week of 30 August: Editing Team can resume DiscussionTool deployments (T288483)
    • Actual deployment: 31 August 2021
  • After T288483: Performance and Data Persistence Teams will consider raising the retention time again

Plans

This section contains the in-progress plan for reducing DiscussionTools' usage of the parser cache.

Near-term plan of reducing parser cache usage

STEP     | DESCRIPTION                                                                                    | TICKET     | STATUS
Step 1a. | Pre-deploy: Draft plan for interim mitigation with Performance-Team and DBA.                  | (this one) |
Step 1b. | Pre-deploy: Write down how Performance-Team and DBA monitor the outcome.                      | T280602    |
Step 2.  | Execute mitigation plan.                                                                       | T280605    |
Step 3.  | Post-deploy: Evaluate impact on site performance (for at least 21 days).                      | T280606    |
Step 4b. | Post-deploy: Ramp up parser cache retention while keeping an eye on parser cache utilization. | T280604    |

Longer-term plan of decreasing #discussion-tools's usage of parser cache

Step | Description                                   | Ticket  | Notes
1    | Avoid splitting parser cache on user language | T280295 |
2    | Avoid splitting parser cache on opt-out wikis | T279864 |
3    | Deploy to more wikis as opt-out               | T275256 |

Related Objects

Event Timeline


@Krinkle @LSobanski: below is an update about the Editing Team's plans for scaling DiscussionTools features to more projects as opt-out settings.

The below is for y'all's awareness. I don't see anything about it changing/impacting the steps we've defined in this "epic"; although, if you see this differently, please say as much...

Editing Team's plan for scaling DiscussionTools features

  1. This week, we're beginning conversations with volunteers at ~25 Wikipedias inviting their feedback about our plans to offer the Reply Tool as an opt-out setting (T262331) at their project. We will not be making any commitments about specific deployment dates considering these dates depend on us resolving the parser cache utilization issue. T281533 captures the work involved with having said "conversations."
  2. Once we start receiving consent from wikis to turn the Reply Tool on by default, we'll comment here asking y'all about the parser cache's utilization status so we can, in turn, provide updates to projects about when they can potentially expect to see the Reply Tool available to everyone at their projects.
  3. Once we are comfortable with the parser cache's utilization, we'll proceed with offering the Reply Tool as an opt-out setting at the projects referenced in "2."

Update: 1 July 2021

Documenting the next steps that emerged in the two meetings related to this issue today...

Next steps

  • @Krinkle to verify whether the optimizations made in T282761 have been effective: T280606
  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993 / T280604
  • Editing Team to estimate the growth in DiscussionTools' demand for Parser Cache storage: T285995 [i]

i. The need for this estimate emerged in a second conversation between @DannyH, @marcella, @DAbad, and myself.

  • @Krinkle to verify whether the optimizations made in T282761 have been effective

That's T280606: Post-deployment: evaluate impact on site performance.

  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993

This'll be part of T280604: Post-deployment: (partly) ramp parser cache retention back up; moved as a subtask there.

  • @Krinkle to verify whether the optimizations made in T282761 have been effective

That's T280606: Post-deployment: evaluate impact on site performance.

Noted.

  • @Krinkle to estimate the growth in demand for Parser Cache storage: T285993

This'll be part of T280604: Post-deployment: (partly) ramp parser cache retention back up; moved as a subtask there.

Noted. Excellent. Thank you.

I fell down the rabbit hole of ParserCache when I was investigating for T285987: Do not generate full html parser output at the end of Wikibase edit requests (unrelated to DiscussionTools but related to ParserCache). I have some results I would like to share; where should I post my numbers?

I don't know where to put this so I put my findings here. I did a sampling of 1:256 and checked the keys. In total we have 550M PC entries.
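For anyone who wants to reproduce this kind of breakdown: below is a rough sketch of how sampled keys can be tallied by wiki and by the options that fragment them. It is illustrative only; the key layout assumed here (options appended after "!", a literal "canonical" suffix when everything is default) is a simplification based on the entries described in this comment, not a verified description of ParserCache's storage format.

```
<?php
// Illustrative sketch: tally a 1:256 sample of parser cache keys by wiki and by
// fragmenting option. The assumed key layout is a simplification, e.g.:
//   "commonswiki:pcache:idhash:12345-0!userlang=de!responsiveimages=0"

$sampleFile = 'pc_key_sample.txt'; // hypothetical input: one sampled key per line
$sampleRate = 256;                 // 1:256 sampling

$perWiki = [];
$perOption = [];
$canonical = 0;

foreach ( file( $sampleFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES ) as $key ) {
	$wiki = explode( ':', $key, 2 )[0];
	$perWiki[$wiki] = ( $perWiki[$wiki] ?? 0 ) + 1;

	$bang = strpos( $key, '!' );
	$optionsPart = $bang !== false ? substr( $key, $bang + 1 ) : '';
	if ( $optionsPart === '' || $optionsPart === 'canonical' ) {
		$canonical++;
		continue;
	}
	foreach ( explode( '!', $optionsPart ) as $opt ) {
		$name = explode( '=', $opt, 2 )[0]; // e.g. "userlang", "wb", "responsiveimages"
		$perOption[$name] = ( $perOption[$name] ?? 0 ) + 1;
	}
}

arsort( $perWiki );
arsort( $perOption );
printf( "Estimated total entries: ~%d\n", array_sum( $perWiki ) * $sampleRate );
printf( "Estimated canonical entries: ~%d\n", $canonical * $sampleRate );
foreach ( array_slice( $perWiki, 0, 10, true ) as $wiki => $n ) {
	printf( "%s: ~%d entries\n", $wiki, $n * $sampleRate );
}
foreach ( $perOption as $name => $n ) {
	printf( "split on %s: ~%d entries\n", $name, $n * $sampleRate );
}
```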

I'm struggling to see how discussion tools can cause issues for parsercache. Its current fragmentation is next to nothing (0.28% extra rows, currently around 1.4M rows). Maybe the reduction of expiry has helped but I would like to see some numbers on that.

The actual problem is parser cache entries of commons. It's currently 29% of all parser cache entries and over 160M rows. To compare, this is more than all PC entries of enwiki, wikidata, zhwiki, frwiki, dewiki, and enwiktionary combined. I think it's related to a bot that purges all pages in commons, or it can be due to refreshLinks jobs or the CirrusSearch sanitizer job misbehaving (or a combination of all of them). This needs a much deeper and closer look.

Looking at commons a bit closer: Out of 160M entries:

  • 136M rows are non-canonical and only 24M rows are canonical
  • 100M rows have wb=3 on them. I don't know what wikibase is supposed to do on commons for parsercache but this doesn't sound right at all. We don't have new termbox there.
  • 108M are not render requests and 60M are render requests.
  • 52M are fragmentation due to user language not being English.
  • 39M rows are because of 'responsiveimages=0'.

I'll keep looking into this in more depth and keep you all posted.

Some random stuff I found:

  • People can fragment parsercache by choosing random languages. For example I found an entry with userlang=-4787_or_5036=(select_(case_when_(5036=4595)_then_5036_else_(select_4595_union_select_4274)_end))--_emdu
  • TMH seems to be using ParserCache as a general purpose cache in ApiTimedText. I found entries like commonswiki:apitimedtext:Thai_National_Anthem_-_US_Navy_Band.ogg.ru.srt:srt:srt there. This is not much but has potential to explode.
  • There is a general problem of bots editing pages and triggering a parsed entry while actually no one is looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old place. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

I'll dig more and let you know.

100M rows have wb=3 on them. I don't know what wikibase is supposed to do on commons for parsercache but this doesn't sound right at all. We don't have new termbox there.

This is added by WikibaseRepo and will probably appear in ALL commons and wikidata (and the associated test site) pcache keys
https://github.com/wikimedia/Wikibase/blob/c1791fbca79be6f14b42a4117367ddaa1e618023/repo/includes/RepoHooks.php#L1069-L1073
Though this has consistently been 3 for years now, so no extra splitting should be happening here
We could probably drop this.

Not sure why only some % of commons entries seem to have this? The hook looks like it always adds it?
Could be something to do with MCR? Not sure.
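For readers not familiar with the mechanism, here is a minimal sketch of how an extension-defined parser option ends up in the cache key. This is not the actual WikibaseRepo code linked above, just an illustration using the real core ParserOptionsRegister hook; as long as the registered value never varies, the option shows up in keys (like the wb=3 above) without actually multiplying the number of stored entries.

```
<?php
// Minimal sketch of the mechanism, not the actual WikibaseRepo implementation.
// ParserOptionsRegister is a real MediaWiki core hook; 'wb' mirrors the option
// seen in the sampled keys above.

class ExampleRepoHooks {
	public static function onParserOptionsRegister(
		array &$defaults, array &$inCacheKey, array &$lazyLoad
	) {
		// A constant default that participates in the cache key: affected entries
		// carry "wb=3", but because the value never changes between requests it
		// does not split the cache further. Bumping the constant would simply
		// invalidate previously cached renders.
		$defaults['wb'] = 3;
		$inCacheKey['wb'] = true;
	}
}
```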

There is a general problem of bots editing pages and triggering a parsed entry while actually no one is looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old place. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

This is also a problem for Wikidata, and we are going to stop this from happening in T285987: Do not generate full html parser output at the end of Wikibase edit requests

Change 708520 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/extensions/TimedMediaHandler@master] Avoid using ParserCache as a general purpose cache

https://gerrit.wikimedia.org/r/708520

I'm struggling to see how discussion tools can cause issues for parsercache. Its current fragmentation is next to nothing

The fragmentation issue in DT was solved many months ago at the source already, and later with the reduced retention, so it is expected to be very low now.

commons. It's currently 29% of all parser cache entries and over 160M rows. To compare, this is more than all PC entries of enwiki, wikidata, zhwiki, frwiki, dewiki, and enwiktionary combined.

Thanks, this is very nice. We hadn't yet tried to break it down this way. Right now, though, I'd say we're not actively looking to decrease this. Previous experience does tell us that even low hit rates are useful in PC given the high cost of generating them. I'm actually thinking about a possible future where PC is more like ExternalStore, in that it would not have a TTL at all, but be basically append-only (apart from replacing entries with current revisions, and applying deletions). Especially as we get closer to Parsoid being used for page views, which has a relatively strong need to have an expansion ready to go at all times. As well as improving performance for page views more broadly by getting the miss-rate so low that we could potentially even serve an error if a PC entry is missing (and queue a job or something). This will require a lot more work, but it shows a rough long-term direction that I'm considering. (Nothing is decided on yet.)

Some random stuff I found:

  • People can fragment parsercache by choosing random languages. For example I found an entry with userlang=-4787_or_5036…

This is required for the int-lang hack. These should be given a shortened TTL, same as for old revisions and non-canonical preferences, but at least as long as we support this feature it is still worth caching, I imagine.

I'm hoping to, in the next 1-2 years, deprecate and remove this feature as it seems the various purposes for it have viable alternatives nowadays. It'll take a long time to migrate, but during the migration we could potentially disable caching at some point, or severely limit which wikis/namespaces it is cached for, and eventually disable it entirely (e.g. by normalising to a valid language code).
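To illustrate the shortened-TTL idea (a hedged sketch, not how core currently decides this): a render's retention can be reduced at parse time via ParserOutput::updateCacheExpiry(), which can only lower the expiry, never raise it. The decision logic below is an assumption for illustration.

```
<?php
// Sketch only: give non-canonical, userlang-fragmented renders a shorter lifetime.
// updateCacheExpiry() is a real ParserOutput method; the surrounding logic is illustrative.

use MediaWiki\MediaWikiServices;

function shortenTtlForNonCanonicalLanguage( ParserOptions $popts, ParserOutput $output ): void {
	$contentLangCode = MediaWikiServices::getInstance()->getContentLanguage()->getCode();

	// If the render was split on a non-default user language (the int-lang hack case),
	// keep it around for a day instead of the full retention period.
	if ( $popts->getUserLangObj()->getCode() !== $contentLangCode ) {
		$output->updateCacheExpiry( 86400 );
	}
}
```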

  • TMH seems to be using ParserCache as a general purpose cache in ApiTimedText. I found entries like commonswiki:apitimedtext:Thai_National_Anthem_-_US_Navy_Band.ogg.ru.srt:srt:srt there. This is not much but has potential to explode.

Ack. I think we may have one or two other things like this. These are basically using PC as if it is the MainStash, where we are currently short on space. Being worked on at T212129.
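For reference, the direction such a change would likely take (a sketch under assumptions, not the actual TimedMediaHandler patch): cache derived blobs like converted subtitle tracks in the main WAN object cache with an explicit TTL, instead of piggybacking on ParserCache keys.

```
<?php
// Sketch only. WANObjectCache and getWithSetCallback() are real core APIs;
// convertSubtitles() is a hypothetical stand-in for the expensive conversion step.

use MediaWiki\MediaWikiServices;

function getConvertedSubtitles( string $fileKey, string $lang, string $format ): string {
	$cache = MediaWikiServices::getInstance()->getMainWANObjectCache();

	return $cache->getWithSetCallback(
		// Dedicated key space rather than a parser-cache-style key.
		$cache->makeKey( 'timedtext-converted', $fileKey, $lang, $format ),
		$cache::TTL_WEEK,
		static function () use ( $fileKey, $lang, $format ) {
			return convertSubtitles( $fileKey, $lang, $format ); // hypothetical helper
		}
	);
}
```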

There is a general problem of bots editing pages and triggering a parsed entry while actually no one is looking at them. E.g. ruwikinews, a very small wiki in terms of traffic, apparently now has 15M ParserCache rows (ten times bigger than all of DiscussionTools' overhead), mostly because they recently imported a lot of news from an old place. We could rethink this and maybe avoid parsing the page and storing a PC entry if the bot flag is set.

As mentioned above, PC benefits a lot from the long-tail. So intentional measures not to pre-cache entries during edits would affect performance of API queries, Jobs, and eventually page views. It may be good to have this as one of several emergency levers we can pull to reduce load, but I'm not sure about it in general.

In general though, I think right now we're stable and I'd prefer not to make major changes to demand if we can avoid it until this task and its subtasks are completed.

In general though, I think right now we're stable and I'd prefer not to make major changes to demand if we can avoid it until this task and its subtasks are completed.

Sorry if this should be obvious, @Krinkle: Can you tell me whether deploying DiscussionTools to ~20 Wikipedias (not English) would constitute "making major changes"?

My comment was mainly in response to Amir.

For DiscussionTools, I believe the plan is to first let ParserCache switch to the new hardware. These new servers are online as of last month, and are being warmed as of last week (a satisfying line graph). If it keeps growing as it has been, it should converge within the next 4-5 days. If nothing unexpected comes up before then, I imagine we'll greenlight DT, and then keep working on T280604 and T280606 in parallel during the weeks that follow.

I've confirmed the above with @Marostegui. He expects the new hardware to be warmed up by August 12 and pooled/switched out on that day.

I've confirmed the above with @Marostegui. He expects the new hardware to be warmed up by August 12 and pooled/switched out on that day.

Ish :) I plan to make one of the new hosts a pc primary on thursday (aug 12th), and let it run over the weekend for observation. If all goes well, i'll start promoting the other new hosts the following week. We'll probably want another week or two after that before we start changing anything else, so that we have some historical data with the new hosts as a basis of comparison.

I've confirmed the above with @Marostegui. He expects the new hardware to be warmed up by August 12 and pooled/switched out on that day.

Ish :) I plan to make one of the new hosts a pc primary on thursday (aug 12th), and let it run over the weekend for observation. If all goes well, i'll start promoting the other new hosts the following week. We'll probably want another week or two after that before we start changing anything else, so that we have some historical data with the new hosts as a basis of comparison.

hi @Kormat: two questions for you:

  1. Are you able to give the provisional [i] timeline below a quick read and tell me if you see anything unexpected about it?
  2. Is T284825 the best ticket for us to follow to stay updated on the Parser Cache host transitions you are referencing above?

Timeline [i]

  • Today, 12 August: One of the new hosts is made a primary Parser Cache host
  • Tuesday, 16 August: Determine whether the initial new host transition was successful.
  • Monday, 23 August: Remaining new hosts are made to be primary Parser Cache hosts.
    • This assumes the initial transition that took place on 12 August was successful.
  • Week of 6 September: Editing Team can resume DiscussionTool deployments (T288483)
    • This assumes the subsequent transition that started on 23 August was successful.

i. Please know the Editing Team appreciates that these dates are subject to change; we do not want putting specific dates on these milestones to suggest we are depending on the timeline playing out precisely as described!

@ppelberg I think the one item missing from the timeline above is the increase of the retention period, plus bake time to see how that works with the new hosts / purge script updates in place. @Krinkle is probably the best person to confirm whether that's the case and suggest appropriate timeline entries.

Hi @ppelberg, given how things are looking right now (T284825#7281001), i'd give a slightly more accelerated timeline.

  • 12 August - new host was made primary for pc1 in codfw.
  • 16/17 August - if things continue to look as good as they do right now, promote new hosts to primary for pc2 pc3 in codfw.
  • 18/19 August - promote new hosts to primary for pc1/2/3 in eqiad.
  • Week of 30 August - Editing Team can resume DiscussionTool deployments (T288483: Deploy config to make Reply Tool available as opt-out at phase 2 wikis)

After you've finished your phase 2 deployment, we can then consider raising the retention time again. Doing it in this order (as opposed to raising retention time before deploying DT further) means we have finer-grained control over things. It's very easy to slowly increase the retention time, see what effect it has, and decrease it again if necessary.
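(For anyone following along: the retention discussed here maps to a single MediaWiki setting. Below is a hedged, illustrative sketch of a gradual step-up in site configuration; the numbers are made up and this is not the actual wmf-config change.)

```
<?php
// Illustrative only, not the actual wmf-config values.
// $wgParserCacheExpireTime is the core setting that controls parser cache retention.
$wgParserCacheExpireTime = 86400 * 14; // step up from the temporarily reduced retention
// ...watch disk usage and hit rate for a week or two, then e.g.:
// $wgParserCacheExpireTime = 86400 * 21;
// $wgParserCacheExpireTime = 86400 * 30; // eventually back toward the pre-mitigation value
```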

I hope that sounds reasonable :)

Hi @ppelberg, given how things are looking right now (T284825#7281001), i'd give a slightly more accelerated timeline...

All that you described sounds reasonable to us, @Kormat – we appreciate you providing this additional context!

I've updated the task description with the timeline you shared in T280599#7281173 so that you, @LSobanski, @Krinkle, and anyone else, can update it as new information emerges.

cc @LZaman @Whatamidoing-WMF

@Kormat / @Krinkle: are y'all able to share what – if any – impact T288998 has on the timeline in the task description?

@ppelberg: short answer: no impact expected.

Longer answer:
With the change in T288998 reverted, the disk usage is reverting back to the norm. There's another change which is likely to get rolled out Soon (T285987), which should also have the effect of decreasing PC usage. While in an ideal world it would be nice to have only one thing changing PC at a time, the most important thing is being able to pinpoint the source if a significant increase in usage occurs.

Also, you folks have been very patient, and i don't want to delay you any further without a compelling reason :)

One suggestion I had: if it's enabled by default for everyone, you can make the default PC option true (and set it to false when it's not enabled, e.g. due to a user preference), which would reduce the fragmentation.

@ppelberg: short answer: no impact expected.

Wonderful 😅

Longer answer:
With the change in T288998 reverted, the disk usage is reverting back to the norm. There's another change which is likely to get rolled out Soon (T285987), which should also have the effect of decreasing PC usage. While in an ideal world it would be nice to have only one thing changing PC at a time, the most important thing is being able to pinpoint the source if a significant increase in usage occurs.

Mmm, I see. This context helps me to better understand how the deployment and spike do and do not relate.

Also, you folks have been very patient, and i don't want to delay you any further without a compelling reason :)

We appreciate you saying as much; I also appreciate how communicative y'all have been.

One suggestion I had: if it's enabled by default for everyone, you can make the default PC option true (and set it to false when it's not enabled, e.g. due to a user preference), which would reduce the fragmentation.

Yes, this has already been the case since T279864.

Maybe my understanding of ParserCache is flawed (which is very possible), but I think this wouldn't reduce it, as the default value for that wiki is still null (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/680316/5/includes/Hooks/ParserHooks.php#72), meaning it still won't be treated as the canonical PC entry (and you should flip the condition, setting it to false when the default becomes true). Have you tested it and its impact?

The problem is that DerivedPageDataUpdater::doParserCacheUpdate creates a canonical parser cache entry, so with each edit you create one without dtreply=1, and when the user refreshes or views the page, you re-render it with dtreply=1 and store both in ParserCache.

This is my understanding from reading the code and looking at keys. I might be very wrong here. Also, keep in mind that this switch to canonical must happen after all old PC entries have expired, so ten days or a month after deployment.
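To make the mechanism concrete (a minimal sketch, not DiscussionTools' actual hook code; both hooks are real core hooks and 'dtreply' is the option discussed above):

```
<?php
// Minimal sketch of how a parser-cache-splitting option behaves, not DiscussionTools' code.

class ExampleDiscussionHooks {
	public static function onParserOptionsRegister(
		array &$defaults, array &$inCacheKey, array &$lazyLoad
	) {
		// This default is what the "canonical" entry written on edit
		// (DerivedPageDataUpdater::doParserCacheUpdate) ends up with.
		$defaults['dtreply'] = null;
		$inCacheKey['dtreply'] = true;
	}

	public static function onArticleParserOptions( Article $article, ParserOptions $popts ) {
		// Page views where the tools are enabled diverge from that canonical render,
		// so two variants per talk page end up stored in the parser cache.
		$popts->setOption( 'dtreply', true );
	}
}
```

Flipping the default (or removing the option entirely, as suggested below) makes the view-time render match the canonical one, which is what eliminates the extra rows.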

Sorry, you're right and I was wrong, I misunderstood what you meant by "default". (And I also wasn't aware that a parser output that ignores our 'ArticleParserOptions' hook handler is always generated.)

I did not test it either, though; I also just read the code.

@Ladsgroup So… should we actually just remove our parser cache option completely?

Change 713681 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Always apply DiscussionTools page transformations

https://gerrit.wikimedia.org/r/713681

Change 714716 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/extensions/DiscussionTools@master] Remove parser cache splitting ('dtreply' option)

https://gerrit.wikimedia.org/r/714716

Let me know if these patches make sense, and if/when we should aim to deploy them.

Thanks. I'll take a look at it ASAP. I have quite a busy day today and tomorrow, so I can't promise :(

@Kormat, a question: are y'all comfortable with us coupling two other small DiscussionTools deployments [i][ii] with the deployment of T288483 we have scheduled for next week?


i. T284339: Offer the Reply Tool as opt-out setting at Wikimania wiki
ii. T285162: Add DiscussionTools extension in ptwikinews

Not @Kormat but I'll ask a few clarifying questions:

  • Do we know if these will create additional load on ParserCache and, if yes, how it would compare to what T288483 adds?
  • If these are deployed at the same time and we see problems, we'll have to roll all of them back; is that feasible / acceptable?
  • They're tiny compared to the wikis in T288483. Portuguese Wikinews is smaller than almost all of the Wikipedias listed there. Wikimania wiki is completely inactive, because Wikimania already ended, and we weren't able to enable the reply tool in time to be useful this year…
  • Yes

Looks good to me then but I'd still like @Kormat to weigh in.

An update and a question...

Update: today, 31 August, the Reply Tool became available as an on-by-default feature at the 21 Wikipedias listed in T288483. @Addshore, @Krinkle, @Kormat, @Ladsgroup, @LSobanski, we appreciate you all helping to make this possible.

Resulting question: @Kormat/@Krinkle/@LSobanski, now that T28060 is resolved, would it be accurate for the Editing Team to think that T280604 is all that's left to be resolved before we can move forward with additional DiscussionTool deployments? [i] Are there any other steps to be taken, questions to be answered, etc.?


i. E.g. T273072, T288485, T284339, and T285162

Update: today, 31 August, the Reply Tool became available as an on-by-default feature at the 21 Wikipedias listed in T288483. @Addshore, @Krinkle, @Kormat, @Ladsgroup, @LSobanski, we appreciate you all helping to make this possible.

🎉 You're welcome!

Resulting question: @Kormat/@Krinkle/@LSobanski, now that T28060 is resolved, would it be accurate for the Editing Team to think that T280604 is all that's left to be resolved before we can move forward with additional DiscussionTool deployments? [i] Are there any other steps to be taken, questions to be answered, etc.?


i. E.g. T273072, T288485, T284339, and T285162

From my perspective, the Editing Team should feel free to move forward without waiting for T280604. The impact of the lowered retention period seems to be very minor (T280606#7324071). I'll let @Krinkle weigh in on that, though, in case i'm missing something.

Resulting question: @Kormat/@Krinkle/@LSobanski, now that T28060 is resolved, would it be accurate for the Editing Team to think that T280604 is all that's left to be resolved before we can move forward with additional DiscussionTool deployments? [i] Are there any other steps to be taken, questions to be answered, etc.?


i. E.g. T273072, T288485, T284339, and T285162

From my perspective, the Editing Team should feel free to move forward without waiting for T280604. The impact of the lowered retention period seems to be very minor (T280606#7324071). I'll let @Krinkle weigh in on that, though, in case i'm missing something.

Excellent, okay. Thank you, @Kormat. We'll stand by for input from Timo.

would it be accurate for the Editing Team to think that T280604 is all that's left to be resolved before we can move forward […] ?

Per previous comments T280599#7264655 and T280606#7323632: no, this is not blocking rollout, and it can continue in parallel with your rollout. DT is a go from my perspective.

would it be accurate for the Editing Team to think that T280604 is all that's left to be resolved before we can move forward […] ?

Per previous comments T280599#7264655 and T280606#7323632: no, this is not blocking rollout, and it can continue in parallel with your rollout. DT is a go from my perspective.

Excellent. Thank you for confirming, @Krinkle

In line with the above, I'm going to remove this task as blocking T288484, T273072, and T284339.

Should anything change, please ping us here.

Just to raise awareness, I'll mention that the desktop refresh project plans on making changes to ToC representation in the generated HTML. That *may* change representation in the ParserCache or even split it, depending on how some parts of the design work out. Let me know if you want to be added to the planning discussion.

Change 713681 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Always apply DiscussionTools page transformations

https://gerrit.wikimedia.org/r/713681

We've merged https://gerrit.wikimedia.org/r/713681, so now DiscussionTools' markup will be added to all talk pages. This will temporarily increase the parser cache usage a little (I don't have the data to estimate how much… but probably not much), but will allow us to remove the parser cache split.

In a week or two, once that data is cached, we'll merge https://gerrit.wikimedia.org/r/714716 to stop splitting the parser cache. Once the old entries expire, this will permanently reduce the parser cache usage somewhat (I also don't have the data here, but probably not that much overall).

And then I suppose we can resolve this task.

Change 714716 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Remove parser cache splitting ('dtreply' option)

https://gerrit.wikimedia.org/r/714716

ppelberg claimed this task.

Just to raise awareness, I'll mention that the desktop refresh project plans on making changes to ToC representation in the generated HTML. That *may* change representation in the ParserCache or even split it, depending on how some parts of the design work out. Let me know if you want to be added to the planning discussion.

Please keep me in the loop regarding these changes. Thanks!