Page MenuHomePhabricator

Use index on rc_this_oldid
Closed, ResolvedPublic

Description

There was a discussion in T138444: Use rc_timestamp index when joining to ores_classification about queries in recentchanges that are slow. It seems most of these issues can be resolved by adding an index on rc_ths_oldid but before moving on I want to know what ArchCom and mediawiki developers think of this in general.

Event Timeline

I'm ok with this I will only chime later when it has to be deployed.

@hoo you told me that we have a few places in core that work around the fact that we currently don't have an index on rc_this_oldid. Can you tell us what they are?

In T138444#2404904 (a different task), @jcrespo wrote:

[The recentchanges table needs some clean up] Assuming this is fixed/mitigated: T132416 and the change (being core) has consensus of all developers (please seek it), the actual implementation, thanks to MariaDB 10 is easy-ish - around a week of work with little to no impact once everything is prepared.

@jcrespo: could you clarify what exactly you're hoping to be changed before adding an index, and what you're looking for developer consensus about? In particular, clarification about what situation you're trying to avoid would be helpful.

clarification about what situation you're trying to avoid would be helpful.

I am trying to avoid mediawiki maintenance hell, aka T132416. Or situations such as T139056#2419033. Or me going after someone to ask for any kind of testing process to its changes T69223#1897163. People adding and removing indexes without being tracked properly and without the due procedure has lead to those issues. When people ask me why it takes so much time to do schema changes, most of the time it is to workaround existing issues with the schema, and I see not many people caring about those.

The requested workflow for a schema change is relatively clear: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Without developer consensus, an index can be added one day and removed the next day. That may be possible to do on code, but it cannot be deployed into production. Consensus doesn't mean there needs to be any 1-hour discussion about it, I would be happy with: 1) one or several people (if possible, with more mediawiki experience) not part of the team that wants the change done has reviewed the change and see it as a good thing 2) testing has been done that proofs that this state is better than the previous one 3) agreement that the objects affected will not be changed for (this is an arbitrary duration) 6 months because it is an stable feature 4) we wait for any potential change on the same object, unrelated or not, to be done at the same time to minimize maintenance overhead (e.g. 2 table indexes done at the same time).

Reviewing changes and doing maintenance may not be as interesting as creating new features, but someone has to do it because if not it only increases technical debt. Maybe the questions is because you lack a mediawiki contributor that has knowledge of databases that can review that?

Maybe the questions is because you lack a mediawiki contributor that has knowledge of databases that can review that?

Note, I was asking my questions not as a member of the team working on MediaWiki-extensions-ORES, but as a member of TechCom trying to understand how to prioritize this TechCom-RFC. Thankfully, your comment provided just the clarification I was hoping for....

I am trying to avoid mediawiki maintenance hell, aka T132416. Or situations such as T139056#2419033. Or me going after someone to ask for any kind of testing process to its changes T69223#1897163. People adding and removing indexes without being tracked properly and without the due procedure has lead to those issues. [...] Reviewing changes and doing maintenance may not be as interesting as creating new features, but someone has to do it because if not it only increases technical debt.

Thanks for spelling this all out. I think there's a lot of information in the [your full comment](T139012#2420892) as well as our 2012 conversation when we first moved to frequent deploys, and it may be time for us to update our development process to reflect what we know now.

Danny_B renamed this task from [RFC] Use index on rc_this_oldid to Use index on rc_this_oldid.Jul 3 2016, 10:46 AM
Danny_B added a project: Proposal.

@hoo you told me that we have a few places in core that work around the fact that we currently don't have an index on rc_this_oldid. Can you tell us what they are?

I checked and it seems using "rc_timestamp" hack directly used in these places:

  1. On RevDelLogItem.php. Adding index on rc_this_oldid won't help since this one need index on rc_logid
  2. On maintenance/rebuildrecentchanges.php. This one obviously benefits from adding the index.

I've found some usages of index on rc_timestamp which seems to be unrelated:

  1. On ApiQueryRecentChanges.php

These ones indirectly use rc_timestamp index to get faster results: (Per source code RecentChange::newFromConds passes conds to db query.) It seems rc_timestamp is not used and it's there just for sake of the index. Note that in these cases rc_timestamp is used once. Search for it.

  1. On Article.php
  2. On Revision.php

I'm not an expert in mediawiki so maybe I'm wrong. Please correct me If I am.

Two main questions from a brief discussion in an architecture meeting:

  • Due to the overhead involved with changing indexes, are there additional indexes we want to add to the recentchanges table?
  • Should the requested index only cover the rc_this_oldid column, or additional columns? It is not a unique column (e.g. redirects, categorisation). However, it may not matter if if all use cases for querying this index involve wanting all matches.

Two main questions from a brief discussion in an architecture meeting:

  • Due to the overhead involved with changing indexes, are there additional indexes we want to add to the recentchanges table?

Per what I lined above I think having another index but on "rc_logid" is necessary too. I think this one should be on one column.

  • Should the requested index only cover the rc_this_oldid column, or additional columns? It is not a unique column (e.g. redirects, categorisation). However, it may not matter if if all use cases for querying this index involve wanting all matches.

I have two possible options in mind: 1- 'rc_this_oldid, rc_timestamp': This one makes queries in mediawiki core faster but not much 2- 'rc_this_oldid, rc_last_oldid' this one makes queries about diffs much faster but I couldn't find any use case in mediawiki/core.
As developer of ORES extension, the first option will help me greatly.

During the ArchCom meeting on May 31, it was agreed that this RFC be put on Last Call: if no new pertinent concerns are raised and remain ope by June 14, this RFC will be approved for implementation.

Beyond concerns regarding this RFC, it would also be useful to mention any other modifications to the recentchanges table that could be applied together with the new index.

Since no objections were raised within the final call period, this has RFC been approved by ArchCom.

10 indexes on a single table, is that a record? :-)

Well, It's one of the most used tables. Making it fast would benefit everyone (I know it's bad in matter of storage especially InnoDB doesn't compress indexes) but I'm confident advantages outweigh disadvantages.

It seems like this is no longer blocked on anything.

@Ladsgroup Is the proposal for this index still up to date and needed? From a quick glance, it seems there is still not an index on rc_this_oldid.

Yes, it only needs implementation, should I make another ticket for that?

Feel free to use this ticket for that. It's no longer on TechCom-RFS's main workboard.

These ones indirectly use rc_timestamp index to get faster results: (Per source code RecentChange::newFromConds passes conds to db query.) It seems rc_timestamp is not used and it's there just for sake of the index. Note that in these cases rc_timestamp is used once. Search for it.

  1. On Article.php

That USE INDEX is being passed to a dbtype parameter... See T192584

I implement this in order to improve performance of ores extension.

Change 450083 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/core@master] Add index on rc_this_oldid

https://gerrit.wikimedia.org/r/450083

Change 450083 merged by jenkins-bot:
[mediawiki/core@master] Add index on rc_this_oldid

https://gerrit.wikimedia.org/r/450083

Change 554521 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Remove hacks for lack of index on rc_this_oldid

https://gerrit.wikimedia.org/r/554521

Change 554521 merged by jenkins-bot:
[mediawiki/core@master] Remove hacks for lack of index on rc_this_oldid

https://gerrit.wikimedia.org/r/554521