Page MenuHomePhabricator

Refactor comment storage in the database and abstract access in MediaWiki
Closed, ResolvedPublic

Description

This is the task to implement T153333: RFC: How should we store longer revision comments?, now that the request has been closed. The text below is my current plan for the code I need to write.

Database

We'll create a new comment table that looks like this:

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/comment_text ON comment (comment_text(100));

Comments stored in this table will be deduplicated. Replication to Labs will need to hide rows that aren't referred to by any of the tables listed below, or that are only referred to by the tables below in rows with revision-deleted comments. I'll provide more details on that in the schema change subtask, once I get to that point.

The end state will be to make the following changes to existing tables to refer to this new table instead of storing the comment directly:

  • revision.rev_commentrevision.rev_comment_id
  • archive.ar_commentarchive.ar_comment_id
  • ipblocks.ipb_reasonipblocks.ipb_reason_id
  • image.img_descriptionimage.img_description_id
  • oldimage.oi_descriptionoldimage.oi_description_id
  • filearchive.fa_deleted_reasonfilearchive.fa_deleted_reason_id
  • filearchive.fa_descriptionfilearchive.fa_description_id
  • recentchanges.rc_commentrecentchanges.rc_comment_id
  • logging.log_commentlogging.log_comment_id
  • protected_titles.pt_reasonprotected_titles.pt_reason_id
  • cu_changes.cuc_commentcu_changes.cuc_comment_id (from CheckUser)
  • cu_log.cul_reasoncu_log.cul_reason_id (from CheckUser)
    • These two should be ignored when determining visibility in the Labs replication
  • global_block_whitelist.gbw_reasonglobal_block_whitelist.gbw_reason_id (from GlobalBlocking)
  • globalblocks.gb_reasonglobalblocks.gb_reason_id (from GlobalBlocking)
  • flow_revision.rev_contentflow_revision.rev_content_id (from StructuredDiscussions)

However, since schema changes on large busy tables are a lot of work that we don't want to be blocked on for months, the revision and image tables will use a temporary auxiliary table that will later be merged into the main tables:

CREATE TABLE /*_*/revision_comment_temp (
  revcomment_rev bigint unsigned NOT NULL,     -- → revision.rev_id
  revcomment_comment bigint unsigned NOT NULL, -- → comment.comment_id
  PRIMARY KEY (revcomment_rev, revcomment_comment)
) /*$wgDBTableOptions*/;

CREATE TABLE /*_*/image_comment_temp (
  imgcomment_image varchar(255) binary NOT NULL, -- → image.img_name, ugh
  imgcomment_comment bigint unsigned NOT NULL,   -- → comment.comment_id
  PRIMARY KEY (imgcomment_image, imgcomment_comment)
) /*$wgDBTableOptions*/;

MediaWiki code

There will be a temporary feature flag with four states:

  • Read and write the old columns only.
  • Write both the old and new columns. Read from new preferentially, falling back to old.
  • Write only the new columns. Read from new preferentially, falling back to old.
  • Read and write the new columns only.

There will be a class that manages the feature flag and such. Methods it'll probably need:

  • getFields( $key ): Pass e.g. rev_comment, returns the fields that should be selected. The comment might have to be lazy-loaded later.
  • getJoin( $key ): Pass e.g. rev_comment, returns the tables to join, join conditions, and fields that should be selected.
  • insert( $dbw, $key, $comment ): Inserts the comment into the comment table, if applicable and necessary, and returns a mapping of field → value to be included in the insert into the main table.
    • I'll want a version of this that takes some LogEntry-like class (or maybe just use LogEntry?) and generates the comment_text and comment_data from it.
  • getComment( $key, $row ): Returns the comment string (or message?). May need to make a DB query if getFields() was used instead of getJoin().

Then I'll need to search the code for references to the old columns and patch them to use this new class.

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Change 357599 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

@jcrespo: It'd be nice to prevent duplicate rows in these temporary tables at the database level as well as the application level, which means a unique index on the revcomment_rev column. Which do you think would be best?

  1. Primary key on (revcomment_rev, revcomment_comment), unique index on (revcomment_rev).
  2. Primary key on (revcomment_rev), covering non-unique index on (revcomment_rev, revcomment_comment).
  3. Primary key on (revcomment_rev), no covering index.

When I test these on a small local database (around 4000 rows), EXPLAIN on #1 uses a 'ref' on PRIMARY with "Using index" while the other two both prefer an 'eq_ref' on PRIMARY with no "Using index".

I will answer based on certain assumptions, please tell me if those are wrong:

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks. Basically, I expect queries to be:

SELECT ... FROM revision JOIN revision_comment_temp ON revision.rev_id = revcomment.revcomment_rev JOIN comment on revcomment.revcomment_rev = comment.comment_id WHERE /* filter on revision */

n-n relationship tables are normally setup with a covering PK in the order they are most frequently access, that would be (revcomment_rev, revcomment_comment), that would allow them to be unique, clustered by the main access, and using a covering index. If the other access is needed, create an additional index on (revcomment_comment) or on (revcomment_comment, revcomment_rev), but for this second one, I would wait for it to be necesary, for queries like:

SELECT ... FROM comment JOIN comment on revcomment.revcomment_rev = comment.comment_id JOIN revision ON revision.rev_id = revcomment.revcomment_rev WHERE /* filter on comment */

Where it is expected for the join to go in the opposite direction. I would assume that is not currently the case because I think there is currently no index on comment, but please tell me otherwise, as that will change my suggestion. I strongly suggest to not add such a possibility on a first step (it is ok, if it is for maintenance/migration purposes).

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks.

For production usage, all accesses will be from revision to comment.

In addition to potential maintenance or analytics tasks, Labs replica filtering will probably need to make accesses from comment to revision to determine whether a comment row should be visible or not. These sorts of things would also likely want indexes on columns like ipblocks.ipb_reason_id that aren't using these temporary tables.

n-n relationship tables

This isn't an n-n relation, it's n-1. Each revision has only one comment, it's an error if there are multiple rows with the same revcomment_rev.

n-n or 1-n, same things apply. I meant "link table/relationship table".

We shouldn't change mediawiki based on labs needs (although we can facilitate it). Labs could have its own extra indexes if needed.

n-n or 1-n, same things apply.

Except my question is "what's the best way to add a unique constraint on (revcomment_rev)?" and that hasn't been answered.

Well, on top of the previous thing, if additionally one of the column needs a unique constraint, there is only one way to do it, make it unique. That is completely independent from the performance or best practices of the tables, it is just a contraint.

Change 357892 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] WIP: Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

Change 357599 merged by jenkins-bot:
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

I wrote about my concerns about the deduplication scheme at T153333#3491632 , since that's where most of the discussion on that topic was. Apologies for the late review, but seeing the code has made the problems clearer to me.

Change 357892 merged by jenkins-bot:
[mediawiki/core@master] Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

Change 374856 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374857 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374858 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374859 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374860 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

Change 374861 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374862 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

Change 374863 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374864 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374865 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374862 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

Change 374857 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374865 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374859 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374864 merged by jenkins-bot:
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374861 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374863 merged by jenkins-bot:
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374856 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374858 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374860 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

That's the last open patch for this merged – does that mean this is Resolved? Obviously there's still the migration of content in Wikimedia production, but I guess that's for T166733 not this ticket?

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

Change 376286 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

Change 376286 merged by Chad:
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

Thanks! I also managed to find it in error.log on mwlog1001 this morning. It turns out there were three different traces: one in Flow fixed by https://gerrit.wikimedia.org/r/#/c/374861/, and two in MobileFrontend that are being tracked at T175161.

Ok, let's mark this resolved. The only remaining known issue is tracked in T175161.

For my own convenience (and in case someone asks), WMF deployment is coordinated on T174569.

Change 377779 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

Change 377779 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

Change 404506 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506

Change 404506 merged by jenkins-bot:
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506

Quick question: the description shows plans to refactor cu_changes.cuc_comment to cuc_comment_id. The activity on this task seems to have stopped and this refactor doesn't seem to have happened yet. Will it happen eventually or has it been abandoned?

Tasks should be filed against CheckUser and other extensions to handle their conversions.

Right, but this task description mentions the CheckUser change, so either this description should be updated or subtasks should be added here, no? Otherwise how would people like me know where to look?

Pinging @WDoranWMF to ask the question of whether the cu_changes mentioned in this ticket are happening and wether we have a timeline for those

hi @Nuria I'll check into it and find out but I'm not aware of the CheckUser changes planned to be worked on at the moment. I'll split those changes out into their own task. Is this blocking should we increase that tasks priority?

@WDoranWMF not blocking, we just need to know cause changes such as these affects our history reconstruction algorithms and we need to adjust them. If you create a task and tag Analytics so we can follow progress it will be excellent, many thanks.

Ok perfect, I've created T232531. It's in our process now so I'll look into when we can schedule. I've tagged Analytics, I'll make sure to do the same for other tasks that impact this. Thanks!