Refactor comment storage in the database and abstract access in MediaWiki
Closed, ResolvedPublic
Actions

Description

This is the task to implement T153333: RFC: How should we store longer revision comments?, now that the request has been closed. The text below is my current plan for the code I need to write.

Database

We'll create a new comment table that looks like this:

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/comment_text ON comment (comment_text(100));

Comments stored in this table will be deduplicated. Replication to Labs will need to hide rows that aren't referred to by any of the tables listed below, or that are only referred to by the tables below in rows with revision-deleted comments. I'll provide more details on that in the schema change subtask, once I get to that point.

The end state will be to make the following changes to existing tables to refer to this new table instead of storing the comment directly:

revision.rev_comment → revision.rev_comment_id
archive.ar_comment → archive.ar_comment_id
ipblocks.ipb_reason → ipblocks.ipb_reason_id
image.img_description → image.img_description_id
oldimage.oi_description → oldimage.oi_description_id
filearchive.fa_deleted_reason → filearchive.fa_deleted_reason_id
filearchive.fa_description → filearchive.fa_description_id
recentchanges.rc_comment → recentchanges.rc_comment_id
logging.log_comment → logging.log_comment_id
protected_titles.pt_reason → protected_titles.pt_reason_id
cu_changes.cuc_comment → cu_changes.cuc_comment_id (from CheckUser)
cu_log.cul_reason → cu_log.cul_reason_id (from CheckUser)
- These two should be ignored when determining visibility in the Labs replication
global_block_whitelist.gbw_reason → global_block_whitelist.gbw_reason_id (from GlobalBlocking)
globalblocks.gb_reason → globalblocks.gb_reason_id (from GlobalBlocking)
flow_revision.rev_content → flow_revision.rev_content_id (from StructuredDiscussions)

However, since schema changes on large busy tables are a lot of work that we don't want to be blocked on for months, the revision and image tables will use a temporary auxiliary table that will later be merged into the main tables:

CREATE TABLE /*_*/revision_comment_temp (
  revcomment_rev bigint unsigned NOT NULL,     -- → revision.rev_id
  revcomment_comment bigint unsigned NOT NULL, -- → comment.comment_id
  PRIMARY KEY (revcomment_rev, revcomment_comment)
) /*$wgDBTableOptions*/;

CREATE TABLE /*_*/image_comment_temp (
  imgcomment_image varchar(255) binary NOT NULL, -- → image.img_name, ugh
  imgcomment_comment bigint unsigned NOT NULL,   -- → comment.comment_id
  PRIMARY KEY (imgcomment_image, imgcomment_comment)
) /*$wgDBTableOptions*/;

MediaWiki code

There will be a temporary feature flag with four states:

Read and write the old columns only.
Write both the old and new columns. Read from new preferentially, falling back to old.
Write only the new columns. Read from new preferentially, falling back to old.
Read and write the new columns only.

There will be a class that manages the feature flag and such. Methods it'll probably need:

getFields( $key ): Pass e.g. rev_comment, returns the fields that should be selected. The comment might have to be lazy-loaded later.
getJoin( $key ): Pass e.g. rev_comment, returns the tables to join, join conditions, and fields that should be selected.
insert( $dbw, $key, $comment ): Inserts the comment into the comment table, if applicable and necessary, and returns a mapping of field → value to be included in the insert into the main table.
- I'll want a version of this that takes some LogEntry-like class (or maybe just use LogEntry?) and generates the comment_text and comment_data from it.
getComment( $key, $row ): Returns the comment string (or message?). May need to make a DB query if getFields() was used instead of getJoin().

Then I'll need to search the code for references to the old columns and patch them to use this new class.

Details

Subject	Repo	Branch	Lines /-
Add `comment` table and code to start using it	mediawiki/core	master	3 K -342
Non-MySQL comment table updates	mediawiki/core	master	260 -9
Use CommentStore to access core comment fields when available	mediawiki/extensions/AbuseFilter	wmf/1.30.0-wmf.17	40 -8
Use CommentStore to access core comment fields when available	mediawiki/extensions/AbuseFilter	master	40 -8
Handle new fields for RecentChange object attributes	mediawiki/extensions/Flow	wmf/1.30.0-wmf.17	5 -0
Handle new fields for RecentChange object attributes	mediawiki/extensions/Flow	master	5 -0
Use CommentStore to access core comment fields	mediawiki/extensions/FlaggedRevs	master	12 -8
Use CommentStore to access core comment fields when available	mediawiki/extensions/CentralNotice	master	27 -3
Use CommentStore to access core comment fields	mediawiki/extensions/OAuth	master	9 -5
Use CommentStore to access core comment fields when available	mediawiki/extensions/Renameuser	master	20 -5
Use CommentStore to access core comment fields	mediawiki/extensions/CheckUser	master	25 -3
Use CommentStore to access core comment fields	mediawiki/extensions/WikimediaMaintenance	master	29 -5
Use CommentStore to access core comment fields	mediawiki/extensions/CentralAuth	master	11 -5
Use CommentStore to access core comment fields when available	mediawiki/extensions/MobileFrontend	master	16 -1
Add constants for schema migration feature flags	mediawiki/core	master	25 -0

Related Objects
Search...

View Standalone Graph

This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Status	Assigned	Task
		· · ·
Resolved	Anomie	T6715 Allow comments longer than 255 bytes
Resolved	Anomie	T166733 Deploy refactored comment storage
Resolved	daniel	T174038 Initial implementation of MCR page update interface
Resolved	daniel	T174024 Implement MCR revision retrieval interface
Resolved	Anomie	T166732 Refactor comment storage in the database and abstract access in MediaWiki
Resolved	None	T166798 Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas
Duplicate	None	T232531 Refactor Comment fields for CheckUser Component
		· · ·

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Jdforrester-WMF subscribed.Jun 1 2017, 6:43 PM

Jack_who_built_the_house subscribed.Jun 1 2017, 11:39 PM

Niharika subscribed.Jun 5 2017, 2:44 PM

• brooke mentioned this in T167246: Refactor "user" & "user_text" fields into "actor" reference table.Jun 7 2017, 2:14 AM

Change 357599 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

gerritbot added a project: Patch-For-Review.Jun 7 2017, 1:40 PM

@jcrespo: It'd be nice to prevent duplicate rows in these temporary tables at the database level as well as the application level, which means a unique index on the revcomment_rev column. Which do you think would be best?

Primary key on (revcomment_rev, revcomment_comment), unique index on (revcomment_rev).
Primary key on (revcomment_rev), covering non-unique index on (revcomment_rev, revcomment_comment).
Primary key on (revcomment_rev), no covering index.

When I test these on a small local database (around 4000 rows), EXPLAIN on #1 uses a 'ref' on PRIMARY with "Using index" while the other two both prefer an 'eq_ref' on PRIMARY with no "Using index".

I will answer based on certain assumptions, please tell me if those are wrong:

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks. Basically, I expect queries to be:

SELECT ... FROM revision JOIN revision_comment_temp ON revision.rev_id = revcomment.revcomment_rev JOIN comment on revcomment.revcomment_rev = comment.comment_id WHERE /* filter on revision */

n-n relationship tables are normally setup with a covering PK in the order they are most frequently access, that would be (revcomment_rev, revcomment_comment), that would allow them to be unique, clustered by the main access, and using a covering index. If the other access is needed, create an additional index on (revcomment_comment) or on (revcomment_comment, revcomment_rev), but for this second one, I would wait for it to be necesary, for queries like:

SELECT ... FROM comment JOIN comment on revcomment.revcomment_rev = comment.comment_id JOIN revision ON revision.rev_id = revcomment.revcomment_rev WHERE /* filter on comment */

Where it is expected for the join to go in the opposite direction. I would assume that is not currently the case because I think there is currently no index on comment, but please tell me otherwise, as that will change my suggestion. I strongly suggest to not add such a possibility on a first step (it is ok, if it is for maintenance/migration purposes).

In T166732#3323750, @jcrespo wrote:

Most, if not all of the accesses will be from revision to comment. There may be accesses from comment to revision, but probably for maintenance or analytics tasks.

For production usage, all accesses will be from revision to comment.

In addition to potential maintenance or analytics tasks, Labs replica filtering will probably need to make accesses from comment to revision to determine whether a comment row should be visible or not. These sorts of things would also likely want indexes on columns like ipblocks.ipb_reason_id that aren't using these temporary tables.

n-n relationship tables

This isn't an n-n relation, it's n-1. Each revision has only one comment, it's an error if there are multiple rows with the same revcomment_rev.

n-n or 1-n, same things apply. I meant "link table/relationship table".

We shouldn't change mediawiki based on labs needs (although we can facilitate it). Labs could have its own extra indexes if needed.

n-n or 1-n, same things apply.

Except my question is "what's the best way to add a unique constraint on (revcomment_rev)?" and that hasn't been answered.

Well, on top of the previous thing, if additionally one of the column needs a unique constraint, there is only one way to do it, make it unique. That is completely independent from the performance or best practices of the tables, it is just a contraint.

Ok, thanks.

Bamyers99 subscribed.Jun 7 2017, 9:51 PM

tstarling moved this task from Inbox to Watching on the MediaWiki-Platform-Team-Archived board.Jun 8 2017, 6:27 AM

-jem- subscribed.Jun 8 2017, 10:58 AM

Blahma subscribed.Jun 8 2017, 12:41 PM

Change 357892 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] WIP: Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

• Tbayer subscribed.Jun 12 2017, 10:08 AM

Change 357599 merged by jenkins-bot:
[mediawiki/core@master] Add constants for schema migration feature flags

https://gerrit.wikimedia.org/r/357599

zhuyifei1999 subscribed.Jul 25 2017, 8:13 PM

I wrote about my concerns about the deduplication scheme at T153333#3491632 , since that's where most of the discussion on that topic was. Apologies for the late review, but seeing the code has made the problems clearer to me.

fgiunchedi subscribed.Aug 2 2017, 7:41 AM

daniel added a parent task: T174024: Implement MCR revision retrieval interface.Aug 24 2017, 2:29 PM

daniel added a parent task: T174038: Initial implementation of MCR page update interface.Aug 24 2017, 2:57 PM

Change 357892 merged by jenkins-bot:
[mediawiki/core@master] Add comment table and code to start using it

https://gerrit.wikimedia.org/r/357892

Change 374856 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374857 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374858 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374859 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374860 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

Change 374861 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374862 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

Change 374863 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374864 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374865 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374862 merged by jenkins-bot:
[mediawiki/extensions/MobileFrontend@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374862

ReleaseTaggerBot added a project: MW-1.30-release-notes (WMF-deploy-2017-09-05 (1.30.0-wmf.17)).Sep 1 2017, 3:00 PM

Change 374857 merged by jenkins-bot:
[mediawiki/extensions/CentralAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374857

Change 374865 merged by jenkins-bot:
[mediawiki/extensions/WikimediaMaintenance@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374865

Change 374859 merged by jenkins-bot:
[mediawiki/extensions/CheckUser@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374859

Change 374864 merged by jenkins-bot:
[mediawiki/extensions/Renameuser@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374864

Change 374861 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/374861

Change 374863 merged by jenkins-bot:
[mediawiki/extensions/OAuth@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374863

Change 374856 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374856

Change 374858 merged by jenkins-bot:
[mediawiki/extensions/CentralNotice@master] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/374858

Change 374860 merged by jenkins-bot:
[mediawiki/extensions/FlaggedRevs@master] Use CommentStore to access core comment fields

https://gerrit.wikimedia.org/r/374860

That's the last open patch for this merged – does that mean this is Resolved? Obviously there's still the migration of content in Wikimedia production, but I guess that's for T166733 not this ticket?

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

Change 376286 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

Change 376286 merged by Chad:
[mediawiki/extensions/Flow@wmf/1.30.0-wmf.17] Handle new fields for RecentChange object attributes

https://gerrit.wikimedia.org/r/376286

In T166732#3582015, @Anomie wrote:

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

Navigating to both https://www.mediawiki.org/wiki/Special:History/Manual:$wgPreviewOnOpenNamespaces/pl and https://m.mediawiki.org/wiki/Special:History/Manual:$wgPreviewOnOpenNamespaces/pt-br doesn't throw for me, but Kibana has about 1,500 an hour happening right now (with just group0/1 on wmf.17).

In T166732#3585891, @Niharika wrote:

In T166732#3582015, @Anomie wrote:

There was a mention on IRC that something is hitting a "Using deprecated fallback handling for comment rev_comment" warning, but I haven't heard back with a stack trace yet.

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2017.09.06/mediawiki?id=AV5Ye5F3ePsvZ6Lqq2vp&_g=()?

Thanks! I also managed to find it in error.log on mwlog1001 this morning. It turns out there were three different traces: one in Flow fixed by https://gerrit.wikimedia.org/r/#/c/374861/, and two in MobileFrontend that are being tracked at T175161.

Ok, let's mark this resolved. The only remaining known issue is tracked in T175161.

For my own convenience (and in case someone asks), WMF deployment is coordinated on T174569.

TheDJ mentioned this in T175443: [BUG] API:Imageinfo returns null for &iiprop=comment.Sep 9 2017, 7:45 AM

Change 377779 had a related patch set uploaded (by Chad; owner: Anomie):
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

Change 377779 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@wmf/1.30.0-wmf.17] Use CommentStore to access core comment fields when available

https://gerrit.wikimedia.org/r/377779

CCicalese_WMF edited projects, added MediaWiki-Platform-Team-Archived (MWPT-Q1-Jul-Sep-2017); removed MediaWiki-Platform-Team-Archived.Oct 13 2017, 9:25 PM

CCicalese_WMF moved this task from Ready to Done on the MediaWiki-Platform-Team-Archived (MWPT-Q1-Jul-Sep-2017) board.

Umherirrender mentioned this in T181930: Comment schema migration and foreign db repos not working well.Dec 3 2017, 12:33 PM

Change 404506 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506

Change 404506 merged by jenkins-bot:
[mediawiki/core@master] Non-MySQL comment table updates

https://gerrit.wikimedia.org/r/404506

ReleaseTaggerBot added a project: MW-1.31-release-notes (WMF-deploy-2018-02-06 (1.31.0-wmf.20)).Jan 18 2018, 4:00 AM

Anomie mentioned this in T188132: Merge image_comment_temp table into the image table.Feb 23 2018, 7:33 PM

• Nancywilson1 subscribed.Oct 15 2018, 4:53 AM

This comment was removed by Legoktm.

Legoktm subscribed.Oct 15 2018, 7:15 AM

MGChecker added a project: MediaWiki-Comment-store.Nov 12 2018, 7:45 PM

Lucas_Werkmeister_WMDE mentioned this in T215422: Migrate Wikibase to use comment_data field instead of SummaryFormatter.Feb 6 2019, 3:14 PM

Lucas_Werkmeister_WMDE mentioned this in T215637: Implement translatable edit summaries / multilingual comments using comment_data.Feb 8 2019, 5:47 PM

• Bstorm closed subtask T166798: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas as Invalid.Feb 8 2019, 6:14 PM

Anomie changed the status of subtask T166798: Decide whether back-compat views for upcoming major schema changes will be provided in the Labs replicas from Invalid to Resolved.Feb 8 2019, 6:58 PM

Milimetric mentioned this in T217848: Sqoop: remove cuc_comment and join to comment table.Mar 7 2019, 4:34 PM

Pathoschild mentioned this in T217853: tools.meta was creating high cpu load on wikireplicas, rate-limited.Mar 8 2019, 3:53 PM

Lucas_Werkmeister_WMDE mentioned this in T218127: Use CommentStoreComment throughout MediaWiki.Mar 12 2019, 4:41 PM

Quick question: the description shows plans to refactor cu_changes.cuc_comment to cuc_comment_id. The activity on this task seems to have stopped and this refactor doesn't seem to have happened yet. Will it happen eventually or has it been abandoned?

Tasks should be filed against CheckUser and other extensions to handle their conversions.

Right, but this task description mentions the CheckUser change, so either this description should be updated or subtasks should be added here, no? Otherwise how would people like me know where to look?

Pinging @WDoranWMF to ask the question of whether the cu_changes mentioned in this ticket are happening and wether we have a timeline for those

hi @Nuria I'll check into it and find out but I'm not aware of the CheckUser changes planned to be worked on at the moment. I'll split those changes out into their own task. Is this blocking should we increase that tasks priority?

• WDoranWMF mentioned this in T232531: Refactor Comment fields for CheckUser Component.Sep 10 2019, 8:31 PM

@WDoranWMF not blocking, we just need to know cause changes such as these affects our history reconstruction algorithms and we need to adjust them. If you create a task and tag Analytics so we can follow progress it will be excellent, many thanks.

Ok perfect, I've created T232531. It's in our process now so I'll look into when we can schedule. I've tagged Analytics, I'll make sure to do the same for other tasks that impact this. Thanks!

• WDoranWMF added a subtask: T232531: Refactor Comment fields for CheckUser Component.Sep 10 2019, 8:37 PM

Anomie added a subtask: T233004: Update CheckUser for actor and comment table.Sep 17 2019, 6:43 PM

Aklapper removed subscribers: • Tbayer, Anomie.Oct 16 2020, 5:43 PM

Izno removed a subtask: T233004: Update CheckUser for actor and comment table.Apr 27 2021, 5:29 PM

bd808 mentioned this in T294448: <Break-Fix> IFTTT integration: fix or sunset?.Feb 12 2023, 9:24 PM

Maintenance_bot removed a project: Patch-For-Review.Feb 12 2023, 9:30 PM

Restricted Application added a project: Performance-Team. · View Herald TranscriptFeb 12 2023, 9:30 PM

	Anomie
	May 31 2017, 9:05 PM

Refactor comment storage in the database and abstract access in MediaWikiClosed, ResolvedPublicActions