This is the task to implement T153333: RFC: How should we store longer revision comments?, now that the request has been closed. The text below is my current plan for the code I need to write.
Database
We'll create a new comment table that looks like this:
CREATE TABLE /*_*/comment( comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT, comment_text BLOB NOT NULL, comment_data BLOB NOT NULL ) /*$wgDBTableOptions*/; CREATE INDEX /*i*/comment_text ON comment (comment_text(100));
Comments stored in this table will be deduplicated. Replication to Labs will need to hide rows that aren't referred to by any of the tables listed below, or that are only referred to by the tables below in rows with revision-deleted comments. I'll provide more details on that in the schema change subtask, once I get to that point.
The end state will be to make the following changes to existing tables to refer to this new table instead of storing the comment directly:
- revision.rev_comment → revision.rev_comment_id
- archive.ar_comment → archive.ar_comment_id
- ipblocks.ipb_reason → ipblocks.ipb_reason_id
- image.img_description → image.img_description_id
- oldimage.oi_description → oldimage.oi_description_id
- filearchive.fa_deleted_reason → filearchive.fa_deleted_reason_id
- filearchive.fa_description → filearchive.fa_description_id
- recentchanges.rc_comment → recentchanges.rc_comment_id
- logging.log_comment → logging.log_comment_id
- protected_titles.pt_reason → protected_titles.pt_reason_id
- cu_changes.cuc_comment → cu_changes.cuc_comment_id (from CheckUser)
- cu_log.cul_reason → cu_log.cul_reason_id (from CheckUser)
- These two should be ignored when determining visibility in the Labs replication
- global_block_whitelist.gbw_reason → global_block_whitelist.gbw_reason_id (from GlobalBlocking)
- globalblocks.gb_reason → globalblocks.gb_reason_id (from GlobalBlocking)
- flow_revision.rev_content → flow_revision.rev_content_id (from StructuredDiscussions)
However, since schema changes on large busy tables are a lot of work that we don't want to be blocked on for months, the revision and image tables will use a temporary auxiliary table that will later be merged into the main tables:
CREATE TABLE /*_*/revision_comment_temp ( revcomment_rev bigint unsigned NOT NULL, -- → revision.rev_id revcomment_comment bigint unsigned NOT NULL, -- → comment.comment_id PRIMARY KEY (revcomment_rev, revcomment_comment) ) /*$wgDBTableOptions*/; CREATE TABLE /*_*/image_comment_temp ( imgcomment_image varchar(255) binary NOT NULL, -- → image.img_name, ugh imgcomment_comment bigint unsigned NOT NULL, -- → comment.comment_id PRIMARY KEY (imgcomment_image, imgcomment_comment) ) /*$wgDBTableOptions*/;
MediaWiki code
There will be a temporary feature flag with four states:
- Read and write the old columns only.
- Write both the old and new columns. Read from new preferentially, falling back to old.
- Write only the new columns. Read from new preferentially, falling back to old.
- Read and write the new columns only.
There will be a class that manages the feature flag and such. Methods it'll probably need:
- getFields( $key ): Pass e.g. rev_comment, returns the fields that should be selected. The comment might have to be lazy-loaded later.
- getJoin( $key ): Pass e.g. rev_comment, returns the tables to join, join conditions, and fields that should be selected.
- insert( $dbw, $key, $comment ): Inserts the comment into the comment table, if applicable and necessary, and returns a mapping of field → value to be included in the insert into the main table.
- I'll want a version of this that takes some LogEntry-like class (or maybe just use LogEntry?) and generates the comment_text and comment_data from it.
- getComment( $key, $row ): Returns the comment string (or message?). May need to make a DB query if getFields() was used instead of getJoin().
Then I'll need to search the code for references to the old columns and patch them to use this new class.