Page MenuHomePhabricator

Detect manual reverts
Closed, ResolvedPublic

Description

As part of T254074: Implement the reverted edit tag we want to implement manual revert detection in addition to the current undo- and rollback-based triggers. A manual revert is made when the editor reverts the page to an exact previous state, not using the built-in undo or rollback functionality. See T252366 for a discussion on this.

This should be implemented using exact SHA-1 matching of revisions in the database vs the current revision. As that field lacks an index over it, we'll need to search only within a certain radius of revisions, which can be configurable. This paper suggests using 15 revisions as the default radius (that is also what Product Analytics team uses by default for their mwreverts library). I think this can be done with a single relatively inexpensive DB query utilizing the LIMIT clause. I hope the query optimizer is smart enough to make it a scan of 15 rows, instead of the entire table, I'll have to double-check this, though. I think we'll need to read this data from master, to avoid issues with replication lag, though I'm not 100% sure on this.

Most of the logic will probably live in the EditResultBuilder. The code will be triggered by EditResultBuilder::buildEditResult().

Also: should we mark manual reverts with a software-defined edit tag, like mw-undo and mw-rollback? This would certainly be consistent with the other forms of reverts, and the fact that these manual reverts will mark reverted edits just like undo and rollback will.

Event Timeline

After some experimenting I came up with this mockup (I tested it on MariaDB 10.3):

select * from (
        select * from revision
        where rev_page = 248348
        order by rev_id desc
        limit 15
    ) revs_filtered
    where rev_sha1 = 'civ8b9x9x0xbvyvy8cnr5a9fle2cdqc';

This will select the 15 most recent revisions and then check if any of them has the SHA-1 we're looking for, so the scan is limited to 15 rows. This particular page has over 1500 revisions.
EXPLAIN for this query:

 ------ ------------- ------------ ------ ------------------------------------------------ ------------- --------- ------- ------ ------------- 
| id   | select_type | table      | type | possible_keys                                  | key         | key_len | ref   | rows | Extra       |
 ------ ------------- ------------ ------ ------------------------------------------------ ------------- --------- ------- ------ ------------- 
|    1 | PRIMARY     | <derived2> | ALL  | NULL                                           | NULL        | NULL    | NULL  |   15 | Using where |
|    2 | DERIVED     | revision   | ref  | rev_page_id,page_timestamp,page_user_timestamp | rev_page_id | 4       | const | 1556 | Using where |
 ------ ------------- ------------ ------ ------------------------------------------------ ------------- --------- ------- ------ ------------- 

And, according to MySQL manual:

If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast.

In this case we do have a key like this, rev_page_id that consists of rev_page and rev_id. Sooo I think we should be fine here, we have an index scan, then an index-based sort operation limited to N rows and then a table scan on a small number of rows.

Correction: we should order revisions by timestamp, not by their IDs (which can do all kinds of crazy things). So the order clause should look like order by rev_timestamp desc.

The conclusion though is the same: we still have an index that would suit this use case, page_timestamp.

Change 607855 had a related patch set uploaded (by Ostrzyciel; owner: Ostrzyciel):
[mediawiki/core@master] WIP: EditResultBuilder: manual revert detection

https://gerrit.wikimedia.org/r/607855

I just found T216297 which contains an interesting discussion on this topic. One thing in particular made me curious:

Note that manual reverts (by clicking edit on an older revision, and then saving) is detected by hash, but the "original rev ID" is not set in this case. This could be done without too much trouble, but if I recall correctly, this may confuse some extensions.

@daniel, do you remember what extensions could be affected by such an approach? Or do you have some hints on how to find them? Currently my plan is to do just that, I'll be setting the original rev ID to the edit a user manually reverted to. I had hoped this wouldn't break anything :D

Change 609242 had a related patch set uploaded (by Ostrzyciel; owner: Ostrzyciel):
[mediawiki/core@master] Add mw-manual-revert change tag

https://gerrit.wikimedia.org/r/609242

Change 607855 merged by jenkins-bot:
[mediawiki/core@master] EditResultBuilder: manual revert detection

https://gerrit.wikimedia.org/r/607855

Change 609242 merged by jenkins-bot:
[mediawiki/core@master] Add mw-manual-revert change tag

https://gerrit.wikimedia.org/r/609242

Change 613801 had a related patch set uploaded (by DannyS712; owner: Shirayuki):
[mediawiki/core@master] PageUpdater: Update @since tag from 1.35 to 1.36 for $wgManualRevertSearchRadius

https://gerrit.wikimedia.org/r/613801

Change 613801 merged by jenkins-bot:
[mediawiki/core@master] PageUpdater: Update @since tag from 1.35 to 1.36 for $wgManualRevertSearchRadius

https://gerrit.wikimedia.org/r/613801

As a bot owner, I would like to mention several situations where the "manual revert" tag is applied in a situation where it doesn't seem relevant. The tag itself doesn't do much harm, but as far as I understand, there is a plan to notify users from those manual reverts (T154637), so this could become more annoying.

Will there be a way to exclude some users or pages from that mechanism?