See T252812: Investigate watchlist sizes (limiting or handling large ones properly) for context. This task documents ideas for ways to purge the watchlist. The end result may for instance be a maintenance script that DBAs can run occasionally.
Problem
According to the DBAs, the watchlist table has grown to be one of the most problematic tables. The extreme size can cause the query optimizer to malfunction.
Unlike tables like revision and logging, the watchlist table doesn't need to be static. It serves only to convenience the end user in monitoring pages. If the end user isn't using their watchlist, the corresponding rows in the table are needlessly occupying space and in the long run affect the health and performance of the table for all other users.
In addition, we have for years provided ways to automatically watch pages via preferences -- which are even on by default -- inviting scalability issues. Expiring-Watchlist-Items for the first time provides means to automatically unwatch, which is likely to help, but not eliminate the long-term issue.
Idea
(not a concrete proposal)
Create a maintenance script to purge the many millions of unused rows in the watchlist.
Here are some ideas for the criteria; some or all may not be safe assumptions and will surely need community input:
- Bots: On WMF wikis, the "watch pages and files I create" preference is turned on by default. Consequently, bots have created millions of rows in the watchlist table when seemingly they are not actually be making use of the watchlist. Some examples:
- commonswiki: 14% (~22 million rows) of the watchlist table are owed to bots, which appear to be bots that upload files and simply have the default preference set to watch them
- wikidatawiki: 1.1% (~91 million rows) -- for instance bots that automatically create items after articles on Wikipedia are created
- enwiki: 4.7% (~10.3 million rows) -- take counter-vandalism bots for instance; they create User talk pages when issuing warnings
- mgwiktionary: 99.8% (~13.2 million rows) -- a single bot mass-created nearly every entry on the wiki
- Foundation-banned users: They are explicitly not welcomed to return to our projects, so presumably their watchlist could safely be cleared
- Community-banned/blocked users: After some grace period (say, 5 years) it may be acceptable to clear their watchlists
- Retired users: Some accounts have left the project ages ago with no intention to return, but their rows in the watchlist table remain. This (sadly) may include deceased users, many of which were quite prolific.
- Deleted pages: Users typically don't think to unwatch pages after they've been deleted. There are legitimate reasons to continue watching these titles (say to be notified when they're recreated), but the ratio of when they are recreated versus never return might be worth investigating. Perhaps some sort of grace period is reasonable -- say if the page was deleted 5 years ago, the purging routine could automatically unwatch them.
- Humans using (semi-)automation: Similar issue to with bots. Some non-bot accounts use automation to mass-create pages, unknowingly watching each and every one of them. It will be difficult to identify such accounts, but for instance if a human is watching millions of pages, it's probably safe to assume they aren't getting much out of Special:Watchlist due to how slow it is, if it even loads at all.