
Create a special page to handle additions, removals, changes and logging of spam blacklist entries
Open, Needs Triage, Public

Assigned To
None
Authored By
JakobVoss
Jan 3 2006, 4:06 AM
Referenced Files
F28265615: spamblacklist.png
Feb 22 2019, 6:28 PM

Description

There should be a special page to manage the spam blacklist. Admins should be able to check URLs against the blacklist. I had a URL that was blacklisted and I could not find out which regular expression matched it (OK, I could write a simple Perl script to do this on my computer). There are some other related suggestions about spam, see:

Maybe you should rewrite the entire spam protection mechanism.

Details

Reference
bz4459

Event Timeline

bzimport raised the priority of this task from to Low. Nov 21 2014, 9:00 PM
bzimport added a project: SpamBlacklist.
bzimport set Reference to bz4459.
bzimport added a subscriber: Unknown Object (MLST).

I finally found that www.2100books.com matches against [0-9]+books\.com. So how
do I know which [0-9]+books\.com pages are good and which are evil? Who entered
the regexp, and because of which pages? Maybe you can find out in the version history,
but managing the spam blacklist should be as easy as blocking users and
pages.
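
The lookup asked for here (which blacklist line matches a given URL?) can be sketched in a few lines of Python. This is only an illustration, not the extension's code: the function name `find_matching_entries` and the sample entries are hypothetical, the pattern is assumed to read `[0-9]+books\.com`, and each line is tested on its own with `re.search`, which is simpler than how SpamBlacklist actually assembles its regexes.

```python
import re

def find_matching_entries(url, blacklist_lines):
    """Return (line_number, pattern) pairs for every blacklist line matching url."""
    hits = []
    for number, line in enumerate(blacklist_lines, start=1):
        # Drop trailing "#" comments and surrounding whitespace.
        pattern = line.split("#", 1)[0].strip()
        if not pattern:
            continue
        try:
            if re.search(pattern, url, re.IGNORECASE):
                hits.append((number, pattern))
        except re.error:
            # Skip lines that are not valid regexes in Python's dialect.
            continue
    return hits

# Hypothetical sample entries.
sample_blacklist = [
    "# example entries",
    r"[0-9]+books\.com  # assumed form of the pattern from this report",
    r"cheap-pills\.example",
]

matches = find_matching_entries("http://www.2100books.com/", sample_blacklist)
print(matches)
```

Reporting the line number along with the pattern is what makes it possible to go back to the page history and see who added the entry and why.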

accnospamtom wrote:

A rewritten "Spam protection mechanism" should definitely be part
of each MediaWiki installation. Many sites use this software, but
they don't install optional extensions. Sysops have to fight
wikispam without proper "weapons", as they usually have no access
to the servers.

This feature should be enabled by default (with an empty spam
blacklist that is editable by sysops).

robchur wrote:

Setting product correctly. I have seen a demand for a slightly easier-to-use
version of Spam Blacklist, so it might not be a bad idea to consider this a
separate request. Leaving ambiguous for now.

brian wrote:

(In reply to comment #0)

http://bugzilla.wikimedia.org/show_bug.cgi?id=1505

Or just bug 1505 - it automatically creates a link.

robchur wrote:

*** Bug 4698 has been marked as a duplicate of this bug. ***

mike.lifeguard bugs wrote:

*** Bug 13805 has been marked as a duplicate of this bug. ***

mike.lifeguard bugs wrote:

*** Bug 14090 has been marked as a duplicate of this bug. ***

mike.lifeguard bugs wrote:

If SpamRegex is fixed up, it might fulfil this need; see bug 13811.

mike.lifeguard bugs wrote:

(In reply to comment #8)

If SpamRegex is fixed up, it might fulfil this need; see bug 13811.

Per bug 13811 comment 14, that's apparently not true. This will probably be fulfilled by AbuseFilter, which Werdna is working on, so I've CCed him.

We don't have a special page yet, but there are tools like http://toolserver.org/~seth/grep_regexp_from_url.cgi that let you search for an entry and the reason it was added. This tool can be used in MediaWiki:Spamprotectionmatch, e.g., http://de.wikipedia.org/wiki/MediaWiki:Spamprotectionmatch/en.

So afaics the main thing - which was the difficulty in finding already blacklisted links - is solved.

mike.lifeguard bugs wrote:

(In reply to comment #10)

We don't have a special page yet, but there are tools like
http://toolserver.org/~seth/grep_regexp_from_url.cgi that let you
search for an entry and the reason it was added. This tool can be used in
MediaWiki:Spamprotectionmatch, e.g.,
http://de.wikipedia.org/wiki/MediaWiki:Spamprotectionmatch/en.

So afaics the main thing - which was the difficulty in finding already
blacklisted links - is solved.

External tools are *not* sufficient.

mike.lifeguard bugs wrote:

There are probably-useful notes on http://www.mediawiki.org/wiki/Extension_talk:SpamBlacklist#more_detailed_manual_and_suggestions and certainly on http://www.mediawiki.org/wiki/Regex-based_blacklist

Both AbuseFilter and SpamRegex would need lots of work to be a viable alternative to SpamBlacklist at present. Some of the major concerns with replacing SpamBlacklist with AbuseFilter follow (concerns regarding replacing SpamBlacklist with SpamRegex are discussed on bug 13811):

*Global filters (bug 17811) are really required since probably 1/3 of our spam blocking as a Wikimedia community happens globally.
**Relatedly, local wikis would need some way to opt-out of blocking individual domains (or individual filters - and you might block multiple domains with a single filter - we do use regex after all :D)

*Also relatedly, we need to provide output for non-WMF wikis - but only the spam-related filters! So, probably some method of categorizing them will be necessary. That'd also be useful since if you have several thousand filters, it will quickly become *very* difficult to search through them all for a particular one - tagging/categorizing of filters and searching within the notes will be needed.
**As well, this assumes that all third parties will install AbuseFilter - which will not happen. So, ideally there would be a compatibility function to provide output at least somewhat equivalent to the output of SpamBlacklist which could be used as input for third party installations.

*Regarding workflow: AbuseFilter is not designed for blocking spam (it is meant to target pattern vandalism), and the workflow reflects that. We need to be able to quickly and painlessly add formulaic filters which do a very small subset of what AbuseFilter is capable of. I had suggested in the past that there could be filter templates for common purposes (such as blocking spam) - users would just fill in the blank and apply the filter.

*Performance: Someone should compare the performance effects of blocking all the domains we're currently blocking with SpamBlacklist using AbuseFilter instead (using one filter for each line of regex vs one filter for the whole thing would also be a useful comparison - is there an impact there? That could affect workflow significantly depending on the answer.)

*AbuseFilter can resolve bug 16325 in a user-friendly way: If all_links has whatever.com then present a particular message asking them to remove it (but potentially let them still save the edit or not, depending)

*For authors, showing the edit form after a hit (bug 16757) is important & AbuseFilter would resolve that.

*The AbuseFilter log would resolve bug 1542 nicely (& we are even replicating that to the toolserver).

*Rollback can be exempted easily, which would resolve bug 15450 perfectly.

*AbuseFilter can use new_html to resolve bug 15582 somewhat at least -- someone should figure out how true that statement is, since I'm no expert there. Potentially bug 16610 too?

*If AbuseFilter were modified, it could potentially resolve bug 16466 in an acceptable manner. Bug 14114 too?

*AbuseFilter could potentially resolve bug 16338 and bug 13599, depending on how one sets up the filters.

*AbuseFilter could maybe be modified to allow per-page exceptions (bug 12963)... something like a whitelist filter? Or you could mash that into the original filter, which goes back to the workflow problem.

*AbuseFilter's ccnorm() and/or rmspecials() would resolve the unicode problem (bug 12896) AFAICT -- though that should certainly be tested & verified.

*AbuseFilter's warn function would resolve bug 9416 in a very user-friendly manner.


In summation: AbuseFilter needs to implement global filters, local exemption, backward compatibility with SpamBlacklist on third-party installs, better filter tagging/searching and other workflow improvements before it can be considered a viable alternative to SpamBlacklist.
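
The comparison raised in the performance bullet above (one filter per line of regex vs one filter for the whole list) can be tried in miniature with plain Python regexes. This is a rough sketch under stated assumptions: the 500 patterns and the sample text are made up, and the timings say nothing about how AbuseFilter or SpamBlacklist behave in PHP. It only shows the shape of such a benchmark.

```python
import re
import timeit

# Hypothetical blacklist lines; real lists hold thousands of entries.
lines = [rf"spam{i}\.example" for i in range(500)]
text = "see http://www.spam499.example/page and http://good.example/"

# Variant A: one compiled regex per line, tested one after another.
per_line = [re.compile(p) for p in lines]

def check_per_line(s):
    return any(rx.search(s) for rx in per_line)

# Variant B: all lines joined into a single alternation.
combined = re.compile("|".join(f"(?:{p})" for p in lines))

def check_combined(s):
    return combined.search(s) is not None

if __name__ == "__main__":
    for name, fn in (("per line", check_per_line), ("combined", check_combined)):
        seconds = timeit.timeit(lambda: fn(text), number=200)
        print(f"{name}: {seconds:.4f}s for 200 checks")
```

Both variants give the same yes/no answer, but only the per-line variant can say which entry matched without extra work, which is the workflow trade-off mentioned in that bullet.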

What about mw:Extension:Phalanx? Looks like a good tool.

CCing Jack Phoenix as he seems in charge of Phalanx.

What about mw:Extension:Phalanx? Looks like a good tool.

Phalanx has quite some redundancy with tools we already have (especially AbuseFilter), also it has quite some rough edges AFAIR.

In T6459#968333, @hoo wrote:

Phalanx has quite some redundancy with tools we already have (especially AbuseFilter), also it has quite some rough edges AFAIR.

This. Performance is also a big deal, and when it comes to Phalanx, it's just...not good. Which, I guess, in part explains why Wikia rewrote parts of the Phalanx backend in Scala last year (see https://github.com/Wikia/scala-backend/tree/master/phalanx). Most pre-existing tools -- namely AbuseFilter, GlobalBlocking, SpamBlacklist & TitleBlacklist -- handle most of the tasks Phalanx does, too.

It's probably worth noting that SpamBlacklist hits (user X triggered spam filter on page Y) have been logged to a log viewable on Special:Log since Q3 2013 (see acaf4262d94269e55f9ac45179fc7159c961e346).

Anyway, as a final note on Phalanx, we should be working on improving pre-existing tools to make it redundant, specifically adding account blocking support to GlobalBlocking and whatnot else, but that's a whole different task not relevant to this report.

MarcoAurelio raised the priority of this task from Low to Needs Triage. Sep 20 2016, 6:36 AM
MarcoAurelio updated the task description. (Show Details)

As already said above in this thread, you could use

https://tools.wmflabs.org/searchsbl

But this is, of course, just an external tool.

I think this will be a more modern way to handle URL blacklisting. I very much support this.

I am going to work out a thought experiment here. My suggestion is to rewrite the current spam-blacklist extension (or better, to rewrite another extension):

  • take the current AbuseFilter, take out all the code that interprets the rule ('conditions').
  • Make 2 fields:
    • one text field for regexes that block added external links (the blacklist). Can contain many rules (one on each line).
    • one text field for regexes that override the block (whitelist overriding this blacklist field; that is generally simpler and cleaner than writing a complex regex, not everybody is a specialist on regexes).
  • Add namespace choice (checkboxes, so one can choose not to blacklist something in one particular namespace, with the addition of 'all', 'content namespaces only' and 'talk namespaces only' options).
  • Add user status choice (checkboxes for the different roles, or like the page-protection levels)
    • Some links are fine in discussions but should not be used in mainspace, others are a total no-no
    • Some image links are fine in the file namespace to tell where it came from, but not needed in mainspace
  • Leave all the other options:
    • Discussion field for evidence (or better, a talk-page like function)
    • Enabled/disabled/deleted - not needed, turn it off, obsolete then delete
    • 'Flag the edit in the edit filter log' - maybe nice to be able to turn it off, to get rid of the real rubbish that doesn't need to be logged
    • Rate limiting - catch editors that start spamming an otherwise reasonably good link
    • Warn - could be a replacement for en:User:XLinkBot
    • Prevent the action - as is the current blacklist/whitelist function
    • Revoke autoconfirmed - make sure that spammers are caught and checked
    • Tagging - for combining certain rules to be checked by RC patrollers.
    • I would consider to add a button to auto-block editors on certain typical spambot-domains.

This should overall be much more lightweight than the current AbuseFilter (all it does is regex-testing as the spam-blacklist does, only it has to cycle through maybe thousands of AbuseFilters).

One could consider to expand it to have rules blocked or enabled on only certain pages, but that sounds complicated to me.

I know that this functionality is there in the current AbuseFilter, but running many regexes using AbuseFilter on every edit is going to be a burden for the servers. This rewrite should not be significantly heavier than the current Spam Blacklist.
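
To make the core of this proposal concrete, here is a minimal Python sketch of one rule with a blacklist field, a whitelist field that overrides it, and a namespace restriction. Everything in it is hypothetical (the `LinkRule` class, the `blocks` method, the sample domains), and it leaves out user status, rate limiting, warnings and the other options listed above.

```python
import re
from dataclasses import dataclass, field

@dataclass
class LinkRule:
    """Hypothetical rule: blacklist regexes, whitelist overrides, namespaces."""
    blacklist: list
    whitelist: list = field(default_factory=list)
    namespaces: frozenset = frozenset()  # empty set means "all namespaces"

    def blocks(self, url, namespace):
        # A rule restricted to certain namespaces does nothing elsewhere.
        if self.namespaces and namespace not in self.namespaces:
            return False
        # The whitelist field overrides the blacklist field.
        if any(re.search(p, url) for p in self.whitelist):
            return False
        return any(re.search(p, url) for p in self.blacklist)

rule = LinkRule(
    blacklist=[r"books\.example"],
    whitelist=[r"library\.books\.example"],
    namespaces=frozenset({0}),  # 0 is the main namespace in MediaWiki
)

print(rule.blocks("http://cheap.books.example/", 0))    # blacklisted in mainspace
print(rule.blocks("http://library.books.example/", 0))  # whitelisted
print(rule.blocks("http://cheap.books.example/", 1))    # talk namespace, rule inactive
```

Keeping the whitelist as a separate list of simple patterns is what avoids the complex exclusion regexes mentioned in the proposal.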

It looks like most of the consensus is in favor of creating a special page of some sort for the spam blacklist, since this task hasn't been WONTFIXed.

If there's no objection, I'm going to implement a special page, which will look similar to Special:Interwiki, and in which each entry will have the following options:

  • Display order (which will be an integer; used if you want to move entries around in the list)
  • Regex
  • User right (e.g. allow all users, but warn; allow only autoconfirmed; allow only extended autoconfirmed; etc.)
  • Warning/error message (if blank, will use the default)
  • Reason for change

If someone wants to just have a row reserved for a comment, they can just have a line starting with # as they do now, and all the other fields on that row will be ignored by the extension.
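
As a rough illustration of the row layout described above, the following Python sketch models entries with a display order, a regex, a user right and a message, and treats rows starting with # as comments. The `Entry` class, the `active_entries` helper, the default message value and the sample rows are all made up for this example; the real special page might store and process things differently.

```python
from dataclasses import dataclass

# Placeholder for whatever default message the extension would use.
DEFAULT_MESSAGE = "default-spam-message"

@dataclass
class Entry:
    order: int            # display order
    regex: str            # regex, or a comment row starting with "#"
    user_right: str = ""  # e.g. "autoconfirmed"; empty means no extra right needed
    message: str = ""     # custom warning/error message; blank means default

    @property
    def is_comment(self):
        return self.regex.lstrip().startswith("#")

def active_entries(entries):
    """Sort rows by display order, drop comment rows, fill in the default message."""
    result = []
    for entry in sorted(entries, key=lambda e: e.order):
        if entry.is_comment:
            continue  # all other fields on a comment row are ignored
        message = entry.message or DEFAULT_MESSAGE
        result.append((entry.regex, entry.user_right, message))
    return result

rows = [
    Entry(order=20, regex=r"pills\.example", user_right="autoconfirmed"),
    Entry(order=10, regex="# book spam, see talk page", message="ignored"),
    Entry(order=15, regex=r"[0-9]+books\.example", message="custom-books-warning"),
]
print(active_entries(rows))
```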

Any suggestions/comments? Any volunteers to take a look at the code once it's done? Thanks.

See mockup:

spamblacklist.png (1×1 px, 58 KB)

  • Title - the title for this set of entries.
  • Entry links to this page. Talk links to a discussion page for extended evidence. History links to the filter history. Hits links to the spam blacklist log for this filter.
  • Enabled/disabled/deleted - works as in AbuseFilter.
  • Block autopromote - revoke autoconfirmed status. See AbuseFilter.
  • Export to public blacklist - one upside of the spam blacklist being public is that it can be used externally. It is punitive against spammers. Public entries are broadcast at [[Spam blacklist]] and [[MediaWiki:Spam-blacklist]], whichever is relevant, so that they can be imported by external consumers and backwards compatibility retained. Non-exported entries are for internal use only (e.g. non-reliable sources).
  • The rest is as you describe.
  • Comments should be included.

Other comments:

  • Your implementation must scale to at least 100x what the current spam blacklists deal with today. I shouldn't be forced to ration spam blacklist entries for any reason (performance, usability, ...).
  • The summary special page where all entries are listed should look like [[Special:Abusefilter]]
  • There should be another special page that lists whitelist entries.
  • Display order is not important. Just chronologically enumerate entries like Special:Abusefilter numbers filters.

See mockup:

I'm wondering, can we split some of these features off into separate tasks, or does it all have to be in the same commit that adds the blacklist and whitelist special pages? I had in mind something closer in simplicity to Special:Interwiki, as opposed to Special:AbuseFilter, but maybe we need that extra complexity.

On a related note, is there anything that's currently done with the blacklist and whitelist pages, that's going to break when we switch to the special page format, unless we add that functionality back in? E.g. T157826 mentions some workarounds for dealing with the lack of certain features, e.g. using bots to revert additions of blacklisted items by non-autoconfirmed users.

As for the backend, here's what I'm thinking of for the spam_blacklist table:

  • sb_id (int(10) unsigned; the primary key)
  • sb_regex (BLOB, similar to af_pattern)
  • sb_actions (varbinary(255), similar to af_actions)
  • sb_level (varbinary(60) similar to pr_level)
  • sb_enabled (tinyint(1), similar to af_enabled)
  • sb_deleted (tinyint(1), similar to af_deleted)
  • sb_msg_ns (int(11), the page_namespace of the warning message)
  • sb_msg_title (varbinary(255), the page_title of the warning message)
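
For anyone who wants to play with this layout, here is a sketch of the proposed table using Python's built-in sqlite3 module. It is an approximation only: SQLite has no unsigned int or varbinary types, so the MySQL types from the list are noted in comments, and the NOT NULL constraints, the defaults and the sample values ("disallow", "autoconfirmed") are my assumptions rather than part of the proposal.

```python
import sqlite3

SCHEMA = """
CREATE TABLE spam_blacklist (
    sb_id        INTEGER PRIMARY KEY,          -- int(10) unsigned in the proposal
    sb_regex     BLOB NOT NULL,                -- BLOB, similar to af_pattern
    sb_actions   BLOB NOT NULL DEFAULT '',     -- varbinary(255)
    sb_level     BLOB NOT NULL DEFAULT '',     -- varbinary(60)
    sb_enabled   INTEGER NOT NULL DEFAULT 1,   -- tinyint(1)
    sb_deleted   INTEGER NOT NULL DEFAULT 0,   -- tinyint(1)
    sb_msg_ns    INTEGER,                      -- int(11)
    sb_msg_title BLOB                          -- varbinary(255)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Insert one hypothetical entry.
conn.execute(
    "INSERT INTO spam_blacklist (sb_regex, sb_actions, sb_level) VALUES (?, ?, ?)",
    (rb"[0-9]+books\.com", b"disallow", b"autoconfirmed"),
)

# Fetch the entries that should actually be applied to edits.
enabled = conn.execute(
    "SELECT sb_id, sb_regex FROM spam_blacklist "
    "WHERE sb_enabled = 1 AND sb_deleted = 0"
).fetchall()
print(enabled)
```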

I'm thinking future tasks could be to create a spam_blacklist_history table, similar to the abuse_filter_action table. There could also be a spam_blacklist_action table.

Also, another future task could be to add an sb_comment_id (blob) field, to link to a comment_id, if we want to split comments away into a separate field from the regex rather than having them after the #.

Or maybe we want to have some/all of that stuff from the get-go. I just don't want to make this commit too ambitious, since that might make the review process harder.

For example, do we need a history table from the get-go, or can we just rely on the logging and/or comment table for that, the way Special:Interwiki does? If people limit the length of the regex, the warning message titles, and the reasons for their changes, it might be able to fit in there, especially if we make use of log_params.

Your implementation must scale to at least 100x what the current spam blacklists deal with today. I shouldn't be forced to ration spam blacklist entries for any reason (performance, usability, ...).

Is there anything specific you have in mind, with regard to the ways in which this needs to be optimized for performance?

There should be another special page that lists whitelist entries.

Do you have a mockup for that, or is it just going to look pretty similar to the blacklist page?

So now I'm starting to think, since this is going to have so many of the features and aesthetics of AbuseFilter, the best approach would be to just rip off most of the code from AbuseFilter, and rip off a small amount of code from SpamBlacklist (the part that compares the blacklist to the whitelist to see if it's going to be triggered). In a case like this where the needed functionality is so close to what already exists, might as well just rip it off wholesale, rather than merely take inspiration from it.