Create CheckUser-level abuse filters
Open, Needs Triage, Public

Description

Several people have said this would be useful. It was also mentioned on the talk page on Meta about masking IPs.

Basically, just as we have public and private filters, we could add another visibility level (like "sensitive") for CUs only. Filters with "sensitive" visibility could contain explicit NPI mentions, and we could allow using new variables. For instance, the UA (T50623), maybe the IP even for registered users (T155553).

This idea has great potential, but there are various things to discuss:

  1. From a legal standpoint, would this be OK? Speaking about active filters, the access is sort of logged.
  2. What to do with /test etc. Using a "sensitive" variable to test a filter against recent changes is currently not logged.
  3. More generally, we don't have any system in place that would prevent people from using certain variables in certain contexts.
  4. A filter marked as "sensitive" should always keep the same visibility, to avoid any possible info leak.
  5. In theory, a schema change is not mandatory (af_flags is a commalist of flags); in practice, I'd strongly recommend doing that instead (i.e. add a new column, like "af_sensitive"; maybe in abuse_filter_history, too).
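To illustrate point 5, here is a minimal sketch of the two storage options. It assumes a hypothetical "sensitive" flag name inside af_flags and a hypothetical af_sensitive column; the helper names are made up for illustration and are not part of AbuseFilter.

```python
def parse_flags(af_flags: str) -> set:
    """Split a comma-separated flag string into a set of flag names."""
    return {flag.strip() for flag in af_flags.split(",") if flag.strip()}


def is_sensitive_via_flags(af_flags: str) -> bool:
    # Option 1: no schema change; "sensitive" is a hypothetical flag name.
    # The whole flag is compared, so "insensitive" does not match.
    return "sensitive" in parse_flags(af_flags)


def is_sensitive_via_column(row: dict) -> bool:
    # Option 2: dedicated column (hypothetical af_sensitive, stored as 0/1).
    return bool(row.get("af_sensitive", 0))
```

With a dedicated column, each query can filter on a plain condition instead of matching inside a string, which is probably what makes the audit of existing queries easier.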

Speaking for myself, I do not plan to take on this work in the short term. Aside from the blockers above, AbuseFilter is currently undergoing a number of relatively big changes, which I'd like to prioritize. Should a team (AHT?) be interested in exploring the idea (e.g. as part of the IP masking project), I can help.

Event Timeline

Niharika subscribed.

@Daimona This is an interesting idea. The Anti-Harassment team is prioritizing improvements to CheckUser in the near term. I don't think we can commit to also taking on AbuseFilter right now, but this project is on our minds for future work regarding IP masking.

Glad to hear that! As I said, I'm not prioritizing this either. However, I'm still available for help, suggestions, etc. if someone else plans to take this over, or to take on a share of the work.

Based on what I read on meta a while ago, and requests by checkusers, this idea was requested by a handful of people to better counter vandalism if we mask IPs.

Partial list of what to do (barring legal objections):

  • Introduce a new visibility level, with the following columns: af_sensitive on abuse_filter, afh_sensitive on abuse_filter_history, and afl_sensitive on abuse_filter_log
  • Audit all queries on all of these tables, and make sure to add proper checks for af_sensitive.
  • The UI in ViewEdit should have another checkbox for marking a filter as sensitive. Some pretty JS should disable the checkbox if the filter being edited is sensitive (note: don't do this when a filter is being created)
    • Q: What to do with the checkbox for hiding a filter? In theory, this should be superfluous. In practice, it depends on how user rights are assigned
  • On Special:AbuseFilter, there should be an option for showing sensitive filters only. Those filters should not be searchable for non-privileged people.
  • The same should be done for APIs.
  • When a filter is being edited, the edit should be disallowed if it tries to remove the "sensitive" flag from a sensitive filter
    • Q: What if a filter is marked as sensitive by mistake? We could add a check on the code of the filter, and still allow removing the sensitive flag if it doesn't contain sensitive variables. But this check would have to be applied to past revisions of that filter, too. I guess we can just forbid it for now.
  • Some parser-related class should be able to tell whether the code contains any sensitive variable. This is something that we can only determine at parsing time; this method could live in AFPSyntaxTree, or AFPTreeParser, or in AbuseFilterParser.
    • todo: Hard to tell for now, due to the big ongoing refactors to the AF parsers (mostly, we need to get rid of the old parser).
  • Code marked as sensitive should not be allowed in non-sensitive filters, /examine, /test, and details of an AbuseLog entry. This applies to the API modules, too. For filter edits, this should be checked upon saving.
    • In theory, they can be allowed in /tools, since it doesn't allow testing against actual data. In practice, it's probably more convenient to disallow it as well, in case something is changed in the implementation.
  • Whether or not edits to sensitive filters should be shown on Special:Log/abusefilter doesn't matter (there's no difference with T34959).
  • Hits to sensitive filters should not be shown on Special:AbuseLog if the user is not privileged.
  • Likewise, edits to sensitive filters should not be shown on Special:AbuseFilter/history.
  • Note: we'll need a list of variables available in sensitive filters. Right now I can think of the user agent (obviously). However, we don't really need to have a list right now.
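The save-time rules in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the variable list contains just user_agent, the function names are hypothetical, and the identifier scan is a crude regex stand-in for the parser-level check the list asks for (it would also match names inside string literals).

```python
import re
from typing import List, Optional

# Assumption: the only sensitive variable so far is the user agent.
SENSITIVE_VARIABLES = frozenset({"user_agent"})

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def uses_sensitive_variable(code: str) -> bool:
    """Crude token scan; a real check would walk the parsed syntax tree."""
    return any(tok.lower() in SENSITIVE_VARIABLES
               for tok in _IDENTIFIER.findall(code))


def validate_edit(was_sensitive: Optional[bool], is_sensitive: bool,
                  code: str) -> List[str]:
    """Return a list of error keys; was_sensitive is None for a new filter."""
    errors = []
    # The "sensitive" flag may never be removed from an existing filter.
    if was_sensitive and not is_sensitive:
        errors.append("cannot-remove-sensitive-flag")
    # Sensitive variables are only allowed in sensitive filters.
    if not is_sensitive and uses_sensitive_variable(code):
        errors.append("sensitive-variable-in-non-sensitive-filter")
    return errors
```

For example, validate_edit(True, False, code) always reports an error regardless of the code, which matches the "just forbid it for now" option.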

One important thing: logging. Due to the nature of abuse filters, it's impossible to log when private information is being accessed (because "logging" doesn't make any sense). We would already be logging:

  • Whenever a sensitive filter is edited
  • Whenever a sensitive filter is tripped

But we cannot currently log:

  • When sensitive variables are used in /test, /examine, etc.
  • Whenever a sensitive filter is not tripped. Which is to say: if a filter consists of user_agent === XXX, and an edit by User:Y isn't caught, then the UA of Y is not XXX.

I've been thinking about the logging part, and I came up with the following implementation:

  • AbuseLog entries for restricted filters are always hidden by default; viewing the AbuseLog for a given filter (NB I'm talking about the list of log entries, not a single log entry) will be logged;
  • Once the AbuseLog for a restricted filter is accessed [I guess we could make this last for the web session, I'm unsure how that's handled e.g. in CheckUser], accessing the details of single entries will cause an additional log entry. I believe we don't need to ask for a motivation again, within the same session.
  • Using the test interface with restricted variables will first give a list of testable actions, WITHOUT filter results; then, users may use checkboxes to select which actions to test, and those will be logged together after the form is submitted again.
  • /examine will also require a reason to examine a given edit.

All of the logging above will happen on Special:Log/log_name, and several subtypes will be available.

We recently encountered a case of doxing of several Wikipedians on fr-wp, which led us to hope for CU and/or OS level abusefilters: CU for a future user agent variable (T50623); OS to filter real names of Wikipedians with better privacy.
Should I open a new ticket for OS-level abusefilters, or should it be added to this ticket?

I guess it should be a new ticket, but let me set clear expectations: it's unlikely that this ticket will be worked on (soon): it would require massive changes, and it's complicated by the upcoming fade-out of the User-Agent header, as well as the IP masking thing (possibly). The former reason also holds true for OS-level filters.

Thank you for your reply @Daimona. I opened a new ticket, T290324; I understand that it will not be worked on any time soon.

Reedy renamed this task from Create CU-level abuse filters to Create CheckUser-level abuse filters. Sep 3 2021, 1:01 PM

Something else that I don't think has been said: Filter log entries would need to be deleted (as in "deleted for real with no backup") after 90 days, as per the data retention policy. Not just the "details", but the whole thing. If my IP happens to fall in the same range as some LTA, I don't want that fact recorded forever.

Alternatively, we could introduce optional CU-protected variables, such that the IP/UA can only be compared against these restricted variables, while the filter itself can otherwise stay public. (For variables shared across filters see T120740, but a filter-local variable may still be possible.)

If we want the abuse log to be restricted, we can automatically compute whether a log entry is restricted and hide restricted entries by default. There are two ways to calculate this: the rough way is to treat every filter (version) that uses restricted variables, and every hit thereof, as restricted; the more fine-grained way is to track, when actually running the filter, whether a restricted variable is actually read (it may not be, since evaluation can short-circuit).

This would introduce a restriction level on each abuse log entry (by comparison, the restriction level of a filter can be calculated automatically, or not calculated at all in the fine-grained approach, since abuse filters are mutable).
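The fine-grained approach can be sketched as a variable holder that records which restricted variables were actually read during one evaluation. This is a toy model, not the AbuseFilter evaluator: the filter is a Python callable, and TrackingVars, run_filter and example_predicate are hypothetical names.

```python
from typing import Callable, Iterable, Mapping, Tuple


class TrackingVars:
    """Variable holder that remembers reads of restricted variables."""

    def __init__(self, values: Mapping[str, object],
                 restricted: Iterable[str]) -> None:
        self._values = dict(values)
        self._restricted = frozenset(restricted)
        self.restricted_reads = set()

    def get(self, name: str) -> object:
        if name in self._restricted:
            self.restricted_reads.add(name)
        return self._values[name]


def run_filter(predicate: Callable[[TrackingVars], bool],
               variables: TrackingVars) -> Tuple[bool, bool]:
    """Return (matched, restricted); restricted is decided per run."""
    matched = bool(predicate(variables))
    return matched, bool(variables.restricted_reads)


def example_predicate(v: TrackingVars) -> bool:
    # `and` short-circuits: user_agent is only read for low edit counts.
    return v.get("user_editcount") < 10 and v.get("user_agent") == "BadBot"
```

With an edit count of 500, the user agent is never read, so that run would not be marked as restricted; with an edit count of 3 it is read and the run is marked restricted. The rough approach would instead mark both runs as restricted because the filter text mentions user_agent.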

One of the issues we'd likely run into when implementing this is that AbuseFilter makes use of the external storage for storing its data for revisions. The external storage is append only, meaning data once added to it cannot be altered or deleted, which makes it pretty much impossible to add CU-level information to it.

It might make sense to create a secondary backend for AbuseFilter variables, one that would be backed by the regular database, and periodically purged to comply with the 90 days data retention window. If we do that, we would be able to include IP addresses or user agents (or anything similarly-sensitive) into AbuseFilter, which would allow checkusers to benefit the most from CU filters.
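A minimal sketch of such a purgeable backend, using SQLite purely for illustration. The table name sensitive_vars, its columns, and the function names are all hypothetical; a real implementation would use the wiki's regular database and a scheduled job.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# 90 days, per the retention window mentioned in this thread.
RETENTION = timedelta(days=90)


def create_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sensitive_vars ("
        " log_id INTEGER PRIMARY KEY,"
        " recorded_at INTEGER NOT NULL,"  # Unix timestamp
        " payload TEXT NOT NULL)"
    )


def record(conn: sqlite3.Connection, log_id: int, payload: str,
           now: datetime) -> None:
    conn.execute(
        "INSERT INTO sensitive_vars (log_id, recorded_at, payload)"
        " VALUES (?, ?, ?)",
        (log_id, int(now.timestamp()), payload),
    )


def purge_expired(conn: sqlite3.Connection, now: datetime) -> int:
    """Hard-delete rows older than the retention window; return the count."""
    cutoff = int((now - RETENTION).timestamp())
    cur = conn.execute(
        "DELETE FROM sensitive_vars WHERE recorded_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount
```

Running purge_expired periodically keeps the store within the retention window, which the append-only external storage cannot do.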