Add ability to search by user agent from CheckUser interface
Open, Medium, Public

Assigned To

None

Authored By

	kaldari
	Sep 27 2016, 11:30 PM

Description

The CheckUser interface currently surfaces user agents when listing users or edits, but there's no way to search by user agent. It would be nice if you could click on a listed user agent and it would then show you all users or edits performed by that user agent/IP combination.

This may be an expensive query, so we may have to introduce a new database column with hashed user agents that could be used as an index. This would preclude wildcard or prefix searches (for example, from a text field), but in almost all cases the user will want to search for a specific user agent anyway, so I think that's a decent trade-off.

We will need to add one or more indexes to the table for this.
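As a rough illustration of the hashed-column idea, here is a minimal Python sketch using an in-memory SQLite table. The table name `demo_changes`, the column `agent_hash`, the index name, and the choice of SHA-256 are all assumptions made for the example; they are not the CheckUser schema.

```python
import hashlib
import sqlite3


def agent_hash(user_agent: str) -> str:
    """Return a fixed-length digest usable as an exact-match index key."""
    return hashlib.sha256(user_agent.encode("utf-8")).hexdigest()


def build_demo_db(rows):
    """Create an in-memory table with a hypothetical agent_hash column and index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE demo_changes (user TEXT, ip TEXT, agent TEXT, agent_hash TEXT)"
    )
    # Index on the short hash (plus IP) instead of the long raw string.
    db.execute("CREATE INDEX demo_agent_hash ON demo_changes (agent_hash, ip)")
    db.executemany(
        "INSERT INTO demo_changes VALUES (?, ?, ?, ?)",
        [(user, ip, agent, agent_hash(agent)) for user, ip, agent in rows],
    )
    return db


def users_for_agent_and_ip(db, user_agent: str, ip: str):
    """Exact-match lookup: hash the searched string and compare hashes."""
    cur = db.execute(
        "SELECT DISTINCT user FROM demo_changes "
        "WHERE agent_hash = ? AND ip = ? ORDER BY user",
        (agent_hash(user_agent), ip),
    )
    return [row[0] for row in cur]
```

Because only the digest is indexed, a partial string such as "WikiTiki" hashes to a different value than "WikiTiki 1.0" and finds nothing, which is the loss of wildcard and prefix search described above.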

The user agent search will only be performed as a combined search for user agent and IP address. (If you just searched for user agent on its own, a common user agent would return too many results to be meaningful.)

There will be a second text field under the existing IP/username field that only appears when you click on a user agent link or a user agent is specified in the query string.

Description of the workflow, Get users:

User pastes IP address into "IP/username" field, chooses Get users
Results list has usernames, IP address and user agents. All three are links. User clicks on one of the user agent links.
On refresh, there is an "IP/username" field with the IP address filled in, and a second "user agent" field with the user agent filled in.
User can now search again for Get edits or Get users, and the results will only show this IP and user agent combination.
User can also blank the "user agent" field, if they don't want to use it anymore.
Alternatively, user can change the IP to a different IP or a CIDR range.
Note: The "IP/username" field allows for ranges. We should do the same for the "user agent" field, if possible.
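The filtering behaviour in the workflow above (an IP or CIDR range, optionally narrowed by an exact user agent) can be sketched in Python as follows. The function name `matches` and its parameters are hypothetical, and the sketch checks one row in memory rather than showing how CheckUser would query its tables.

```python
import ipaddress


def matches(row_ip: str, row_agent: str, target: str, user_agent: str = "") -> bool:
    """Return True if a logged row falls inside the searched IP or range
    and, when a user agent is supplied, also has exactly that user agent."""
    # strict=False lets a single address such as "192.0.2.7" act as a /32 range.
    network = ipaddress.ip_network(target, strict=False)
    if ipaddress.ip_address(row_ip) not in network:
        return False
    # A blank user agent field means "do not filter on user agent".
    return user_agent == "" or row_agent == user_agent
```

For example, `matches("192.0.2.7", "WikiTiki 1.0", "192.0.2.0/24", "WikiTiki 1.0")` is true, and blanking the user agent argument falls back to a plain IP range check.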

Description of the workflow, Get edits:

User pastes IP address or username into "IP/username" field, chooses Get edits
Results list has log entries, with IP and user agent listed under each entry. IP and user agent are both links. User clicks on user agent link.
On refresh, there is an "IP/username" field with the IP address filled in, and a second "user agent" field with the user agent filled in.

Wireframes:

CheckUser user agents - box_2.png (539×905 px, 41 KB)

CheckUser user agents - get edits.png (461×1 px, 81 KB)

Related Objects

Status	Assigned	Task
Resolved	• TBolliger	T120734 Epic ⚡️ Improve MediaWiki's blocking tools
Open	None	T146837 Add ability to search by user agent from CheckUser interface
Open	None	T147894 Create index for agent_id columns in the CheckUser result tables
Open	None	T361139 Normalise the user agent column in CheckUser result tables
Open	None	T361208 Remove agent columns from CheckUser result tables
Open	None	T361206 Stop writing old for user agent schema migration on WMF wikis
Open	None	T361205 Stop writing old for user agent schema migration
Open	None	T361199 Set user agent schema migration config to read new on WMF wikis
Open	None	T361201 Set user agent schema migration config to read new in extension.json
Open	None	T361192 Read user agent strings from the cu_useragent table in Special:CheckUser
Open	None	T361193 Read user agent strings from the cu_useragent table in the CheckUser API
Open	None	T361195 Read user agent strings from the cu_useragent table in Special:Investigate
Open	None	T361198 Create a maintenance script to populate the cu_useragent table and agent_id columns with values from the agent columns
Open	None	T361196 Write to the cu_useragent table and agent_id columns on WMF wikis
Open	None	T361197 Write to the cu_useragent table and agent_id columns on by default in extension.json
Open	None	T361174 Write to agent_id columns in the CheckUserInsert service when migration configuration allows it
Open	None	T361172 Purge rows from cu_useragent
Open	None	T361173 Add schema migration config for cu_useragent table
Resolved	Dreamy_Jazz	T361140 Add user agent ID column to each CheckUser result table
Open	None	T361210 FYI: Changes to the cuc_agent column in the cu_changes table
Resolved	Dreamy_Jazz	T361928 Update mediawiki.org pages for schema changes made by adding the cu_useragent table
Resolved	Dreamy_Jazz	T359312 Create cu_useragent table
Resolved	• Marostegui	T361631 Unable to create table cu_useragent on labtestwiki
Open	None	T328289 update labtestwiki user and password
Resolved	• Marostegui	T361673 Filter cu_useragent on sanitarium
Resolved	• DannyH	T147895 Investigation: Test searching for user agents

Event Timeline


Restricted Application added subscribers: JEumerus, Aklapper. · Sep 27 2016, 11:30 PM

• DannyH moved this task from New & TBD Tickets to Needs Discussion on the Community-Tech board. · Sep 28 2016, 12:10 AM

kaldari updated the task description. · Sep 28 2016, 2:21 AM

kaldari added a subscriber: • DannyH.

Huji triaged this task as Low priority. · Sep 28 2016, 3:35 AM

Huji added a parent task: T139810: RFC: Overhaul the CheckUser extension.

kaldari updated the task description. · Sep 28 2016, 3:36 AM

• DannyH added a parent task: T120734: Epic ⚡️ Improve MediaWiki's blocking tools. · Oct 3 2016, 6:17 PM

• DannyH mentioned this in T120734: Epic ⚡️ Improve MediaWiki's blocking tools.

• DannyH updated the task description. · Oct 4 2016, 9:58 PM

@DannyH I don't think the wireframes you added are practical: there is no way to know for sure that a user with username "Mozilla/5.0 ..." doesn't exist. Or a user might edit with the user-agent "DannyH".

Perhaps a better option would be to have a dropdown which has three options (IP, user-agent, username) followed by a textbox in which you provide the data.

On a separate note, I think we should also allow searching in the UA using wildcards. I can think of the efficiency counter-arguments, but an "exact match only" search is not going to be that useful.

Hi @Huji,

Well, the user-agent is a long string of data, generated automatically. A full example is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.27604.111 Safari/537.36

So you'd search for this by clicking on a user-agent string that you get from doing a query on an IP address or user name. You wouldn't just type it in; it would be too easy to make mistakes.

So you can't have a user-agent called "DannyH" or anything like that, and if we have a user with a name like "(that whole string of browser names and digits)", then I would like an explanation from that user for why they have such a ridiculous username. :)

I agree that we still have to figure out how useful the exact match will be, compared to wildcards. We need to test out searching for user-agents, and see if we get enough results, or too many.

In T146837#2691166, @DannyH wrote:

Well, the user-agent is a long string of data, generated automatically. A full example is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.27604.111 Safari/537.36

A user-agent is "usually" like that. It does not have to be. To my knowledge, there is no RFC that enforces a specific format for user-agent strings, and even if there is, there is no guarantee that a user would not modify their user-agent to something non-standard. We want the CU tool to know if it is searching for "DannyH" in the user field or the UA field. This cannot be decided just by looking at the input string.

So you can't have a user-agent called "DannyH" or anything like that

Yes you can. It is not illegal. In fact I have seen users with really ridiculous user-agent strings.

In T146837#2691374, @Huji wrote:

So you can't have a user-agent called "DannyH" or anything like that

Yes you can. It is not illegal. In fact I have seen users with really ridiculous user-agent strings.

You could spoof it, but the browser would seemingly only give a standard-ish user agent by default; otherwise it may get unwanted content. E.g. DannyH might result in a blank page with "your browser is not supported". I would not be surprised if the more prolific LTAs are spoofing user agents into something ridiculous like this, though. If that is the case, then being able to search for it would certainly be helpful, given the unlikelihood of others with a similar UA. Also, many browsers update automatically, so you might want to do a wildcard search omitting the browser version to allow for some variation.
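To make the "omit the browser version" idea concrete, here is a small Python sketch that compares two user agent strings while ignoring the version number of one product token. The function name and the assumption that a version looks like digits and dots after `Product/` are illustrative only.

```python
import re


def matches_ignoring_version(stored: str, candidate: str, product: str) -> bool:
    """Return True if candidate equals stored except for the version
    number that follows "<product>/" (for example "Chrome/51.0...")."""
    # Regex for the product token with any dotted version number.
    version_token = re.escape(product) + r"/[\d.]+"
    # Split the stored string around that token and escape the literal parts,
    # so parentheses and dots elsewhere in the string are matched literally.
    literal_parts = re.split(version_token, stored)
    pattern = version_token.join(re.escape(part) for part in literal_parts)
    return re.fullmatch(pattern, candidate) is not None
```

If the product token does not occur in the stored string, the pattern degenerates to an exact match. Note that a search like this cannot use a hash of the full string, which is the efficiency concern raised elsewhere in this task.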

I think Huji is right. There's really no way to reliably differentiate between a UA string and a username. In fact a lot of my bots have very simple user agent strings like "WikiTiki 1.0". The suggestion of a drop-down select list sounds like a good idea.
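A minimal Python sketch of why an explicit selector resolves the ambiguity: the search carries its type alongside the text, so "DannyH" as a username and "DannyH" as a user agent are distinct targets. The names `TargetType`, `SearchTarget`, and `parse_form` are hypothetical and not part of CheckUser.

```python
from dataclasses import dataclass
from enum import Enum


class TargetType(Enum):
    """The three drop-down options suggested in the discussion."""
    IP = "ip"
    USERNAME = "username"
    USER_AGENT = "user-agent"


@dataclass(frozen=True)
class SearchTarget:
    kind: TargetType
    value: str


def parse_form(selected: str, text: str) -> SearchTarget:
    """Build a typed target from the drop-down value and the textbox.

    Raises ValueError if the drop-down value is not a known option."""
    return SearchTarget(TargetType(selected), text.strip())
```

With this shape the tool never has to guess from the input string which field is being searched.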

Okay, that can work. Thanks for helping to educate me on user-agents. :)

Thanks @DannyH. To briefly respond to your second last comment above, using a non-standard UA generally doesn't result in any denial of service.

• DannyH updated the task description. · Oct 6 2016, 5:48 PM

kaldari updated the task description. · Oct 6 2016, 5:51 PM

kaldari updated the task description. · Oct 6 2016, 6:03 PM

I've updated the wireframes, with a drop-down for IP address, Username and User agent.

Huji added a project: MediaWiki-libs-Rdbms. · Oct 7 2016, 12:12 AM

• DannyH raised the priority of this task from Low to Medium. · Oct 11 2016, 9:25 PM

kaldari created subtask T147894: Create index for agent_id columns in the CheckUser result tables. · Oct 11 2016, 9:28 PM

• DannyH created subtask T147895: Investigation: Test searching for user agents. · Oct 11 2016, 9:28 PM

kaldari moved this task from Needs Discussion to Older: Team Work on the Community-Tech board. · Oct 18 2016, 10:18 PM

• DannyH updated the task description. · Oct 29 2016, 12:36 AM

• DannyH updated the task description. · Oct 29 2016, 1:22 AM

I've revamped the spec and wireframes, following our meeting with @Jalexander. Now you can only search by user agent and IP together.

Huji updated the task description. · Oct 29 2016, 2:34 AM

Niharika mentioned this in T147895: Investigation: Test searching for user agents. · Nov 3 2016, 12:37 PM

• DannyH closed subtask T147895: Investigation: Test searching for user agents as Resolved. · Nov 3 2016, 10:30 PM

Trijnstel subscribed. · Nov 26 2016, 9:33 PM

Matanya merged a task: T33712: CheckUser User-Agent check feature. · Jan 11 2017, 11:17 PM

Matanya added subscribers: • bzimport, • MZMcBride, Krenair and 2 others.

• DannyH added a project: Stewards-and-global-tools. · Jan 11 2017, 11:17 PM

Matanya merged a task: T50623: Add user-agent variable for abuse filtering. · Jan 11 2017, 11:19 PM

Matanya added subscribers: Billinghurst, RuyP, Callanecc and 2 others.

Billinghurst unsubscribed. · Jan 12 2017, 12:40 PM

MarcoAurelio subscribed. · Jan 19 2017, 12:24 PM

• TBolliger subscribed. · Mar 2 2017, 12:34 AM

A few thoughts:

If we're only expecting users to provide the user agent by clicking it from a previous search, perhaps there is no input box at all? This could simplify the process.
Given the complexity, this system feels prone to displaying error messages. We'll need to list out the messages that could occur and make sure they use clear language.
Is there a loading indicator? Should we add one?

In T146837#3082493, @TBolliger wrote:

A few thoughts:

If we're only expecting users to provide the user agent by clicking it from a previous search, perhaps there is no input box at all? This could simplify the process.

But why? I think this is a useful feature. We should allow as many easy ways to search UAs as we can.

• TBolliger added a project: Anti-Harassment. · Mar 9 2017, 7:08 PM

Matanya moved this task from Untriaged to High priority on the Stewards-and-global-tools board. · Apr 19 2017, 7:36 PM

Krinkle removed a project: MediaWiki-libs-Rdbms. · May 8 2017, 1:26 AM

TheresNoTime subscribed. · Dec 21 2017, 10:37 PM

TheresNoTime awarded a token. · Dec 21 2017, 10:43 PM

• TBolliger removed a project: Community-Tech. · Feb 23 2018, 11:49 PM

Restricted Application added subscribers: MGChecker, alaa. · Feb 23 2018, 11:49 PM

• TBolliger moved this task from Untriaged to Product/Tech backlog on the Anti-Harassment board. · Mar 9 2018, 1:55 PM

• TBolliger mentioned this in T100070: Allow CheckUsers to set User agent (UA)-based IP Blocks. · Apr 4 2018, 8:41 PM

Will it be possible to search for IPs using given UA? For example, UA "DannyH" will usually be uncommon and searching for it can be useful.

In T146837#4229650, @Urbanecm wrote:

Will it be possible to search for IPs using given UA? For example, UA "DannyH" will usually be uncommon and searching for it can be useful.

Yes, that is one of the use cases for this tool. It will work similar to IP addresses in the CheckUser tool.

0x010C subscribed. · Oct 31 2018, 5:58 PM

MER-C mentioned this in T213875: Explore alternatives to browser fingerprinting for anti-abuse efforts. · Jan 24 2019, 2:36 PM

• TBolliger removed a project: Anti-Harassment. · Jan 30 2019, 9:42 PM

One note on technical side of things: You can use a hash index. In that case you would lose the ability to do regex searches and pattern matching but it's pretty fast. This is a classic example of hash index lookup, I've done this before for other cases.

Huji mentioned this in T147894: Create index for agent_id columns in the CheckUser result tables. · Feb 18 2019, 10:23 PM

Please see my related comment in T147894#4962824, in which I explain why, at least for now, it is best to restrict the functionality to searching either by IP or by UA (so if both are provided, we would return an error and ask the user to remove one). It is possible (though I am not sure how likely) that allowing both the IP and the UA to be specified would translate to the need for a massive index on the database tables that would be unjustifiable given the few use cases for a joint search, so I would rather defer that to a later time in the interest of having the UA search in production in the near term.
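The either/or rule proposed in this comment could look roughly like the following Python sketch. The function name, return values, and error messages are hypothetical; real messages would need to be written and localised separately.

```python
def validate_search(ip: str = "", user_agent: str = "") -> str:
    """Return which kind of search to run ("ip" or "user-agent").

    Raises ValueError when both fields or neither field is filled in,
    so the caller can show an error and ask the user to fix the form."""
    ip, user_agent = ip.strip(), user_agent.strip()
    if ip and user_agent:
        raise ValueError("Specify either an IP or a user agent, not both.")
    if not ip and not user_agent:
        raise ValueError("Specify an IP or a user agent.")
    return "ip" if ip else "user-agent"
```

Keeping the two searches separate means each one can be served by its own single-column index rather than a combined one.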

• DannyH unsubscribed. · Feb 19 2019, 9:45 PM

Beetstra subscribed. · Oct 11 2019, 6:44 PM

A way to circumvent the large index is to turn this into something like an abusefilter for checkusers only. Getting alerted when someone in a range uses a recognizable UA is a gazillion times better than finding a sock after 50 edits and waiting for CUs to check and confirm while the sock is already on the next account.

@Beetstra what you said reminded me of T50623: Add user-agent variable for abuse filtering and I think the limitation there is that AbuseFilter currently does not have a way to restrict certain aspects of a filter (or ability to view or edit certain filters and their associated logs) to a specific group like CUs.

But, while that is a useful feature, it is still useful to also be able to run CU using UA. There is a Venn diagram here, with the two circles not fully overlapped.

The only way I would see is that there is an abusefilter-variety that is enabled for checkusers (so a separate one). It was however confessed to me that the AbuseFilter itself needs a serious upgrade, so I can imagine that a CU-clone of it is not soon going to happen.

I do note that AbuseFilter does require a special right to see filters and results, I don’t know how difficult it is to make another right that allows even deeper hidden filters.

• Demian mentioned this in T237486: Organize tasks related to the overhaul of CheckUser. · Nov 6 2019, 8:24 AM

• Demian edited parent tasks, added: T237034: CheckUser 2.0: Input form; removed: T139810: RFC: Overhaul the CheckUser extension. · Nov 7 2019, 6:50 AM

dbarratt removed a parent task: T237034: CheckUser 2.0: Input form. · Nov 7 2019, 10:20 PM

DannyS712 subscribed. · Oct 16 2020, 9:26 PM

Huji added a subtask: T305930: Normalize cu_changes table. · Apr 12 2022, 6:23 PM

Hymeros awarded a token. · Oct 9 2022, 9:15 PM

Dreamy_Jazz mentioned this in T258105: Implement storage for User-Agent Client Hints header data. · Jun 3 2023, 11:11 PM

Dreamy_Jazz mentioned this in T351944: Create indexes cuc_this_oldid for cu_changes and cule_log_id for cu_log_event. · Nov 27 2023, 6:53 PM

Dreamy_Jazz edited subtasks, added: T361139: Normalise the user agent column in CheckUser result tables; removed: T305930: Normalize cu_changes table. · Mar 28 2024, 9:25 AM

Johannnes89 subscribed. · Apr 17 2024, 7:41 AM

kostajh mentioned this in T372651: Recommend users for further investigation based on similarity scores of unique device identifiers. · Fri, Aug 16, 3:26 PM

	F4672128: CheckUser user agents - get edits.png
	Oct 29 2016, 1:28 AM

	F4672119: CheckUser user agents - box_2.png
	Oct 29 2016, 1:22 AM

	F4569912: CheckUser user agents - get IP.jpg
	Oct 6 2016, 10:14 PM

	F4569910: CheckUser user agents - get edits.jpg
	Oct 6 2016, 10:14 PM

	F4569917: CheckUser user agents - get users.jpg
	Oct 6 2016, 10:14 PM

	F4557231: CheckUser user agents - get users.jpg
	Oct 4 2016, 9:58 PM

	F4557225: CheckUser user agents - box.jpg
	Oct 4 2016, 9:58 PM

	F4557228: CheckUser user agents - get IP.jpg
	Oct 4 2016, 9:58 PM
