Provide a list of zero-result searches
Closed, Declined · Public

Description

Author: dejan.papez

Description:
I'd like to propose the addition of a new special page that would display all the
searches that were unsuccessful (failed searches, no results searches, zero results searches). This would be a log of recently searched for but
unfound items. By sorting them on the basis of times they were searched for and not
found, we could more easily fill the gaps that still remain in the databases. This
would be complementary to Special:Wantedpages and Requested articles that are part of
some Wikimedia projects.

The log available to all users would contain the following data:

  • terms that have been looked for
  • their frequency (how many times they have been looked for)
  • the date of the first and of the last search for each unfound term.

It would also be very useful to be able to limit the displayed time period and to set
the sorting order (ascending/descending) and whether to include all results or only
those that have been searched for more than once in this time.
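
The proposed log could be sketched roughly as follows. This is only an illustration, assuming a hypothetical record type `ZeroResultSearch` (term plus timestamp) and a hypothetical `summarize` helper; neither exists in MediaWiki. It produces the term, its frequency, and the first and last search dates, with the optional time period, minimum count, and sort order described above.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical record: one entry per search that returned zero results.
ZeroResultSearch = namedtuple("ZeroResultSearch", ["term", "timestamp"])


def summarize(searches, start=None, end=None, min_count=1, descending=True):
    """Aggregate searches into (term, count, first_seen, last_seen) rows."""
    stats = {}
    for s in searches:
        # Limit the displayed time period, if one was requested.
        if start is not None and s.timestamp < start:
            continue
        if end is not None and s.timestamp > end:
            continue
        if s.term in stats:
            count, first, last = stats[s.term]
            stats[s.term] = (count + 1, min(first, s.timestamp), max(last, s.timestamp))
        else:
            stats[s.term] = (1, s.timestamp, s.timestamp)
    # Keep only terms searched at least min_count times in the period.
    rows = [(term, c, first, last)
            for term, (c, first, last) in stats.items() if c >= min_count]
    # Sort by frequency, using the term itself as a tie-breaker.
    rows.sort(key=lambda r: (r[1], r[0]), reverse=descending)
    return rows


log = [
    ZeroResultSearch("foo", datetime(2006, 1, 1)),
    ZeroResultSearch("bar", datetime(2006, 1, 2)),
    ZeroResultSearch("foo", datetime(2006, 1, 3)),
]
repeated_only = summarize(log, min_count=2)
```

Here `repeated_only` contains a single row for "foo", since "bar" was searched for only once.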

This would be very useful for smaller projects to know where to focus their limited
manpower, but also for larger ones to know which areas are not covered well and which
redirects should be created but have not been yet.

We've had a discussion here:

http://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(proposals%29&oldid=86310881#List_of_recently_not_found_results

Three problems were mentioned:

  • possible server overload or impracticality due to server setup
  • privacy issues: my opinion is that if the log does not show IPs or usernames, privacy is protected
  • someone would have to take the time to code this ;)

Version: unspecified
Severity: enhancement

Details

Reference
bz6373

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 9:20 PM
bzimport set Reference to bz6373.
bzimport added a subscriber: Unknown Object (MLST).
  • Bug 7969 has been marked as a duplicate of this bug.
  • Bug 26308 has been marked as a duplicate of this bug.

dejan.papez wrote:

I think, in addition to the reasons posted to the duplicates (using the system to improve the search system etc.), the feature would contribute most to the development not only of Wikipedia but particularly of Wiktionaries.

There was some discussion about that here:
http://www.gossamer-threads.com/lists/wiki/wikitech/186831

Also, Wikistics provided a list of top missing search queries at one point (by rewriting search URLs so that they were collected in the squid logs, then removing existing pages), but it is broken now:
http://wikistics.falsikon.de/2008/wikipedia/en/wanted/

For the time being, User:West.andrew.g's list can be used to get this information: https://en.wikipedia.org/wiki/User:West.andrew.g/Popular_redlinks

Nemo_bis edited projects, added MediaWiki-Search; removed MediaWiki-Special-pages.
Nemo_bis set Security to None.
Restricted Application added a subscriber: Aklapper.

This was actually one of the very first things that the Discovery Department discussed when we began our work on search.

Search data, and in particular the search queries that users enter, is assumed to contain personally identifying information unless proven otherwise. This is because we're storing arbitrary text input by users, and if that arbitrary input is surfaced publicly then there are a variety of ways that malicious people could game the system and do nefarious things. Even stripping all metadata, and simply presenting a flat list of queries, does not resolve this issue. In light of this, the information is subject to the privacy policy unless it can be anonymised, which is very difficult to do in an automated fashion and very time consuming to do manually.

In summary, this is a good idea, but it isn't going to happen any time soon because of the above complexities.

I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched.

As for spamming and the like, repeated requests that cluster by time, by originator, and so on can be collapsed into a single counted request. That would considerably reduce the impact of such bad-faith use.

We could also introduce a threshold below which data is not made public. That ensures that at least several users have searched for the same thing, which provides at least some degree of anonymity.
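
A minimal sketch of these two ideas, assuming events arrive as hypothetical `(term, originator_id, timestamp)` tuples and using an illustrative `collapse_and_threshold` helper. The one-hour window and the threshold of three originators are arbitrary values chosen for the example, not recommendations.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def collapse_and_threshold(events, window=timedelta(hours=1), min_originators=3):
    """events: iterable of (term, originator_id, timestamp) tuples.

    Returns {term: (collapsed_count, distinct_originators)} for terms that
    were searched for by at least min_originators different originators.
    """
    last_counted = {}              # (term, originator) -> time of last counted request
    counts = defaultdict(int)
    originators = defaultdict(set)
    for term, originator, ts in sorted(events, key=lambda e: e[2]):
        key = (term, originator)
        previous = last_counted.get(key)
        if previous is not None and ts - previous < window:
            continue  # same originator repeating within the window: collapse
        last_counted[key] = ts
        counts[term] += 1
        originators[term].add(originator)
    # Publish only terms that enough distinct originators searched for.
    return {t: (counts[t], len(originators[t]))
            for t in counts if len(originators[t]) >= min_originators}


t0 = datetime(2015, 6, 1, 12, 0)
events = [
    ("x", "a", t0),
    ("x", "a", t0 + timedelta(minutes=10)),   # collapsed into the first request
    ("x", "b", t0),
    ("x", "c", t0 + timedelta(hours=2)),
] + [("y", "a", t0 + timedelta(minutes=m)) for m in range(5)]  # one originator only
publishable = collapse_and_threshold(events)
```

In this example only "x" is publishable; "y" was repeated five times but by a single originator. Note that the originator identifiers needed for this counting are themselves sensitive data and would have to stay private.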

Last but not least, older searches have to be re-evaluated (are they successful now?) and may be dropped from the statistics, or their relevance discounted if they were not repeated.
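
One way this re-evaluation might look, as a sketch only: `rescore` and the `has_results` callback are hypothetical names, and the exponential decay with a 30-day half-life is an assumption made for illustration, not something proposed in this discussion.

```python
from datetime import datetime, timedelta


def rescore(terms, now, has_results, half_life=timedelta(days=30)):
    """terms: {term: (count, last_seen)}.

    has_results is a hypothetical callable(term) -> bool that re-runs the
    search today. Terms that now succeed are dropped; the others have their
    count discounted by the time elapsed since they were last searched for.
    """
    scored = {}
    for term, (count, last_seen) in terms.items():
        if has_results(term):
            continue  # the gap has been filled since; drop from the statistics
        age = max(now - last_seen, timedelta(0))
        # Halve the relevance for every half_life that has passed (assumption).
        scored[term] = count * 0.5 ** (age / half_life)
    return scored


now = datetime(2016, 3, 1)
terms = {
    "a": (8, now),                        # searched for today
    "b": (8, now - timedelta(days=30)),   # not repeated for one half-life
    "c": (5, now),                        # now returns results
}
scores = rescore(terms, now, has_results=lambda term: term == "c")
```

With these inputs, "c" is dropped, "a" keeps its full weight, and "b" counts for half as much.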

"I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched."

The privacy aspect is probably more of a separate bug report if this feature could/should be enabled on the wmf cluster (which will probably need some comment from WMF-Legal), whereas this bug is more about implementing the feature in MediaWiki core.

"I believe that privacy concerns should not and do not prohibit the use of failed search query data for the benefit of the projects searched."

Understandable. That said, your belief does not change the privacy policy which we are bound to follow, and therefore also does not change what I have written above. :-)

As you pointed out, there are ways of reducing the risk, but we're not ready to work on this yet because those questions have not been fully resolved and this isn't bubbling up in priority for us.

"The privacy aspect is probably more of a separate bug report if this feature could/should be enabled on the wmf cluster (which will probably need some comment from WMF-Legal)"

The information I gave above about when this data could be released was the result of a consensus between WMF-Legal and Security when I asked them about it, so that bit is (fortunately!) already done.

"whereas this bug is more about implementing the feature in MediaWiki core."

That works for me. At this stage, though, I should point out that the Discovery Department's primary focus is the users of the Wikimedia sites; we would not work on a feature unless we thought it had a reasonable chance of being used on the Wikimedia wikis. Of course, the above does not preclude someone else working on this, but it would mean that the support that Discovery could give would be minimal at best.

When I heard about this extension, the first thing that came to my mind was that if we could get a list of the most-searched keywords on my wiki, we would be able to work on the missing articles. It would be a huge help in getting more readers.

If we store all the user data along with the users then it might be a privacy issue. But if we only store the keywords then I do not think it should be a problem to anyone. To be safe we can add an option in the preferences to 'exclude myself from this feature'.

"If we store all the user data along with the users then it might be a privacy issue. But if we only store the keywords then I do not think it should be a problem to anyone."

Keywords could contain anything, including sensitive information. I've seen it in the logs with my own eyes. Unfortunately, your expectation does not map to the complicated reality of the situation. We should respect the advice of the legal and security experts that caution us about this, because they're right. :-)

As I've already said above, there are strategies that can mitigate this risk, but the Discovery Department cannot prioritise working on them at the expense of other work right now.

"To be safe we can add an option in the preferences to 'exclude myself from this feature'."

That barely improves safety at all. More preferences buried in our complex preference system might make us feel better and let us pat ourselves on the back, but they wouldn't help editors who don't notice the preference, or any readers who choose not to log in. Besides, building out such a feature would represent even more development effort, which, as I've said, Discovery cannot prioritise right now.

Hi!

I wanted to post a recent investigation into this issue by @TJones - here's a snippet of it and the full analysis is here:

"I think the problem with all of these strategies is that so many high-frequency queries would be eliminated by any of them that any useful mining would be down to slogging through the low-impact long tail.

I don’t think there’s a lot here worth extracting, though others may disagree. The privacy concerns expressed earlier are genuine, and simple attempts to filter PII (using patterns, minimum IP counts, etc) are not guaranteed to be effective."

I'd like to close this ticket out - but wanted to get thoughts on it first.

May be worth quoting Dan's summary from the recent wikimedia-l thread:

  • The top 100 zero results queries are dominated by gibberish.
  • There's a long tail of zero results queries, meaning we'd have to reduce many more than the top 100.
  • Manually examining the top zero results queries is not a foolproof way of eliminating personal data since it's arbitrary user input.

Also worth noting: this task is about zero-result searches, not about showing popular searches that are not exact matches for any article title ("redlink searches"). That is another frequent request, and one less affected by privacy concerns. It's tracked in T115085, I think.

Tgr renamed this task from Provide a list of unsuccessful searches to Provide a list of zero-result searches. Jul 29 2016, 10:03 PM

debt writes:

I'd like to close this ticket out - but wanted to get thoughts on it
first.

Please leave it open. This isn't something that WMF is interested in,
but other users of MediaWiki are very much interested in this and don't
have the same long tail problem.

Cool! I'll just remove the Search backlog project tag. Thanks, all!

In T8373#2508574, @MarkAHershberger wrote:

debt writes:

I'd like to close this ticket out - but wanted to get thoughts on it first.

Please leave it open. This isn't something that WMF is interested in, but other users of MediaWiki are very much interested in this and don't have the same long tail problem.

I'd like to suggest closing this ticket again—though I'm happy to leave it open if anyone objects.

A few reasons to close it:

  • It's been 4½ years since @debt last suggested closing it, and nothing has happened.
  • Even if folks outside the WMF are willing to work on the long tail, they don't have access to the data (and unfortunately can't, without NDAs, because of the privacy concerns).
  • As noted in my analysis (also summarized somewhat less dryly in a blog post), there's actually not a lot there of much value, compared to the risk and effort required:
    • Privacy is definitely a concern in top-1000 most frequent queries—names, email addresses, physical addresses, etc., show up in the list.
    • Not only is the tail very long, there isn't really much of a head. Only 281 queries hit the 100-IP threshold in a month (out of ~7M unique queries); the 1000th most frequent query only occurred 48 times.
    • Top queries are often related to porn and internet memes, many of which have been explicitly deemed not notable (based on deleted articles), so they are getting no results "on purpose".
    • A month later, ~70% of the top-100 queries were the same, so there's even less value over time.
    • Almost all of the top-100 typos can be fixed by the completion suggester (suggestions as you type) or the did-you-mean suggestions, except for a couple of poor spellings of "sex video" and "video porno".

Caveats:

  • The analysis is almost 5 years old, but I don't think much has changed. We have an internal dashboard that shows the most frequent queries (not zero-results queries, but all queries) and which groups similar queries (i.e., those that differ by upper- vs lowercase). It shows that the 1000th most frequent query over the last 12 weeks (which is Betty White, so not a zero-results query) only had ~300 occurrences. So, the head is still very thin, and the possibility of abuse still large.
  • This was all done on English Wikipedia, which has the highest query volume of any wiki. The results may vary on smaller wikis, but the possibility of abuse and privacy concerns are still there. Also, it may be hard to get enough data from very small wikis to do anything useful (especially if we have a frequency or IP-count limit).
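
For anyone wanting to repeat this kind of head-versus-tail check on their own wiki's logs, a rough sketch follows. It assumes hypothetical `(query, ip)` pairs and an illustrative `head_stats` helper; it is not the code behind the analysis or the dashboard. Queries that differ only by letter case are grouped, and collapsing runs of whitespace is an extra assumption made here.

```python
from collections import Counter, defaultdict


def head_stats(events, ip_threshold=100, rank=1000):
    """events: iterable of (query, ip) pairs.

    Returns (number of grouped queries seen from at least ip_threshold
    distinct IPs, frequency of the query at the given rank or None).
    """
    freq = Counter()
    ips = defaultdict(set)
    for query, ip in events:
        # Group queries differing only by case; whitespace folding is an assumption.
        key = " ".join(query.split()).casefold()
        freq[key] += 1
        ips[key].add(ip)
    above = sum(1 for key in ips if len(ips[key]) >= ip_threshold)
    ranked = freq.most_common()
    at_rank = ranked[rank - 1][1] if len(ranked) >= rank else None
    return above, at_rank


sample = [
    ("Betty White", "ip1"),
    ("betty white", "ip2"),
    ("BETTY  WHITE", "ip2"),
    ("asdf", "ip3"),
]
above, at_rank = head_stats(sample, ip_threshold=2, rank=2)
```

On this toy sample, one grouped query meets the two-IP threshold and the second-ranked query occurs once. On a real log, a small first number together with a low frequency at rank 1000 would indicate the same thin head described above.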

Due to the privacy issues inherent with exposing this information, as well as the previous analysis done by @TJones that found low potential for value for this line of work, this ticket will be closed.