
Special:NewPageFeed - add option to filter by pageviews
Open, Needs Triage, Public

Description

Add functionality to sort the NewPageFeed by pageview count, so that reviewers can prioritise high-impact articles.

Originally proposed here: https://en.wikipedia.org/wiki/Wikipedia:Page_Curation/Suggested_improvements#42._Filters_by_a_score_of_estimated_public_interest

Event Timeline

Niharika renamed this task from Special:NewPageFeed filter by estimated public interest to Special:NewPageFeed - add option to filter by pageviews. Jan 18 2019, 5:02 AM
Niharika subscribed.
JTannerWMF subscribed.

It appears the CommTech team is working on this.

The Community Tech team has evaluated this request, which included an investigative ticket: T225169. The work presents significant challenges, but there may be an alternative solution. We have posted the following update to Meta-Wiki and Wikipedia, but I'll add the details to this ticket as well:

• First, the challenges (according to analysis from the engineering team): In order to filter or sort by numeric values, the numbers must be stored in the database in a specific manner. This first step alone would take several weeks, if not months, according to the estimates provided by Wikimedia database experts. Then, we would need to populate the sortable cells with pageview data, which comes from an external service. To do this, we would need to create a process that pulls the data from the external service and stores it in MediaWiki’s PageTriage table. We would then need to repeat this work regularly over the entire PageTriage database (which consists of tens of thousands of rows, if not more) so that the numbers remain up to date. This process is both uncommon (on MediaWiki servers) and complex; we would need to define this process and identify the correct way to implement it, in collaboration with Operations and Database experts. Overall, we find that the request, in its current form, is not within our scope. For more details on the technical analysis and discussion with the database administrators, you can check out the associated investigation ticket.
• Second, the alternative solution (as described in the T225169 investigation): We could display the number of pageviews in the article record, without allowing for sorting or filtering. Would this be a satisfactory alternative for the community? And, if so, how would you like the number of pageviews displayed (e.g. average per day, median per day, total views in the last 30 days)? Note that the figures displayed would lag the display time by 24 hours, and we would query at most the previous 30 days (for the sake of general efficiency and manageability of this feature). We do not yet know if we can do this work, but if we could, would it be worth our time and effort, in your opinion?
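To make the display options in the second point concrete, here is a minimal sketch that computes the three candidate figures (total, average per day, median per day) from a list of daily view counts. The function name `summarize_daily_views` and the sample numbers are hypothetical and purely illustrative; they do not come from PageTriage.

```python
from statistics import mean, median


def summarize_daily_views(daily_views, max_days=30):
    """Return candidate display figures for a list of daily view counts.

    `daily_views` is ordered oldest to newest; only the most recent
    `max_days` entries are used, mirroring the 30-day cap mentioned above.
    """
    window = list(daily_views)[-max_days:]
    if not window:
        return {"total": 0, "average_per_day": 0.0, "median_per_day": 0.0}
    return {
        "total": sum(window),
        "average_per_day": mean(window),
        "median_per_day": median(window),
    }


# Illustrative (made-up) week of data with a one-day spike.
sample = [0, 0, 2, 5, 400, 3, 1]
summary = summarize_daily_views(sample)
print(summary)
```

With a spike like the one in the sample, the average is pulled far above a typical day while the median is not, which may matter when choosing which figure to show reviewers.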

Update: We have created a separate ticket for the proposed work below (T230567)

The proposal in T230567 failed to reach consensus, so we'll leave it as an open ticket in case things change at a later date and another team would like to take it on. I'm removing the Community Tech tag from this ticket, as we've now wrapped up the Page Curation Improvements project. More details on the project and its final outcomes can be found on the Page Curation Improvements project page. Thanks!

Alternative: We don't actually need the exact number of page views, just a general sense of popularity. You can get that by storing the ceiling of the base-10 logarithm of the pageviews, ceil(log10(pageviews)), as a page_tag. That way there are a limited number of distinct values in the tag, and the reviewer has a general sense of the popularity of the article.
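A minimal sketch of the bucketing idea, for illustration only. The function name `popularity_bucket` is hypothetical, and treating 0 or 1 views as bucket 0 is an assumption, since the comment above does not say how to handle counts for which the logarithm is undefined or zero. The digit-count trick is used instead of `math.log10` to avoid floating-point rounding at exact powers of ten.

```python
def popularity_bucket(pageviews: int) -> int:
    """Return ceil(log10(pageviews)) as a small integer bucket.

    For n >= 2, ceil(log10(n)) equals the number of decimal digits
    of n - 1, so the result is computed with integer arithmetic only.
    Counts of 0 or 1 are mapped to bucket 0 (an assumption).
    """
    if pageviews <= 1:
        return 0
    return len(str(pageviews - 1))


# Buckets: 2-10 views -> 1, 11-100 -> 2, 101-1000 -> 3, and so on.
for views in (0, 7, 10, 11, 950, 12000):
    print(views, popularity_bucket(views))
```

Even an article with a billion views lands in bucket 9, so the tag would take roughly ten distinct values at most.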

Seems like pageview data is not stored in MediaWiki core's SQL database, but is served by a separate service with its own API: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
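As a rough illustration of what calling that service involves, the sketch below builds a per-article request URL and sums the daily counts in a response body. The URL layout (project, access, agent, article, granularity, start, end) and the `items`/`views` response fields reflect my understanding of the public pageviews REST API and should be checked against the documentation linked above. The helper names are hypothetical, and no network request is made here.

```python
import json
from datetime import date, timedelta
from urllib.parse import quote

# Assumed base path of the per-article pageviews endpoint.
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(title, end_day, days=30, project="en.wikipedia"):
    """Build a daily per-article URL covering `days` days ending at `end_day`."""
    start_day = end_day - timedelta(days=days - 1)
    # Titles use underscores for spaces and must be percent-encoded.
    article = quote(title.replace(" ", "_"), safe="")
    return "/".join([
        AQS_BASE, project, "all-access", "user", article, "daily",
        start_day.strftime("%Y%m%d"), end_day.strftime("%Y%m%d"),
    ])


def total_views(response_body):
    """Sum the `views` field over all items in a JSON response body."""
    items = json.loads(response_body).get("items", [])
    return sum(item["views"] for item in items)


url = build_pageviews_url("Albert Einstein", date(2019, 1, 31))
print(url)
```

Because the data lags by about a day, a caller would typically pass yesterday's date as `end_day`.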

Some of these numbers are exposed to enwiki's &action=info pages (example) via an extension: https://www.mediawiki.org/wiki/Extension:PageViewInfo. Looking at the code in this extension could be helpful if we end up writing code that does something similar.

Sounds like one of the proposed ideas above was to write a maintenance script in PageTriage that calls the above-mentioned API, then stores the data somewhere (probably in PageTriage's pagetriage_page_tags table).
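The shape of such a script might look like the sketch below. This is not PageTriage code: a real maintenance script would be written in PHP against MediaWiki's database layer. Here an in-memory SQLite table named `page_tags` stands in for pagetriage_page_tags, its columns are simplified assumptions, and `fetch_views` is a hypothetical callable that would wrap the API request.

```python
import sqlite3


def refresh_pageview_tags(conn, page_titles, fetch_views):
    """Fetch a view count for each page and upsert it as a 'pageviews' tag.

    page_titles: dict mapping page_id -> title.
    fetch_views: callable taking a title and returning an int; it may raise
                 OSError or KeyError when the lookup fails.
    Returns the number of rows written.
    """
    written = 0
    for page_id, title in page_titles.items():
        try:
            views = fetch_views(title)
        except (OSError, KeyError):
            continue  # keep whatever value was stored on a previous run
        conn.execute(
            "INSERT OR REPLACE INTO page_tags (page_id, tag_name, value) "
            "VALUES (?, ?, ?)",
            (page_id, "pageviews", str(views)),
        )
        written += 1
    conn.commit()
    return written


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_tags (page_id INTEGER, tag_name TEXT, value TEXT, "
    "PRIMARY KEY (page_id, tag_name))"
)
fake_counts = {"Foo": 120, "Bar": 7}  # stands in for the external service
count = refresh_pageview_tags(conn, {1: "Foo", 2: "Bar"}, fake_counts.__getitem__)
print(count)
```

Passing the fetcher in as a parameter keeps the loop testable without network access, and skipping failed lookups means one unavailable article does not abort the whole run.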

The script would then be configured to run at a specified interval via a cron job. Here's an example cron job, to give an idea of how to set it up in Wikimedia production, although that would be the very last step, of course.

Overall the work-to-value ratio seems a bit high to me, so I would not personally work on this. But this could indeed be a good fit for a student, who is less focused on efficiency and more focused on just learning.

Looks like some of the investigations turned up concerns about the SQL query efficiency of storing this info in the pagetriage_page_tags table too. Another thing to watch out for. Could require adding SQL indexes or something.
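To illustrate the kind of check this implies, the sketch below compares the query plan for a tag-value filter before and after adding an index. It uses SQLite as a stand-in, so the planner output will differ from what production MariaDB reports, and the table layout (`page_id`, `tag_id`, `value`) and index name are simplified assumptions rather than the real PageTriage schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_tags (page_id INTEGER, tag_id INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO page_tags VALUES (?, ?, ?)",
    [(i, 1, str(i % 5)) for i in range(1000)],  # made-up rows, 5 distinct values
)

QUERY = "SELECT page_id FROM page_tags WHERE tag_id = ? AND value = ?"


def query_plan(connection):
    """Return SQLite's plan description for QUERY as a single string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (1, "3")).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the detail text


plan_before = query_plan(conn)  # expected to be a full table scan
conn.execute("CREATE INDEX idx_tag_value ON page_tags (tag_id, value)")
plan_after = query_plan(conn)   # expected to mention idx_tag_value

print(plan_before)
print(plan_after)
```

An equality filter like this fits a small set of bucketed values well. If the tag value is stored as text, as in this sketch, a range filter on raw counts would compare strings rather than numbers, which is one reason the storage format matters.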

Change #1069651 had a related patch set uploaded (by Rockingpenny4; author: Rockingpenny4):

[mediawiki/extensions/PageTriage@master] Added functionality to filter articles by pageviews

https://gerrit.wikimedia.org/r/1069651

Test wiki created on Patch demo by Rockingpenny4 using patch(es) linked to this task:
http://patchdemo.wmcloud.org/wikis/ddc2c7f026/w/

Change #1069651 had a related patch set uploaded (by Sohom Datta; author: Rockingpenny4):

[mediawiki/extensions/PageTriage@master] Adds functionality to filter articles by pageviews

https://gerrit.wikimedia.org/r/1069651

Test wiki on Patch demo by Rockingpenny4 using patch(es) linked to this task was deleted:

http://patchdemo.wmcloud.org/wikis/ddc2c7f026/w/

Change #1094569 had a related patch set uploaded (by Rockingpenny4; author: Rockingpenny4):

[integration/config@master] Zuul: [mediawiki/extensions/PageTriage] Add PageViewInfo dependency

https://gerrit.wikimedia.org/r/1094569