% of content read by % of people
Closed, Resolved (Public)

Description

Hey Tilman - not sure if this is the right project tag, so assigning to you, too.

I heard 2nd hand from the One Laptop Per Child community summit recently that 80% of Wikipedia traffic is to 3% of the content. I have no idea where this number came from, but it'd be good to know as we're looking into offline strategies.

Event Timeline

Interesting question! One can think of various ways to make it into a precise hypothesis that can be tested via a query; I think the following should work:
Determine the percentage of total Wikipedia pageviews in November 2016 that went to the 3% most viewed articles (that's 1.281 million articles based on this total number).
I can run a query for this soon.
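For illustration only, a minimal Python sketch of the calculation this hypothesis implies. It assumes per-article view counts are already available as a plain list; the function name `top_share` and the sample numbers are made up and are not the query that was actually run.

```python
def top_share(pageviews, top_fraction):
    """Fraction of all pageviews that goes to the most-viewed
    `top_fraction` of articles (e.g. 0.03 for the top 3%)."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(pageviews, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    # Number of articles in the top group: nearest integer, at least one.
    k = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:k]) / total


# Made-up monthly view counts for ten articles (not real data).
views = [900, 50, 20, 10, 8, 5, 3, 2, 1, 1]
print(top_share(views, 0.1))  # share of views going to the single top article
```

With real data, `top_fraction=0.03` would correspond to the 3% figure in the task description.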

(Of course this is just one way of looking at it. It ignores, for example, that content (text) is not equally distributed among articles (article length probably correlates positively with traffic), and that not every reader reads all the content on a page they access or spends the same amount of time on it.
You may also be interested in the research summarized here:
https://meta.wikimedia.org/wiki/Research:Newsletter/2015/April#Popularity_does_not_breed_quality_.28and_vice_versa.29 .)

Hm.. yeah the most viewed articles is sort of the most practical way to
look at this. We can then think about finding ways to deliver that content
to people in offline contexts, while saving them the data for the rest that
they're unlikely to use.

The thing that's the most interesting to me isn't the 3% specifically, it's
if/where there is a significant drop-off. Would it be possible to do
something like that?

Thanks for explaining more of the background! In that case it is probably more useful to plot the entire distribution of pageviews over article rank, e.g. as a cumulative histogram (y = percentage of pageviews for the top x% of articles). I can do that too; until when would you need it?
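A rough sketch of how such a cumulative curve could be computed, again assuming a plain list of per-article view counts. The helper names and the sample data are hypothetical; plotting is left out so the example needs only the standard library.

```python
from itertools import accumulate


def cumulative_share_curve(pageviews):
    """Return (x, y) points where x is the top fraction of articles by rank
    and y is the fraction of all pageviews those articles receive."""
    ranked = sorted(pageviews, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    if n == 0 or total == 0:
        return []
    return [((i + 1) / n, running / total)
            for i, running in enumerate(accumulate(ranked))]


def smallest_fraction_reaching(curve, target_share):
    """Smallest top-x fraction whose cumulative share is >= target_share."""
    for x, y in curve:
        if y >= target_share:
            return x
    return None


# Made-up monthly view counts for ten articles (not real data).
views = [900, 50, 20, 10, 8, 5, 3, 2, 1, 1]
curve = cumulative_share_curve(views)
print(smallest_fraction_reaching(curve, 0.8))
```

A drop-off would show up as the region where the curve flattens, i.e. where adding more articles barely increases the cumulative share.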

BTW I guess it's not impossible that this question has already been studied in the context of other offline projects like WP 1.0, but the only related work I'm aware of right now is this plot from 2011 by Andrew West who observed a power-law distribution of pageviews on enwiki.

This sounds great, @Tbayer. Thank you. When you say "until when" do you
mean a duration of the data or a deadline? :)

@atgo The latter, see also the recommendations about such Phabricator requests in the Research FAQ ("When it's requested.") Regarding the former, focusing on the data of November 2016 should still work, right?

Given holidays and all hands, does end of January sound reasonable? I'm not sure what all it takes for you or what your other commitments are.

@atgo Sure, end of January works. (It should not be a huge task, although it's not completely trivial either. The question was more about specifying the needs for this data on your side, in particular the point where it would decrease in usefulness ;) See also the Wiktionary developer example in the FAQ.)

@Tbayer do you have a specific question? I think I've given you the timeframe you requested, and not sure what else you're looking for me to pick up from this wiki page.

Yes, November 2016 for data should work.

Hey @Tbayer, thinking about this more: could we do this per language for at least some languages? That would be way more useful than the overall #s.

Yes, restricting it to an individual Wikipedia should be doable with little additional work. Which languages would you be particularly interested in?

@Tbayer How about: en, hi, es, fr, ar for starters?

Another thing we may want to consider: is there a lot of overlap in the
top-read articles across languages? Just wanted to mention this in case it
influences how you approach the work. I think we can do this as a separate
follow-up later.
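One possible way to quantify that overlap, sketched under the assumption that articles in different languages have already been mapped to a shared identifier (for example a Wikidata item ID). The identifiers and counts below are placeholders, not real data.

```python
def top_n_overlap(views_a, views_b, n):
    """Jaccard overlap between the top-n items of two {item_id: views} dicts:
    size of the intersection divided by size of the union."""
    top_a = set(sorted(views_a, key=views_a.get, reverse=True)[:n])
    top_b = set(sorted(views_b, key=views_b.get, reverse=True)[:n])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 0.0


# Placeholder identifiers and view counts for two language editions.
en = {"item1": 500, "item2": 300, "item3": 200, "item4": 10}
es = {"item2": 400, "item5": 350, "item1": 100, "item3": 5}
print(top_n_overlap(en, es, 2))  # one shared item out of three distinct ones
```

A value near 1 would mean the most-read topics are largely the same across the two languages; a value near 0 would argue for per-language offline selections.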

Hey @Tbayer, wanted to bump this. Thoughts on an ETA?

Still end of this month, as per our earlier discussion. (As always, feel free to alert me in case it threatens to become a blocker for your work, and I'll try to prioritize it sooner.)

By the way, coincidentally, @Menner from the German Wikipedia has since posted preliminary results from another related study for dewiki and enwiki. This is not about the exact same question as posed in the task here, but the charts shed some light on the "is there a drop-off" question (see also Menner's write-up for the Kurier newsletter on dewiki, with community discussion, both in German, examining whether the pageviews follow Zipf's law). And it has some concrete data points relevant here: On enwiki, the top 59k articles (around 1.2% of the overall 5 million articles) capture 50% of the pageviews, the top 338k get 80%, and the top 1.46 million get 95%. I suppose @Menner might be able to run this analysis for more wikis and also plot it in the "% of content read by % of people" form, but considering that the method there (online pageview API requests, for lack of direct database access) is apparently quite work-intensive ("five Saturday mornings" so far, with final results for these two wikis to be posted in February or March), I'm still going to go ahead with the database query as planned.
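As a side note on the Zipf's law question, here is a crude sketch of one common check: fit a straight line to log(views) against log(rank) and look at the slope. This is not Menner's method (which is not described here), the function name is made up, and a least-squares fit on log-log data is only a rough indicator, not a rigorous test.

```python
import math


def loglog_slope(pageviews):
    """Least-squares slope of log(views) against log(rank).
    For an ideal Zipf-like law views = C / rank**s the slope is -s."""
    ranked = sorted((v for v in pageviews if v > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two articles with non-zero views")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that follow views = 1_000_000 / rank exactly (s = 1).
synthetic = [1_000_000 / rank for rank in range(1, 1001)]
print(loglog_slope(synthetic))  # close to -1 for this synthetic input
```

On real pageview data, a clearly curved log-log plot (rather than a straight line) would suggest the distribution deviates from a pure power law.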

Thanks for sharing and for the quick reply, @Tbayer. I'll definitely check
out those related reports.

ETA update: I wrote a query for this and tested it successfully on a small dataset, but unfortunately it hit an unexpected Hive timeout issue yesterday when running it over the full data for eswiki. Joseph from Analytics Engineering looked at the problem, and while the reason is not quite clear yet, he recommended simply increasing the MapReduce timeout limit for now, which is what I'm trying out right now. I need to focus on other things until Friday though, so I would like to come back to completing this at the beginning of next week if that works for you @atgo.

Hey @Tbayer! Wanted to check - were you able to get around the timeout issue? If it's proving to be really complex, maybe we can work through another way to think about this and get the information I'm looking for.

Also I'm trying to go back and look at @Menner 's linked reports but only one of the links is still active. I don't speak German, any chance you could help me find the revision that would have this information? I can do a google translate or something to get the gist if I have the content :)

Hey @Tbayer! Wanted to check - were you able to get around the timeout issue?

Unfortunately not; I worked on this subsequently with @JAllemandou and it appears that the timeouts were actually caused by an unresolved bug in Hive (where this particular kind of query results in an endless loop - we had to kill the query after a day or so; by coincidence we also encountered this issue in another analysis recently).

If it's proving to be really complex, maybe we can work through another way to think about this and get the information I'm looking for.

Yes, it was unexpected that the straightforward way to do this fails in this way, and we didn't see an easy workaround. I have another approach in mind which should not take too long to try out (using PAWS Internal instead of calculating things directly in Hive). I should be able to tackle that in the next few days. Thanks for the followup and sorry for not posting an update earlier.

Thanks @Tbayer. I'm really interested to learn more about PAWS and start
using it. Much appreciated.

Also I'm trying to go back and look at @Menner 's linked reports but only one of the links is still active. I don't speak German, any chance you could help me find the revision that would have this information? I can do a google translate or something to get the gist if I have the content :)

The updated (archived) version of the links I posted above:
dewiki chart
enwiki chart
writeup (in German)

The purely statistical view has a major flaw. A large part of Wikipedia consists of regional information about sites and culture. Offline regions cannot produce much traffic, so their content ranks low. So you miss a major aspect of your target group by design.

It depends on what you intend with your offline Wikipedia. For opening a view to the world and to general information, it's a suitable approach. For reaching people with the "Wikipedia spirit" and cultural empowerment, it's less suitable.

A short translation and summary of my diagrams on en.Wikipedia

50% of all page views (Seitenabrufe/Abrufe) go to the top 59k (Seitenrang/Rang) articles, i.e. about 1.2% of all articles in en
80% of all page views go to the top 337k articles, i.e. about 7.3% of all articles in en
96% of all page views go to the top 1445k articles, i.e. about 28% of all articles in en

Articles in the top 59k have at least 25k page views per month
Articles in the top 337k have at least 2.7k page views per month
Articles in the top 1445k have at least 270 page views per month

Regarding numbers for de.Wikipedia

50% of all page views go to the top 30k articles, i.e. about 1.5% of all articles in de
80% of all page views go to the top 153k articles, i.e. about 8.3% of all articles in de
96% of all page views go to the top 618k articles, i.e. about 32% of all articles in de

If you have a specific language in mind, I could make a similar distribution estimate based on random sampling.
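One possible reading of such a sampling-based estimate, sketched with synthetic data: draw a random sample of articles, rank the sample by views, and compute the share going to its top fraction. The function name, the synthetic population, and the sample size are all assumptions for illustration; the actual sampling procedure is not described in this thread.

```python
import random


def sample_top_share(population_views, sample_size, top_fraction, rng):
    """Estimate the share of views going to the top `top_fraction` of
    articles from a uniform random sample of per-article view counts."""
    sample = rng.sample(population_views, sample_size)
    ranked = sorted(sample, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    k = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:k]) / total


# Stand-in for a real wiki: synthetic Zipf-like counts, NOT real data.
population = [1_000_000 // rank for rank in range(1, 100_001)]

# Repeat with different seeds to see how much the estimate varies.
estimates = [sample_top_share(population, 5_000, 0.08, random.Random(seed))
             for seed in range(5)]
print(min(estimates), max(estimates))
```

Because pageviews are heavy-tailed, a few very popular articles dominate the total, so estimates from small samples can vary noticeably depending on whether those articles happen to be drawn.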

Aklapper added a subscriber: Tbayer.

Resetting task assignee as the user is not active here anymore. (Plus wondering what's left in this task to do.)

@atgo this is an old task that looks like it was resolved in 2017; can I close it?

kzimmerman claimed this task.