Background
Some members of the community (most notably Magnus Manske and GLAM folks) have been asking us to provide a pageview API for, as I understand it, about 11 years at this point.
Since 2011, WMF has been releasing pageview data, aggregated hourly by page title and zipped: http://dumps.wikimedia.org/other/pagecounts-ez/merged/. Some volunteers take this data and serve it through their own pageview APIs; the most prominent of these is http://stats.grok.se/. A lot of people rely on that service, but it can be unreliable at times.
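To give a feel for what consuming those dumps looks like today, here is a minimal sketch that sums the views for one title from a downloaded hourly file. It assumes the simple whitespace-separated "project page_title view_count bytes" layout of the raw hourly files; the merged pagecounts-ez files use a more compact per-day encoding, so a real consumer would need format-specific parsing. The file name is hypothetical.

```python
# Illustrative only: reading a gzipped hourly pageview dump, assuming the simple
# "project page_title view_count bytes" layout. The pagecounts-ez merged files
# use a more compact encoding and would need different parsing.
import gzip

def views_for_title(path, project, title):
    """Sum the view counts for one title in one gzipped hourly dump file."""
    total = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            if fields[0] == project and fields[1] == title:
                total += int(fields[2])
    return total

if __name__ == "__main__":
    # Hypothetical local file downloaded from dumps.wikimedia.org
    print(views_for_title("pagecounts-20150101-000000.gz", "en", "Main_Page"))
```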
Recently, we've been getting more and more requests internally for a pageview API. Different readership teams want to analyze different types of pageviews, and some people are talking about serving per-article pageviews as part of the front-end interface. These internal requests are not addressed by the solution proposed below, but we are keeping them in mind.
Proposed Solution
- The main RESTBase instance proxies to our RESTBase cluster, to a new "pageviews" module (done, but waiting on finalized plans before submitting a pull request; see the query sketch after this list)
- Three servers will be needed to run Cassandra and RESTBase (added a hardware-requests task as a blocker)
- This ticket will serve as the coordinating ticket and the one linked to the puppetization change in gerrit
- Hadoop pushes data into Cassandra (this code is done and working; we just need to open the necessary ports once we stand up the servers; see the loading sketch after this list)
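To make the read path concrete, here is a minimal sketch of what a client request through the proxied "pageviews" module could look like. The host, route shape, date format, and response fields are assumptions for illustration; the module's actual URLs are still being finalized.

```python
# Illustrative only: querying a hypothetical per-article pageviews endpoint
# exposed through RESTBase. The base URL, path, and response shape are
# assumptions; the real module's routes are not finalized yet.
import json
import urllib.request

BASE = "https://wikimedia.org/api/rest_v1"  # assumed public RESTBase entry point

def get_daily_views(project, title, start, end):
    # Hypothetical route: /metrics/pageviews/per-article/{project}/{title}/daily/{start}/{end}
    url = f"{BASE}/metrics/pageviews/per-article/{project}/{title}/daily/{start}/{end}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

if __name__ == "__main__":
    data = get_daily_views("en.wikipedia", "Main_Page", "20150701", "20150731")
    for item in data.get("items", []):
        print(item)
```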
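And for the write path, a minimal sketch of the kind of rows the Hadoop job loads into Cassandra, using the DataStax Python driver. The keyspace, table, and column names are assumptions for illustration; the real loader runs as a Hadoop job against the production cluster, not as a standalone script.

```python
# Illustrative only: writing aggregated pageview rows into Cassandra with the
# DataStax Python driver. Keyspace, table, and column names are assumptions;
# the day column is kept as text purely to keep the sketch simple.
from cassandra.cluster import Cluster

def load_rows(hosts, rows):
    cluster = Cluster(hosts)
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS pageviews
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS pageviews.per_article (
            project text,
            article text,
            day text,
            views bigint,
            PRIMARY KEY ((project, article), day)
        )
    """)
    insert = session.prepare(
        "INSERT INTO pageviews.per_article (project, article, day, views) VALUES (?, ?, ?, ?)"
    )
    for project, article, day, views in rows:
        session.execute(insert, (project, article, day, views))
    cluster.shutdown()

if __name__ == "__main__":
    load_rows(["127.0.0.1"], [("en.wikipedia", "Main_Page", "2015-07-01", 1234567)])
```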