Puppetize a server with a role that sets up Cassandra on Analytics machines [13 pts]
Closed, Resolved, Public

Description

Background

Some members of the community (most notably Magnus Manske and GLAM folks) have been asking us to provide a pageview API for, as I understand it, about 11 years now.

Since 2011, WMF has been releasing page view data, aggregated hourly by page title and zipped, at http://dumps.wikimedia.org/other/pagecounts-ez/merged/. Some volunteers take this data and serve it through a pageview API; the most prominent of these is http://stats.grok.se/. A lot of people rely on this service, but it can be unreliable at times.

Recently, we've been getting more and more requests internally for a pageview API. Different readership teams want to analyze different types of pageviews. Some people are talking about serving the pageviews per article as part of the front-end interface. These internal requests are not currently addressed by our solution, but we have them in the back of our minds.

Proposed Solution

  • The main RESTBase instance proxies to our RESTBase cluster, to a new "pageviews" module (done, but waiting on finalized plans before submitting a pull request)
  • Three servers will be needed to run Cassandra and RESTBase (added the hardware-requests task as a blocker)
  • This ticket will serve as the coordinating ticket and the one linked to the puppetization change in Gerrit
  • Hadoop pushes data into Cassandra (this code is done and working; we just need to open the necessary ports once we stand up the servers)

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric triaged this task as Medium priority.
Milimetric updated the task description.
Milimetric added a project: Analytics-Backlog.
Milimetric set Security to None.
Milimetric moved this task from Incoming to Medium on the Analytics-Backlog board.
Milimetric moved this task from Medium to Tasked on the Analytics-Backlog board.
Milimetric added subscribers: Milimetric, Aklapper.

Change 231574 had a related patch set uploaded (by Milimetric):
[WIP] Add an Analytics specific instance of RESTBase

https://gerrit.wikimedia.org/r/231574

Hey DevOps Guys,
As part of this task, we need the Cassandra cluster to be accessible from the Hadoop cluster so that we can load the data.
We would access Cassandra using the CQL native protocol on port 9042.
Thanks!

Change 231574 merged by Ottomata:
Add Analytics Query Service role

https://gerrit.wikimedia.org/r/231574

The network part of the configuration is done. TCP port 9042 on the aqs cluster is accessible to machines in the analytics subnet. However, the ferm firewall on the hosts still needs to be configured.
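
For anyone who wants to check reachability from a machine in the analytics subnet, here is a minimal Python sketch. It only tests whether a TCP connection to the CQL port can be opened; it does not speak the CQL protocol. The function name `can_connect` and the default timeout are illustrative choices, not something taken from the puppet change.

```python
import socket


def can_connect(host: str, port: int = 9042, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This checks network/firewall reachability only; it does not perform
    a CQL handshake or any authentication.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `can_connect("<aqs host>")` from a Hadoop worker should return `True` once both the network rules and the host firewall allow the traffic. A `False` result after the timeout would be consistent with packets still being dropped on the host, though it cannot distinguish that from other causes.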

Change 243635 had a related patch set uploaded (by Alexandros Kosiaris):
aqs: Allow CQL access from analytics

https://gerrit.wikimedia.org/r/243635

Change 243635 merged by Alexandros Kosiaris:
aqs: Allow CQL access from analytics

https://gerrit.wikimedia.org/r/243635