
A public API for Knowledge Gaps Metrics (Internship)
Closed, Resolved · Public

Description

The project in this internship aims to make the Knowledge Gaps datasets publicly available and browsable. The goal is to make metrics for the "knowledge gap index" available via an API and to visualize them in Wikistats, a public tool for browsing aggregated data about the Wikimedia projects, maintained by the Data Engineering team.

The project is divided into 4 main parts:

  • Design the input/output format for the API requests, and format the knowledge gaps metrics data accordingly
  • Export the knowledge gaps data from the Analytics cluster to the Cassandra Cluster
  • Implement an AQS service for serving content gap metrics via the Wikimedia API
  • Add support to the Wikistats tool to visualize content gap metrics

Event Timeline

Today, Dan and I had a long pair programming session.
We spent most of it debugging the test I tried to set up for the dummy knowledge gaps endpoint.

Progress was made, but the issues aren't fully resolved and the tests don't run yet.
Work continues.

Knowledge Gaps Endpoint Tasks

Flow so far:

  • Data exploration of the existing knowledge gaps database, with a particular focus on the content gap metrics table.
  • Design of rudimentary endpoints (this is to be reevaluated and redesigned as data handling gets better)
    • The substeps involved getting familiar with the data and the nested structure of the database table. According to Fabian, we aren't particularly interested in the normalised version of the data.
    • The real metrics are contained in the by_category and totals struct columns of the content gap metrics table. The proposed endpoint of the form /per-category/{project}/{content_gap}/{category}/{metric}/{start}/{end} is a good first step towards a query schema for retrieving data from the table over an API.
  • Moving the data between Hive and Cassandra. This has been the more challenging aspect of this project. Wikimedia's infrastructure is quite complex, and it takes time to understand how data is generated and flows between the various databases and data infrastructure systems across all the projects. In this case, all data is initially written to Hive (presumably during the collection and aggregation process), after which it is moved to Cassandra by prewritten queries automated with Airflow jobs (the DAGs are available in the repository and implemented in Python).
    • I have written a test query to move the data, which will be updated given what I now know about the structure of the content_gap_metrics table in Hive. The update will involve unravelling the struct columns (by_category and totals) into individual columns in Cassandra (see the sketch after this list). This takes advantage of Cassandra's columnar structure and makes for more efficient queries. (This is yet to be tested; Dan will show me how when he is available.)
  • Setting up a test environment on my local machine to try out the AQS endpoints. This step has also been very challenging. The AQS project has a lot of moving parts and some obscure implementation details (I'm sure they are clear to people with more experience with the codebase), which has made setting up the test environment pretty challenging, but progress is being made. So far:
    • Data has been retrieved from the Hive table and written to an SQLite database for local testing.
    • A config.yaml file has been created to set up the environment according to the specifications in the project's README. TODO:
      • Get the config working for local testing (using the SQLite database)
      • Test the API endpoints already designed
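
To make the struct-unravelling step above concrete, here is a minimal PySpark sketch. The table name and the nested field names (category, metric, value, and the identifying columns) are assumptions for illustration only; the real content_gap_metrics schema is more involved.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = (
        SparkSession.builder
        .appName("flatten_content_gap_metrics")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read the (hypothetical) Hive table holding the nested metrics.
    df = spark.table("knowledge_gaps.content_gap_metrics")

    # Assuming by_category is an array of structs, explode it into one row per
    # category and promote the nested fields to top-level columns, so Cassandra
    # can store them as plain columns and serve them efficiently.
    flat = (
        df.select(
            "wiki_db",          # assumed identifying columns
            "content_gap",
            "time_bucket",
            explode("by_category").alias("cat"),
        )
        .select(
            "wiki_db",
            "content_gap",
            "time_bucket",
            col("cat.category").alias("category"),
            col("cat.metric").alias("metric"),
            col("cat.value").alias("value"),
        )
    )

    flat.show(5, truncate=False)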

Checking in.

Update on knowledge gaps endpoints for AQS

  • Significant progress has been made on the knowledge gaps endpoints.
    • Created the endpoint handlers for both the by_category and totals metrics
    • Successfully moved test data from the knowledge-gaps DB to my local machine for testing
    • Set up a local containerized Cassandra instance and loaded the data from the downloaded table into it (a sketch follows after this list)
    • Got AQS up and running locally and tested. It works. SUCCESS!!!
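
For reference, a minimal sketch of that local loading step using the Python cassandra-driver. The keyspace, table, and column names are made up for illustration and are not the production AQS schema.

    from cassandra.cluster import Cluster

    # Connect to the local containerized Cassandra instance
    # (e.g. started with: docker run -p 9042:9042 cassandra).
    cluster = Cluster(["127.0.0.1"], port=9042)
    session = cluster.connect()

    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS knowledge_gaps_local
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)

    # Hypothetical table: as noted in the Learnings below, the lookup fields
    # all appear as partition or clustering keys so the endpoint queries work.
    session.execute("""
        CREATE TABLE IF NOT EXISTS knowledge_gaps_local.by_category (
            project text,
            content_gap text,
            category text,
            metric text,
            dt text,
            value double,
            PRIMARY KEY ((project, content_gap), category, metric, dt)
        )
    """)

    insert = session.prepare("""
        INSERT INTO knowledge_gaps_local.by_category
        (project, content_gap, category, metric, dt, value)
        VALUES (?, ?, ?, ?, ?, ?)
    """)

    # In practice the rows come from the table exported off the Analytics cluster.
    sample_rows = [
        ("en.wikipedia", "gender", "women", "article_count", "2023-06-01", 12345.0),
    ]
    for row in sample_rows:
        session.execute(insert, row)

    cluster.shutdown()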

All these steps may seem small, but getting everything working took a lot longer than intended and was harder than expected.
As I've previously mentioned, Wikimedia's infrastructure can get overwhelming. It is complex, so navigating it as a first-timer is a daunting task. Always have help (I had... a lot).

Learnings:

  • AQS is like a carefully crafted battlefield, and typos are the landmines. Check your spelling, and copy and paste as many times as necessary to avoid typo issues. The multiple layers involved can make typo-caused bugs hard to debug and waste your time.
  • Ensure that your endpoints are well defined in your YAML files. This includes the v1 spec (which RESTBase registers as your URL query path) and the sys spec (which hosts the handlers, i.e., your actual backend).
  • Cassandra can be a pain; make sure you use all the fields in your schema as partition and/or clustering keys. (Note: this is only relevant when testing.)
  • Not much to add for refinery and the airflow jobs (I haven't gotten to testing that yet)

TODOs:

  • Test my HQL script to load the knowledge-gaps data into Cassandra (for the refinery repo)
  • Write and test the Airflow jobs (a generic sketch follows below)
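
For the second TODO, here is a generic Airflow sketch of what such a job could look like. The real DAGs follow the Data Engineering team's own conventions and operators, so the DAG id, schedule, and HQL path below are placeholders, not the actual pipeline.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Placeholder DAG: run a (hypothetical) HQL script that prepares the
    # knowledge gaps data for loading into Cassandra, once per month.
    with DAG(
        dag_id="knowledge_gaps_to_cassandra",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@monthly",
        catchup=False,
    ) as dag:
        load_content_gap_metrics = BashOperator(
            task_id="load_content_gap_metrics",
            bash_command=(
                "spark-sql -f load_content_gap_metrics.hql "
                "-d snapshot={{ ds }}"
            ),
        )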

That's all for now.
Nicholas signing out

Thanks for that update, @nickifeajika. Your Learnings section is great inspiration for how we should write friendlier code. The new AQS, I think, definitely moves away from those kinds of landmines.


It's been a while, and a lot has happened since my last check-in.

To start with, changes had to be made to the Knowledge Gaps Hive database, specifically to the content_gap_metrics table.

The change was motivated by confusion about the totals struct in the table: it was duplicated across categories, and it wasn't clear how it was computed or whether returning the totals data was even useful as a metric.

There were also data pipeline issues around Airflow and ingestion scheduling.

So, what are the changes?

  • A redesign of the knowledge gaps endpoint on AQS. We were able to reduce it to a single endpoint (previously, we had two) and return the totals data from the same endpoint (this was possible thanks to changes made to the ingestion query; see the sketch after this list).
  • Extensive fine-tuning of the ingestion query in the refinery repo. We eliminated the need for a separate job to run: basically, we get all the data we need without an extra DAG in Airflow.
  • Redesigns of the content_gap_metrics table, courtesy of Fabian. These redesigns were needed to clarify the totals struct and its use.
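
To make the single-endpoint design concrete, here is a hedged client-side sketch based on the earlier /per-category/... draft path. The base URL, route, and the idea of requesting totals via a special category value are illustrative assumptions, not the final published API.

    import requests

    # Hypothetical base URL and route shape, following the draft endpoint
    # described earlier in this task; the final published route may differ.
    BASE = "https://wikimedia.org/api/rest_v1/metrics/knowledge-gap"

    def get_metrics(project, content_gap, category, metric, start, end):
        url = f"{BASE}/per-category/{project}/{content_gap}/{category}/{metric}/{start}/{end}"
        resp = requests.get(url, headers={"User-Agent": "knowledge-gaps-example"})
        resp.raise_for_status()
        return resp.json()

    # Per-category metrics...
    by_category = get_metrics(
        "en.wikipedia", "gender", "women", "article_count", "2022-01-01", "2023-01-01"
    )
    # ...and totals from the same endpoint, here assumed to be exposed as a
    # special category value thanks to the reworked ingestion query.
    totals = get_metrics(
        "en.wikipedia", "gender", "all", "article_count", "2022-01-01", "2023-01-01"
    )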

Issues and delays?

  • The only (minor) issue was with the Airflow jobs. The workflow is a bit difficult to understand, but the Data Engineering team has been hard at work refining the process for testing and deploying new DAGs and Airflow jobs.
  • Any other delay was caused mostly by the decision making around the knowledge_gaps data and the endpoints we would use to serve this data to our end users. We could afford this because v1 of the endpoints and the query were basically ready well before the project deadline, which gave us enough time to rethink and refine the process as much as possible. No regrets: we made significant improvements to the final endpoint because of this, and I am confident the folks who use it will appreciate the simplicity.

Here again. It's the last time, I promise.

So, lots of things have happened since my last time here.

To summarise:

  • More changes to the refinery query (none to the endpoints; we're done refining that, no pun intended)
  • The issue with the Airflow job was permissions. Apparently, there was a change in ownership of the cluster we wanted to write to, and that caused some delays. Infrastructure is a hard partner to please.
  • Some necessary schema specifications were mistakenly left out in the AQS repo. Big thanks to Joseph Allemandou for pointing it out to us.
  • My internship ends tomorrow 😁. So this is the last time I'll be writing about my development journey on this project here (see..., I told you it was the last time. I keep my promises)

So what does this mean?

  • Infrastructure delays are unfortunately out of our hands. Given the complexity of Wikimedia's data infrastructure, a lot of consensus is needed before anything goes into production, and the infrastructure hiccups only added to the time delta. This means we didn't get to see the endpoints work in production, even though they've been ready for over a month. So we just kept on tweaking them. These tweaks almost cost us 😅, but we made everything better in the end.
  • The endpoints are good to go, and I have it on good authority that we have also been given the green light for deployment to production.

Hopefully, by this time next week, the endpoints will be in production and available to the public. Use responsibly 🙂

It's been a wild and amazing journey. Thank you for reading my ramblings on here and I hope you all make good use of the data that will soon be made available.

At this juncture, I hang up my boots.

Nicholas, signing out...

fkaelin changed the task status from Open to In Progress.Jul 12 2023, 6:48 PM
fkaelin triaged this task as Medium priority.
fkaelin moved this task from Backlog to In Progress on the Research board.

As the completion of the project is planned outside the scope of the internship, I will close this as resolved.

See the follow-up sub-tasks in T331158