
Create keyspace and table for Knowledge Gaps
Closed, Resolved · Public

Description

Please and Thank you

We're in the process of deploying a new endpoint to AQS 1.0 (because 2.0 is not up and running yet). This is @nickifeajika's internship project. One of the pieces is a Cassandra table. After some trial and error, we believe this schema will satisfy AQS and the queries we need to run against it:

-- Note: replication settings are omitted here and would be decided at creation time
CREATE KEYSPACE "local_group_default_T_knowledge_gap_by_category";

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".meta (
    key text,
    tid timeuuid,
    "_del" timeuuid,
    value text,
    PRIMARY KEY (key, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".data (
    "_domain" text,
    project text,
    category text,
    content_gap text,
    dt text,
    "_tid" timeuuid,
    metric text,
    "_del" timeuuid,
    value double,
    PRIMARY KEY (("_domain", project, category, content_gap), dt, "_tid", metric)
) WITH CLUSTERING ORDER BY (dt ASC, "_tid" DESC, metric ASC);

Our only query against this right now would basically be a SELECT filtering on exact values of _domain, project, category, and content_gap, plus a range of dt.
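That query shape might look like the following sketch (the keyspace and column names come from the DDL above; the `?` bind markers are placeholders for the exact values and the dt range bounds):

```sql
SELECT dt, metric, value
  FROM "local_group_default_T_knowledge_gap_by_category".data
 WHERE "_domain" = ? AND project = ? AND category = ? AND content_gap = ?
   AND dt >= ? AND dt < ?;
```

Since the full partition key ("_domain", project, category, content_gap) is fixed and dt is the first clustering column, this reads a single partition with a contiguous range scan, which is the access pattern the PRIMARY KEY was designed for.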


Acceptance criteria:
  • Schema created on production cluster
  • Capacity projections documented

Event Timeline

Eevans triaged this task as Medium priority.

Ok, this has been created using:

CREATE KEYSPACE "local_group_default_T_knowledge_gap_by_category" WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': '3', 'eqiad': '3'};

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".meta (
    key text,
    tid timeuuid,
    "_del" timeuuid,
    value text,
    PRIMARY KEY (key, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".data (
    "_domain" text,
    project text,
    category text,
    content_gap text,
    dt text,
    "_tid" timeuuid,
    metric text,
    "_del" timeuuid,
    value double,
    PRIMARY KEY (("_domain", project, category, content_gap), dt, "_tid", metric)
) WITH CLUSTERING ORDER BY (dt ASC, "_tid" DESC, metric ASC);
Eevans updated the task description.
KHernandez-WMF moved this task from Staged to Backlog on the Research board.

Can this task be closed as done?

Ideally, part of provisioning a new dataset would be working out capacity planning. We don't have much of a process for that right now, but I'd like to capture something before we close this.

This is a fairly small dataset: as of June 2023, about 12 MB is added per month, with about 2 GB of data in total so far (as Parquet files on HDFS). We do plan on adding additional knowledge-gaps-related metrics in the coming year, which will add ~linearly to the storage requirement; e.g., going from 4 to 8 content gaps requires ~double the storage space.
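A back-of-envelope projection from those figures (a sketch only: the one-year horizon and the doubling factor are assumptions from this comment, and Cassandra's on-disk size will differ from the Parquet size):

```python
# Figures from the task: ~12 MB/month growth, ~2 GB total as of June 2023
# (as Parquet on HDFS). Replication of 3 per datacenter in codfw and eqiad
# comes from the keyspace DDL above.
monthly_growth_mb = 12
current_total_gb = 2.0
content_gap_factor = 2   # assumption: going from 4 to 8 content gaps ~doubles storage
replicas = 3 * 2         # RF=3 in each of two datacenters

# Logical dataset size one year out, if the content gaps are doubled
projected_logical_gb = (current_total_gb + 12 * monthly_growth_mb / 1024) * content_gap_factor

# Total replicated footprint across the cluster (ignoring compression differences)
projected_on_cluster_gb = projected_logical_gb * replicas

print(f"logical: {projected_logical_gb:.2f} GB, on-cluster: {projected_on_cluster_gb:.2f} GB")
```

Even under these assumptions the replicated footprint stays in the tens of gigabytes, which supports the "fairly small dataset" characterization above.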

At what scale of dataset size should we do proper capacity planning?

Following up on this, are there open questions/tasks regarding the creation/support of this dataset?

Thanks. Is this now using AQS 2? It has been a moment; can you point to a current/good example job that writes to an AQS Cassandra dataset from Airflow?

Hi @lbowmaker @Eevans, we see that this task has been resolved. Could you clarify what the next steps are? Has this task been resolved and is this now on AQS 2, or has the task been declined? Thank you so much!