
Create keyspace and table for Knowledge Gaps
Closed, Resolved · Public

Description

Please and Thank you

We're in the process of deploying a new endpoint to AQS 1.0 (because 2.0 is not up and running yet). This is @nickifeajika's internship project. One of the pieces is a Cassandra table. After some trial and error, we believe this schema will satisfy AQS and the queries we need to run against it:

-- Note: replication settings are omitted here and would be decided at creation time
CREATE KEYSPACE "local_group_default_T_knowledge_gap_by_category";

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".meta (
    key text,
    tid timeuuid,
    "_del" timeuuid,
    value text,
    PRIMARY KEY (key, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".data (
    "_domain" text,
    project text,
    category text,
    content_gap text,
    dt text,
    "_tid" timeuuid,
    metric text,
    "_del" timeuuid,
    value double,
    PRIMARY KEY (("_domain", project, category, content_gap), dt, "_tid", metric)
) WITH CLUSTERING ORDER BY (dt ASC, "_tid" DESC, metric ASC);

Our only query against this right now would basically be a SELECT filtering on exact values of _domain, project, category, and content_gap, plus a range of dt.
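That query shape might look like the following sketch (the keyspace and column names come from the DDL above; the `?` bind markers are placeholders for the exact values and the dt range bounds):

```sql
SELECT dt, metric, value
  FROM "local_group_default_T_knowledge_gap_by_category".data
 WHERE "_domain" = ? AND project = ? AND category = ? AND content_gap = ?
   AND dt >= ? AND dt < ?;
```

Since the full partition key ("_domain", project, category, content_gap) is fixed and dt is the first clustering column, this reads a single partition with a contiguous range scan, which is the access pattern the PRIMARY KEY was designed for.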


Acceptance criteria:
  • Schema created on production cluster
  • Capacity projections documented

Event Timeline

Eevans triaged this task as Medium priority.

Ok, this has been created using:

CREATE KEYSPACE "local_group_default_T_knowledge_gap_by_category" WITH replication = {'class': 'NetworkTopologyStrategy', 'codfw': '3', 'eqiad': '3'};

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".meta (
    key text,
    tid timeuuid,
    "_del" timeuuid,
    value text,
    PRIMARY KEY (key, tid)
) WITH CLUSTERING ORDER BY (tid ASC);

CREATE TABLE "local_group_default_T_knowledge_gap_by_category".data (
    "_domain" text,
    project text,
    category text,
    content_gap text,
    dt text,
    "_tid" timeuuid,
    metric text,
    "_del" timeuuid,
    value double,
    PRIMARY KEY (("_domain", project, category, content_gap), dt, "_tid", metric)
) WITH CLUSTERING ORDER BY (dt ASC, "_tid" DESC, metric ASC);
Eevans updated the task description.
KHernandez-WMF moved this task from Staged to Backlog on the Research board.

Can this task be closed as done?

Ideally, part of provisioning a new dataset would be working out capacity planning. We don't have much of a process for that right now, but I'd like to capture something before we close this.

This is a fairly small dataset: as of June 2023, about 12 MB is added per month, with about 2 GB of data in total so far (as Parquet files on HDFS). We do plan on adding additional knowledge-gaps-related metrics in the coming year, which will add ~linearly to the storage requirement; e.g., going from 4 to 8 content gaps requires ~double the storage space.
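A back-of-envelope projection from those figures (a sketch only: the one-year horizon and the doubling factor are assumptions from this comment, and Cassandra's on-disk size will differ from the Parquet size):

```python
# Figures from the task: ~12 MB/month growth, ~2 GB total as of June 2023
# (as Parquet on HDFS). Replication of 3 per datacenter in codfw and eqiad
# comes from the keyspace DDL above.
monthly_growth_mb = 12
current_total_gb = 2.0
content_gap_factor = 2   # assumption: going from 4 to 8 content gaps ~doubles storage
replicas = 3 * 2         # RF=3 in each of two datacenters

# Logical dataset size one year out, if the content gaps are doubled
projected_logical_gb = (current_total_gb + 12 * monthly_growth_mb / 1024) * content_gap_factor

# Total replicated footprint across the cluster (ignoring compression differences)
projected_on_cluster_gb = projected_logical_gb * replicas

print(f"logical: {projected_logical_gb:.2f} GB, on-cluster: {projected_on_cluster_gb:.2f} GB")
```

Even under these assumptions the replicated footprint stays in the tens of gigabytes, which supports the "fairly small dataset" characterization above.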

At what scale of dataset size should we do proper capacity planning?

Following up on this, are there open questions/tasks regarding the creation/support of this dataset?

Thanks. Is this now using AQS 2? It has been a moment; can you point to a current/good example job that writes to an AQS Cassandra dataset from Airflow?

Hi @lbowmaker @Eevans, we see that this task has been resolved. Could you clarify what the next steps are? Has this task been resolved and is this now on AQS 2, or has the task been declined? Thank you so much!