
[SPIKE] Document decision to use a single table per base schema
Closed, Resolved · Public · Spike

Description

Background

Data Products and Data Platform Engineering have arrived at an alternative to dynamically creating/destroying stream configs: We create a stream per Metrics Platform Base Schema and submit all Base-Schema-conforming events to those streams. This decision should be documented for posterity.

AC

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 10 2024, 1:35 PM
Restricted Application added a subscriber: Aklapper.

hi @VirginiaPoundstone @WDoranWMF @mpopov @lbowmaker @Ahoelzl

Does https://wikitech.wikimedia.org/wiki/Metrics_Platform/Decision_Records/Single_Table_Per_Base_Schema align with your understanding of where we're at? Are we all in consensus? If so, I'll update status from DRAFT to accepted.

cc @phuedx @Ottomata @gmodena -- Since I'm quoting your comments on the relevant tickets, presumably you all cosign. But please lmk if anything has changed in recent memory/discussions.

Hi, looks right to me!

The only comment I have is on

Maintain static event stream configuration and create a stream per Metrics Platform base schema

The sentiment is correct, but we should be clear that it is not a requirement or a preference on the Data Engineering side that you have only one stream per schema. You can have more than one stream per schema if that is useful. See https://phabricator.wikimedia.org/T361853#9825680 and the following comments. (A sketch of the static one-stream-per-schema configuration follows below.)

Thank you!
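For illustration, here is a minimal sketch of the one-stream-per-base-schema arrangement as a static mapping. The stream and schema names are modeled on the Metrics Platform base schemas, but the exact names and config fields are assumptions, not the actual EventStreamConfig format.

```python
# Hypothetical static stream configuration: one stream per Metrics Platform
# base schema. Names and fields are illustrative, not real EventStreamConfig
# entries.
STREAMS = {
    "product_metrics.web.base": {"schema_title": "analytics/product_metrics/web/base"},
    "product_metrics.app.base": {"schema_title": "analytics/product_metrics/app/base"},
}


def stream_for_schema(schema_title: str) -> str:
    """Route a base-schema-conforming event to its configured stream."""
    for stream_name, config in STREAMS.items():
        if config["schema_title"] == schema_title:
            return stream_name
    raise KeyError(f"no stream configured for schema {schema_title!r}")


# Example: all events conforming to the web base schema go to one stream.
assert stream_for_schema("analytics/product_metrics/web/base") == "product_metrics.web.base"
```

Note that, per the comment above, more than one stream per schema remains allowed; the single mapping here just reflects the default arrangement.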

it is not a requirement or a preference on the Data Engineering side that you have only one stream per schema. You can have more than one stream per schema if that is useful.

Thanks for pointing that out! Duly noted - I updated the decision record accordingly.

Please add the following to Negative Consequences:

  • The vast majority of interaction data would go into one massive table, which will create significant limitations for how much data we would be able to query with Presto – potentially only an hour at a time, as opposed to the multiple days, weeks, or even months that are possible now with the smaller, per-instrument tables. Depending on how powerful our Presto cluster is, we would likely have to switch to working with interaction data exclusively outside of Superset's SQL Lab, since Presto and Spark SQL differ substantially and require a high degree of effort to translate queries between the two SQL dialects.
  • This will also negatively affect our ability to create Superset dashboards with Presto based on the un-aggregated interaction data, which has become a common practice among Product Analysts. We accept this consequence because the metrics we measure and make available in those dashboards and other reports should be pre-computed with data pipelines (which have access to the more powerful and robust Spark SQL) rather than calculated on the fly with Presto. We can still use Presto, but mainly for working with pre-computed measurements of interaction metrics rather than with raw interaction data (a pre-computation sketch follows below).

And yes, I agree with the recommendation/decision.
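To make the pre-computation point above concrete, here is a minimal PySpark sketch. It assumes a hypothetical event.product_metrics_web_base monotable with year/month/day partitions; all table, column, and output names are illustrative, not the actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("precompute_interaction_metrics").getOrCreate()

# Read one day of raw interaction events from the (hypothetical) monotable.
# Explicit partition predicates keep the scan bounded.
events = spark.table("event.product_metrics_web_base").where(
    "year = 2024 AND month = 6 AND day = 10"
)

# Pre-aggregate into a small per-instrument metrics table. Dashboards then
# query this table with Presto instead of scanning raw interaction data.
daily_metrics = (
    events.groupBy("instrument_name", "action")
    .agg(
        F.count("*").alias("interactions"),
        F.countDistinct("performer.session_id").alias("sessions"),
    )
)

daily_metrics.write.mode("overwrite").saveAsTable("analytics.interaction_metrics_daily")
```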

Please add the following to Negative Consequences:

Added! Thanks for the explicit call-outs.

will create significant limitations for how much data we would be able to query with Presto

If we partition correctly, it shouldn't. T366627#9871868 (We need to verify this though.)
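For context on the partitioning point, here is a sketch of the kind of query that should stay cheap if the monotable keeps explicit time partitions. The table, columns, and partition layout are assumptions pending the verification mentioned above.

```python
# Hypothetical Presto query against the monotable. With predicates on the
# partition columns, the engine prunes the scan to a single day; without
# them, it would scan the entire table (the "an hour at a time" concern).
PARTITION_PRUNED_QUERY = """
SELECT instrument_name,
       COUNT(*) AS interactions
FROM event.product_metrics_web_base
WHERE year = 2024 AND month = 6 AND day = 10
GROUP BY instrument_name
"""
```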

@cjming @Ottomata: Another negative to document (and think about): event sanitization. We can configure sanitization/retention policies on a per-instrument basis since they are different streams/tables, but with the monostream/monotable we would lose that flexibility. Without changing how the current sanitization pipeline works, we would have a single entry in the allowlist for the monotable. We would have to reconsider how we evaluate risk when it comes to retaining sanitized data longer than 90 days.
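To illustrate the flexibility at stake, here is a hypothetical sketch of allowlist entries. The structure and field names are invented for illustration and do not reflect the actual sanitization allowlist format.

```python
# Per-instrument streams: each table can get its own sanitization and
# retention policy.
PER_STREAM_ALLOWLIST = {
    "search_satisfaction": {"keep_fields": ["action", "dt"], "retention_days": 365},
    "reading_depth": {"keep_fields": ["action"], "retention_days": 90},
}

# Monotable: a single entry has to cover every instrument, so the most
# privacy-conservative policy effectively applies to all of them.
MONOTABLE_ALLOWLIST = {
    "product_metrics_web_base": {"keep_fields": ["action", "dt"], "retention_days": 90},
}
```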

Side note: I also think it's worth re-evaluating whether the event sanitization system is a legacy artifact that has outlived its usefulness and can be decommissioned at some point, but that's outside the scope of this decision.

Oh yes! Very good point @mpopov that is true.

We hope to one day (after Refined event tables are on Iceberg) (and maybe after Datasets Config???) take a look at sanitization and retention and refactor them, possibly using in-place updates and deletes via Iceberg. If/when we do that, we should consider this use case.
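For a rough sense of what that refactor could look like, here is a minimal sketch using Spark SQL row-level deletes, which Iceberg supports. The table name, the dt timestamp column, and the 90-day window are assumptions.

```python
from pyspark.sql import SparkSession

# Assumes the monotable has been migrated to Iceberg and is registered in
# the session catalog.
spark = SparkSession.builder.appName("event_retention_sketch").getOrCreate()

# Iceberg supports row-level DELETE, so retention could be enforced with
# in-place deletes instead of rewriting whole partitions.
spark.sql("""
    DELETE FROM event.product_metrics_web_base
    WHERE dt < current_timestamp() - INTERVAL 90 DAYS
""")
```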

Thanks @mpopov @Ottomata - added your comments re: sanitization to decision record.

Seems like now the negative consequences are growing. Presumably everyone is still on board.

@cjming great decision record write-up. It reflects my understanding of the problem space and motivates the decision; I support the decision and moving forward.

Regarding the negative performance implications, there are several options we can follow up on: table partitioning, migration to Iceberg, table-split-on-write, and post-computation of dashboard metrics, so I'm not concerned about that. Also, as has been pointed out, the sanitization system could be revisited for refactoring.

Fantastic - I updated status of the decision record to ACCEPTED. If anyone disagrees, please lmk. Moving this to Sign Off.