
[SPIKE] Document decision to use a single table per base schema
Closed, Resolved · Public · Spike

Description

Background

Data Products and Data Platform Engineering have arrived at an alternative to dynamically creating/destroying stream configs: We create a stream per Metrics Platform Base Schema and submit all Base-Schema-conforming events to those streams. This decision should be documented for posterity.

AC

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Spike". Jun 10 2024, 1:35 PM
Restricted Application added a subscriber: Aklapper.

hi @VirginiaPoundstone @WDoranWMF @mpopov @lbowmaker @Ahoelzl

Does https://wikitech.wikimedia.org/wiki/Metrics_Platform/Decision_Records/Single_Table_Per_Base_Schema align with your understanding of where we're at? Are we all in consensus? If so, I'll update status from DRAFT to accepted.

cc @phuedx @Ottomata @gmodena -- Since I'm quoting your comments on the relevant tickets, presumably you all cosign. But please lmk if anything has changed in recent memory/discussions.

Hi, looks right to me!

The only comment I have is on

Maintain static event stream configuration and create a stream per Metrics Platform base schema

The sentiment is correct, but we should be clear that it is not a requirement or a preference on the Data Engineering side that you have only one stream per schema. You can have more than one stream per schema if that is useful. See https://phabricator.wikimedia.org/T361853#9825680 and the following comments. (A sketch of the static one-stream-per-schema configuration follows below.)

Thank you!
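For illustration, here is a minimal sketch of the one-stream-per-base-schema arrangement as a static mapping. The stream and schema names are modeled on the Metrics Platform base schemas, but the exact names and config fields are assumptions, not the actual EventStreamConfig format.

```python
# Hypothetical static stream configuration: one stream per Metrics Platform
# base schema. Names and fields are illustrative, not real EventStreamConfig
# entries.
STREAMS = {
    "product_metrics.web.base": {"schema_title": "analytics/product_metrics/web/base"},
    "product_metrics.app.base": {"schema_title": "analytics/product_metrics/app/base"},
}


def stream_for_schema(schema_title: str) -> str:
    """Route a base-schema-conforming event to its configured stream."""
    for stream_name, config in STREAMS.items():
        if config["schema_title"] == schema_title:
            return stream_name
    raise KeyError(f"no stream configured for schema {schema_title!r}")


# Example: all events conforming to the web base schema go to one stream.
assert stream_for_schema("analytics/product_metrics/web/base") == "product_metrics.web.base"
```

Note that, per the comment above, more than one stream per schema remains allowed; the single mapping here just reflects the default arrangement.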

it is not a requirement or a preference on the Data Engineering side that you have only one stream per schema. You can have more than one stream per schema if that is useful.

Thanks for pointing that out! Duly noted - I updated the decision record accordingly.

Please add the following to Negative Consequences:

  • The vast majority of interaction data would go into one massive table, which will create significant limitations for how much data we would be able to query with Presto – potentially only an hour at a time, as opposed to the multiple days, weeks, or even months that are possible now with the smaller, per-instrument tables. Depending on how powerful our Presto cluster is, we would likely have to switch to working with interaction data exclusively outside of Superset's SQL Lab, since Presto and Spark SQL differ substantially and require a high degree of effort to translate queries between the two SQL dialects.
  • This will also negatively affect our ability to create Superset dashboards with Presto based on the un-aggregated interaction data, which has become a common practice among Product Analysts. We accept this consequence because the metrics we measure and make available in those dashboards and other reports should be pre-computed with data pipelines (which have access to the more powerful and robust Spark SQL) rather than calculated on the fly with Presto. We can still use Presto, but mainly for working with pre-computed measurements of interaction metrics rather than with raw interaction data (a pre-computation sketch follows below).

And yes, I agree with the recommendation/decision.
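To make the pre-computation point above concrete, here is a minimal PySpark sketch. It assumes a hypothetical event.product_metrics_web_base monotable with year/month/day partitions; all table, column, and output names are illustrative, not the actual schema.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("precompute_interaction_metrics").getOrCreate()

# Read one day of raw interaction events from the (hypothetical) monotable.
# Explicit partition predicates keep the scan bounded.
events = spark.table("event.product_metrics_web_base").where(
    "year = 2024 AND month = 6 AND day = 10"
)

# Pre-aggregate into a small per-instrument metrics table. Dashboards then
# query this table with Presto instead of scanning raw interaction data.
daily_metrics = (
    events.groupBy("instrument_name", "action")
    .agg(
        F.count("*").alias("interactions"),
        F.countDistinct("performer.session_id").alias("sessions"),
    )
)

daily_metrics.write.mode("overwrite").saveAsTable("analytics.interaction_metrics_daily")
```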

Please add the following to Negative Consequences:

Added! Thanks for the explicit call-outs.

will create significant limitations for how much data we would be able to query with Presto

If we partition correctly, it shouldn't. T366627#9871868 (We need to verify this though.)
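For context on the partitioning point, here is a sketch of the kind of query that should stay cheap if the monotable keeps explicit time partitions. The table, columns, and partition layout are assumptions pending the verification mentioned above.

```python
# Hypothetical Presto query against the monotable. With predicates on the
# partition columns, the engine prunes the scan to a single day; without
# them, it would scan the entire table (the "an hour at a time" concern).
PARTITION_PRUNED_QUERY = """
SELECT instrument_name,
       COUNT(*) AS interactions
FROM event.product_metrics_web_base
WHERE year = 2024 AND month = 6 AND day = 10
GROUP BY instrument_name
"""
```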

@cjming @Ottomata: Another negative to document (and think about): event sanitization. We can configure sanitization/retention policies on a per-instrument basis since they are different streams/tables, but with the monostream/monotable we would lose that flexibility. Without changing how the current sanitization pipeline works, we would have a single entry in the allowlist for the monotable. We would have to reconsider how we evaluate risk when it comes to retaining sanitized data longer than 90 days.
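To illustrate the flexibility at stake, here is a hypothetical sketch of allowlist entries. The structure and field names are invented for illustration and do not reflect the actual sanitization allowlist format.

```python
# Per-instrument streams: each table can get its own sanitization and
# retention policy.
PER_STREAM_ALLOWLIST = {
    "search_satisfaction": {"keep_fields": ["action", "dt"], "retention_days": 365},
    "reading_depth": {"keep_fields": ["action"], "retention_days": 90},
}

# Monotable: a single entry has to cover every instrument, so the most
# privacy-conservative policy effectively applies to all of them.
MONOTABLE_ALLOWLIST = {
    "product_metrics_web_base": {"keep_fields": ["action", "dt"], "retention_days": 90},
}
```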

Side note: I also think it's worth re-evaluating whether the event sanitization system is a legacy artifact that has outlived its usefulness and can be decommissioned at some point, but that's outside the scope of this decision.

Oh yes! Very good point @mpopov that is true.

We hope to one day (after Refined event tables are on Iceberg) (and maybe after Datasets Config???) take a look at sanitization and retention and refactor them, possibly using in-place updates and deletes via Iceberg. If/when we do that, we should consider this use case.
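For a rough sense of what that refactor could look like, here is a minimal sketch using Spark SQL row-level deletes, which Iceberg supports. The table name, the dt timestamp column, and the 90-day window are assumptions.

```python
from pyspark.sql import SparkSession

# Assumes the monotable has been migrated to Iceberg and is registered in
# the session catalog.
spark = SparkSession.builder.appName("event_retention_sketch").getOrCreate()

# Iceberg supports row-level DELETE, so retention could be enforced with
# in-place deletes instead of rewriting whole partitions.
spark.sql("""
    DELETE FROM event.product_metrics_web_base
    WHERE dt < current_timestamp() - INTERVAL 90 DAYS
""")
```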

Thanks @mpopov @Ottomata - added your comments re: sanitization to decision record.

Seems like now the negative consequences are growing. Presumably everyone is still on board.

@cjming great decision record write-up. It reflects my understanding of the problem space and motivates the decision; I support the decision and moving forward.

Regarding the negative performance implications, there are several options we can follow up on: table partitioning, migration to Iceberg, table-split-on-write, and post-computation of dashboard metrics, so I'm not concerned about that. Also, as has been pointed out, the sanitization system could be revisited for refactoring.

Fantastic - I updated status of the decision record to ACCEPTED. If anyone disagrees, please lmk. Moving this to Sign Off.