
🔑Deploy Apache Superset for evaluating its capabilities
Closed, ResolvedPublic

Description

We want to evaluate whether Apache Superset can be of use for us when it comes to collecting platform statistics, e.g. T343719

To test this, deploy a Superset instance into our cluster, using the official Helm Chart: https://github.com/apache/superset/tree/master/helm/superset

The deployment should run against a dedicated MariaDB replica, so we'll need to configure the database chart accordingly as well.

Acceptance Criteria:

  • deployed instance of Superset that is not accessible from the public internet without authentication
  • configured to read from a new read-only MariaDB replica

Patches:

Details

Other Assignee
Tarrow

Event Timeline

After looking into this for longer than I expected, I found that the official Helm Chart is straightforward to use. However, I ran into two problems that I don't have a good idea how to solve yet:

Postgres Password needs to be available to init containers

The init containers run by superset on each release upgrade (i.e. anytime _anything_ changes) require the Postgres admin password. Usually, the password would be stored in a k8s secret and then passed to Helm on calling upgrade: https://docs.bitnami.com/general/how-to/troubleshoot-helm-chart-issues/#credential-errors-while-upgrading-chart-releases
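The usual flow described above can be sketched as follows: decode the value stored in the k8s secret and hand it to `helm upgrade` via `--set`. This is only an illustration. The release name, the chart reference and the value key `postgresql.auth.postgresPassword` are assumptions that need to be checked against the chart version in use, and the base64 string is a made-up placeholder rather than a real secret.

```python
import base64


def decode_secret_value(b64_value: str) -> str:
    """Kubernetes stores secret data base64-encoded; decode it to text."""
    return base64.b64decode(b64_value).decode("utf-8")


def build_upgrade_args(release: str, chart: str, value_key: str, password: str) -> list[str]:
    """Build the argv for `helm upgrade`, passing the password via --set."""
    return ["helm", "upgrade", release, chart, "--set", f"{value_key}={password}"]


# Placeholder values. In practice the base64 string would come from something
# like `kubectl get secret <name> -o jsonpath='{.data.<key>}'`.
args = build_upgrade_args(
    "superset",                          # assumed release name
    "superset/superset",                 # assumed chart reference
    "postgresql.auth.postgresPassword",  # assumed value key, check the chart's values
    decode_secret_value("aHVudGVyMg=="),
)
```

Passing `args` to `subprocess.run` would perform the upgrade. Note that `--set` treats commas as separators, so a password containing commas would need escaping.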

This does not work with helmfile: adding the --set parameter to helmDefaults.args passes it to _all_ commands, and some of them (e.g. helm list) do not support the flag, so the deployment fails.

It seems to be possible to put the password in the values file itself, but that would mean storing the Postgres admin password in the values file in plain text.

Right now, I am not sure what to do about this. Superset itself also supports using a SQLite database (which I think would be more than enough for us, as it only stores user accounts and queries / dashboards), but this is not supported by the official Helm Chart.
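On the Superset side, the metadata database is selected through `SQLALCHEMY_DATABASE_URI` in `superset_config.py`, so switching to SQLite is a small change there. A minimal sketch, assuming a persistent volume mounted at the hypothetical path `/app/superset_home` and a hypothetical `SUPERSET_DATA_DIR` environment variable; getting the chart to stop expecting Postgres is a separate problem that this does not address:

```python
import os

# Hypothetical env var and default path; point this at a persistent volume.
DATA_DIR = os.environ.get("SUPERSET_DATA_DIR", "/app/superset_home")

# SQLAlchemy SQLite URL: "sqlite:///" followed by the (absolute) file path.
SQLALCHEMY_DATABASE_URI = "sqlite:///" + os.path.join(DATA_DIR, "superset.db")
```

Because SQLite is a single file, this presumably only makes sense with a single Superset replica writing to that volume.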

MariaDB Helm Chart does not support multiple dedicated secondaries

As specified in this ticket, we wanted to run queries from Superset against a dedicated secondary so that we do not overload the production systems if someone on the team unknowingly writes a heavy query. This, however, is not supported by the Bitnami Helm Chart we are using. The only thing we could do is raise the number of replicas backing the secondary service, but these would still share the same service as the entry point for all queries. It's also not possible to deploy another release of that chart that only deploys a secondary.

Things we might need to do here are:

  • Deploy a standalone replica using some TBD Helm Chart
  • Live with the fact that we put extra load on production
Fring changed the task status from Open to Stalled. Mar 21 2024, 8:48 AM

Stalling this as it needs further discussion / refinement.

Fring removed Fring as the assignee of this task. Mar 21 2024, 8:48 AM
Fring moved this task from Doing to To do on the Wikibase Cloud (Kanban board Q1 2024) board.
Tarrow subscribed.
Anton.Kokh renamed this task from Deploy Apache Superset for evaluating its capabilities to 🔑Deploy Apache Superset for evaluating its capabilities. May 29 2024, 11:36 AM
Tarrow changed the task status from Stalled to Open. Jun 20 2024, 8:23 AM

I'm assuming some discussion happened about this while I was away, but just to reiterate what I heard in the daily:

  • we will try to ship at least some trial version of this so that we can get a feel for the tool
  • we will use SQLite for the internal DB in this test case. To do this we will presumably either adjust the official Superset Helm Chart or create our own from scratch
  • we will point Superset at our existing SQL replica and presumably we will tell its users to be extra careful not to accidentally overload this production replica
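Beyond asking users to be careful, Superset has a few settings in `superset_config.py` that cap how long SQL Lab queries run and how many rows they return. A hedged sketch with illustrative values (not recommendations); these limits are enforced on the Superset side, so they reduce the risk but may not stop a heavy query that is already executing on the MariaDB replica:

```python
# Fragment for superset_config.py. All numbers are placeholders to be tuned.

# Seconds before a synchronous SQL Lab query is timed out.
SQLLAB_TIMEOUT = 30

# Time limit in seconds for asynchronous SQL Lab queries (only relevant if
# async execution is set up).
SQLLAB_ASYNC_TIME_LIMIT_SEC = 5 * 60

# Row limit pre-filled for SQL Lab queries.
DEFAULT_SQLLAB_LIMIT = 1000

# Upper bound on rows returned by a query.
SQL_MAX_ROW = 10000
```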