
Scale up metricsinfra prometheus beyond one Prometheus instance (Thanos/Cortex/similar)
Closed, Resolved (Public)

Description

A single Prometheus instance is a SPOF, and might not even be able to monitor all the instances. This task (likely going to be split into subtasks) is to investigate and build out an HA/scaled-up configuration, possibly using some of these tools:

  • Thanos (used on wmf prod too)
  • Cortex (advertises "multi-tenancy")

Note that this task is purely about the ability to query data from multiple sources; data storage is still handled on the individual Prometheus nodes. While Thanos Store looks promising, it requires an object store (such as Swift) and is considered Future Work.
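
As a rough illustration of what "querying data from multiple sources" involves, the sketch below merges instant-query results from several Prometheus nodes and collapses series that differ only in a replica label. It is a simplified model for discussion, not how Thanos Query implements deduplication; the function name `merge_replicas` and the label name `replica` are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[Labels, float]  # (label set, value) of one instant-query result


def merge_replicas(
    results: List[List[Sample]], replica_label: str = "replica"
) -> List[Sample]:
    """Merge instant-query results returned by several Prometheus nodes.

    Series that differ only in `replica_label` (an assumed label name) are
    treated as the same series; the first value seen is kept.
    """
    merged: Dict[Tuple[Tuple[str, str], ...], Sample] = {}
    for node_result in results:
        for labels, value in node_result:
            # Drop the replica label so that HA pairs map to the same key.
            stripped = {k: v for k, v in labels.items() if k != replica_label}
            key = tuple(sorted(stripped.items()))
            merged.setdefault(key, (stripped, value))
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical results from two nodes scraping overlapping targets.
    node_a = [({"__name__": "up", "instance": "host1", "replica": "a"}, 1.0)]
    node_b = [
        ({"__name__": "up", "instance": "host1", "replica": "b"}, 1.0),
        ({"__name__": "up", "instance": "host2", "replica": "b"}, 0.0),
    ]
    for labels, value in merge_replicas([node_a, node_b]):
        print(labels, value)
```

Keeping the first value seen is the simplest possible policy. A real query layer also has to handle time ranges, gaps in one replica, and partial failures of a source.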

Event Timeline

taavi triaged this task as Medium priority. Jul 7 2021, 6:40 PM
taavi created this task.

Cortex's multi-tenancy seems to require separate Prometheus instances per "tenant". With that in mind, I'm leaning towards using Thanos for scaling metricsinfra up, mainly because it's already used and tested in production and metricsinfra could possibly re-use parts of its puppetisation.

Change 806551 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:wmcs::metricsinfra::prometheus: enable thanos sidecar

https://gerrit.wikimedia.org/r/806551

Change 806552 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:metricsinfra: add thanos query

https://gerrit.wikimedia.org/r/806552

Change 806553 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:metricsinfra::haproxy: add thanos routing

https://gerrit.wikimedia.org/r/806553

Change 806551 merged by David Caro:

[operations/puppet@production] P:wmcs::metricsinfra::prometheus: enable thanos sidecar

https://gerrit.wikimedia.org/r/806551

Change 806552 merged by David Caro:

[operations/puppet@production] P:metricsinfra: add thanos query

https://gerrit.wikimedia.org/r/806552

Change 806553 merged by David Caro:

[operations/puppet@production] P:metricsinfra::haproxy: add thanos routing

https://gerrit.wikimedia.org/r/806553

Change 850629 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P:metricsinfra: add thanos rule

https://gerrit.wikimedia.org/r/850629

Change 850629 merged by David Caro:

[operations/puppet@production] P:metricsinfra: add thanos rule

https://gerrit.wikimedia.org/r/850629