
Audit/consider enabling CPU performance governor on DPE SRE-owned hosts
Open, Medium, Public

Description

Per T336443 (and from reading T225713, T315398, T338944, and T328957), we've noticed that enabling the CPU performance governor on our R450 WDQS hosts significantly improved triples ingestion speed, which can be considered a proxy for single-threaded performance. On the downside, enabling this feature is associated with increased power consumption.

Creating this ticket to audit the DPE SRE fleet for CPU performance governor settings and decide whether or not to enable them.
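
For reference, auditing a single host comes down to a couple of sysfs reads plus the cpufrequtils config; a minimal sketch (standard Linux cpufreq paths, output will vary by driver):

# Governor currently in effect on each core (runtime state).
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Governors this host's cpufreq driver supports.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# What the Debian cpufrequtils package will apply at boot, if configured.
cat /etc/default/cpufrequtils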

Event Timeline

Gehel triaged this task as Medium priority. Apr 23 2024, 12:19 PM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.

I just checked whether the dumps snapshot hosts have the performance governor enabled and it appears that they do:

btullis@cumin1002:~$ sudo cumin --no-progress A:snapshot 'cat /etc/default/cpufrequtils'
10 hosts will be targeted:
snapshot[1008-1017].eqiad.wmnet
OK to proceed on 10 hosts? Enter the number of affected hosts to confirm or "q" to quit: 10
===== NODE GROUP =====
(10) snapshot[1008-1017].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (10/10) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (10/10) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

This makes sense, since they are already attempting to max out each CPU, more or less. Had it not been applied already, I would have suggested that we enable it for these servers.
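
Note that /etc/default/cpufrequtils only shows what will be applied at boot; to confirm the governor actually in effect we could read sysfs across the same hosts, along the lines of (a sketch reusing the A:snapshot alias above):

sudo cumin --no-progress A:snapshot 'sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'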

Also, all 8 of the worker nodes in the dse-k8s cluster are using it:

btullis@cumin1002:~$ sudo cumin --no-progress A:dse-k8s-worker 'cat /etc/default/cpufrequtils'
8 hosts will be targeted:
dse-k8s-worker[1001-1008].eqiad.wmnet
OK to proceed on 8 hosts? Enter the number of affected hosts to confirm or "q" to quit: 8
===== NODE GROUP =====
(8) dse-k8s-worker[1001-1008].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (8/8) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (8/8) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

...plus all of our Ceph servers:

btullis@cumin1002:~$ sudo cumin --no-progress A:cephosd 'cat /etc/default/cpufrequtils'
5 hosts will be targeted:
cephosd[1001-1005].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) cephosd[1001-1005].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

So I guess the most important hosts to think about are:

  • Hadoop workers
  • Hadoop namenodes
  • Stats servers
  • Kafka-jumbo brokers
  • Presto workers
  • Druid workers (analytics and public clusters)
  • Elasticsearch cluster nodes
  • MariaDB
    • analytics_meta (an-mariadb100[1-2])
    • mediawiki private replicas (dbstore100[7-9])
    • wikireplicas (clouddb10[13-21])
    • wikireplica private analytics (clouddb1021 -> being refreshed by an-redacteddb1001)
  • PostgreSQL (an-db100[1-2])

I think that the Elasticsearch and Hadoop use cases are probably the most compelling. I'd be happy to roll out the governor settings to these hosts, but we should check with DC-Ops to see if they're OK with the increased power budget before doing so.
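
On the power-budget question, a rough per-host before/after reading could also be taken from the BMC, in addition to the PDU/Grafana data; a sketch assuming ipmitool is installed and the BMC supports DCMI:

# Instantaneous and rolling-average chassis power draw as reported by the BMC.
sudo ipmitool dcmi power reading

# Take readings under typical load with the current governor, switch, and compare.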

Change #1035534 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: enable CPU performance governor

https://gerrit.wikimedia.org/r/1035534

@wiki_willy: We should be aware of this as it will affect 50 servers in both codfw and eqiad, and we may see an increase in power usage across these fleets.

Change #1035534 abandoned by Bking:

[operations/puppet@production] elasticsearch: enable CPU performance governor

Reason:

needs a phased rollout

https://gerrit.wikimedia.org/r/1035534

@RobH @wiki_willy

After pondering this change a bit more, I've decided to abandon the above patch in favor of a more phased rollout. Crucially, we need to verify this will give us enough of a performance benefit to justify the extra power usage.
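
To quantify that benefit, a quick single-threaded benchmark on one briefly depooled canary host under each governor would probably be enough; a sketch assuming the cpufrequtils and sysbench packages are available on the host:

# Baseline: note the current governor, then run a single-threaded CPU benchmark.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
sysbench cpu --threads=1 --time=30 run

# Switch every core to the performance governor (runtime only, not persistent).
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Re-run and compare the events/second figures.
sysbench cpu --threads=1 --time=30 run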

There are also two different classes of servers to consider:

  • sudo cumin A:elastic 'w'

elastic[2055-2109].codfw.wmnet,elastic[1053-1107].eqiad.wmnet

  • cumin A:hadoop-all 'w'

an-coord[1003-1004].eqiad.wmnet,an-master[1003-1004].eqiad.wmnet,an-worker[1078-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet

@BTullis is a better contact for these hosts.

I'll get a ticket started to cover each class of servers and CC y'all on both. Sorry for the confusion!

Thanks for the heads up @bking. I went ahead and checked Netbox, just to ensure all the servers were dispersed pretty evenly across the different racks...which they are (listed below is the rack and the quantity of servers in each rack). For reference, the bolded line items are the racks that are currently pulling a bit more on power. We could do a before and after snapshot using Grafana (https://grafana.wikimedia.org/d/f64mmDzMz/power-usage?orgId=1&from=now-30d&to=now), though I have a feeling we should still be OK with the increased power.

@Papaul, @Jhancock.wm, @VRiley-WMF, and @Jclark-ctr for visibility

elastic[2055-2109]

  • A2 - 5
  • A4 - 3
  • A7 - 6
  • B2 - 8
  • B4 - 5
  • B7 - 2
  • C4 - 6
  • C7 - 5
  • D2 - 2
  • D4 - 5
  • D7 - 6

elastic[1053-1107]

  • A4 - 5
  • A7 - 4
  • B2 - 3
  • B4 - 4
  • B7 - 3
  • C4 - 2
  • C7 - 7
  • D2 - 2
  • D4 - 4
  • D6 - 1
  • D7 - 2
  • E1 - 3
  • E2 - 2
  • E3 - 2

an-coord[1003-1004]

  • E1 - 1
  • F1 - 1

an-master[1003-1004]

  • C2 - 1
  • D7 - 1

an-worker[1078-1175]

  • A2 - 6
  • A4 - 6
  • A7 - 7
  • B2 - 5
  • B4 - 5 (currently pulling more power)
  • B7 - 5
  • C2 - 6
  • C4 - 7
  • C7 - 5
  • D2 - 4
  • D4 - 6
  • D7 - 6
  • E1 - 3
  • E2 - 1
  • E3 - 1
  • E5 - 3
  • E6 - 3
  • E7 - 3
  • F1 - 3
  • F2 - 1
  • F3 - 2
  • F5 - 3
  • F6 - 3
  • F7 - 3

analytics[1070-1077]

  • A5 - 2
  • B3 - 1
  • B7 - 1
  • C3 - 1
  • C7 - 1
  • D2 - 1
  • D7 - 1

Change #1063237 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

Change #1063237 merged by Bking:

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

As mentioned in T372416, we've had some issues with stat hosts becoming unresponsive due to high load. After a little testing on stat1010 (which caused load to drop precipitously), I enabled the performance governor on all stat hosts. Based on the 5x drop in load and my previous experience with the governor in T336443, I believe this should significantly increase performance.

stat1010_load.png (attached graph showing the load drop on stat1010, 130 KB)
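
For anyone repeating this kind of single-host test, the change Puppet ultimately rolls out amounts to something like the following (a sketch assuming the Debian cpufrequtils package, which is what the earlier audits were reading; normally this is managed by Puppet rather than by hand):

# Persist the governor choice and apply it.
echo 'GOVERNOR=performance' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtils

# Sanity-check the runtime value and the resulting core clocks.
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
grep 'cpu MHz' /proc/cpuinfo | sort -u | head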

Change #1072529 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable the performance CPU governor on Hadoop workers

https://gerrit.wikimedia.org/r/1072529


Just checking, @wiki_willy, are you still happy for me to enable the performance governor on the Hadoop workers (an-worker1*|analytics1*)?

I have created a patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072529

...but I wanted to double-check with you before rolling it out. Do you need us to do a before/after snapshot of the power usage?

an-worker[1078-1175]

  • A2 - 6
  • A4 - 6
  • A7 - 7
  • B2 - 5
  • B4 - 5 (currently pulling more power)
  • B7 - 5
  • C2 - 6
  • C4 - 7
  • C7 - 5
  • D2 - 4
  • D4 - 6
  • D7 - 6
  • E1 - 3
  • E2 - 1
  • E3 - 1
  • E5 - 3
  • E6 - 3
  • E7 - 3
  • F1 - 3
  • F2 - 1
  • F3 - 2
  • F5 - 3
  • F6 - 3
  • F7 - 3

analytics[1070-1077]

  • A5 - 2
  • B3 - 1
  • B7 - 1
  • C3 - 1
  • C7 - 1
  • D2 - 1
  • D7 - 1

@Jclark-ctr and @VRiley-WMF - can you confirm if we're ok with the Data Platform team increasing power on the hosts listed above? Thanks, Willy

@wiki_willy @BTullis Looking over the racks, the only one I am concerned about is B4-eqiad. We are already getting over-power alerts for it. We would need to move some servers out of that rack.


Thanks @Jclark-ctr - If it helps, we could decommission an-presto1004 pretty much immediately, as it is due for a refresh in T374924: Bring an-presto10[16-20] into service to replace an-presto100[1-5] anyway.
Do you think that this would stop the power alerts for that rack, or will we definitely need to do more?

The next on the list would probably be an-worker1085, which is also a little overdue for decom in T353784: Decommission an-worker10[78-95] & an-worker1116.
We'd rather keep this running for a little while longer, if possible, but it's not critical that we do so.

@Jclark-ctr - I have re-worked the patch so that the five an-worker servers in eqiad B4 don't receive the updated CPU governor configuration, so there should be no change to the power consumption in that rack.
Are you happy for us to go ahead with this change to the other servers sometime next week?

Change #1090900 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] datahub: leverage liveness and readiness probes for the gms and consumers

https://gerrit.wikimedia.org/r/1090900

Change #1090900 merged by Brouberol:

[operations/deployment-charts@master] datahub: leverage liveness and readiness probes for the gms and consumers

https://gerrit.wikimedia.org/r/1090900

Change #1072529 merged by Btullis:

[operations/puppet@production] Enable the performance CPU governor on Hadoop workers

https://gerrit.wikimedia.org/r/1072529

Mentioned in SAL (#wikimedia-analytics) [2024-11-25T13:55:36Z] <btullis> enabled the performance CPU governor across the Hadoop cluster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072529 for T362922
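
A post-rollout check along the lines of the earlier audits should show the new setting everywhere except the excluded eqiad B4 workers, since cumin groups hosts by identical output; a sketch using the A:hadoop-all alias referenced earlier in this task:

sudo cumin --no-progress A:hadoop-all 'cat /etc/default/cpufrequtils 2>/dev/null; sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'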