
Audit/consider enabling CPU performance governor on DPE SRE-owned hosts
Open, Medium, Public

Description

Per T336443 (and from reading T225713, T315398, T338944, and T328957), we've noticed that enabling the CPU performance governor on our R450 WDQS hosts significantly improved triples ingestion speed, which can be considered a proxy for single-threaded performance. On the downside, enabling this feature is associated with increased power consumption.

Creating this ticket to audit the DPE SRE fleet for CPU performance governor settings and decide whether or not to enable them.
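
For reference, auditing a single host comes down to a couple of sysfs reads plus the cpufrequtils config; a minimal sketch (standard Linux cpufreq paths, output will vary by driver):

# Governor currently in effect on each core (runtime state).
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c

# Governors this host's cpufreq driver supports.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

# What the Debian cpufrequtils package will apply at boot, if configured.
cat /etc/default/cpufrequtils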

Event Timeline

Gehel triaged this task as Medium priority. Apr 23 2024, 12:19 PM
Gehel moved this task from Incoming to Toil / Automation on the Data-Platform-SRE board.

I just checked whether the dumps snapshot hosts have the performance governor enabled and it appears that they do:

btullis@cumin1002:~$ sudo cumin --no-progress A:snapshot 'cat /etc/default/cpufrequtils'
10 hosts will be targeted:
snapshot[1008-1017].eqiad.wmnet
OK to proceed on 10 hosts? Enter the number of affected hosts to confirm or "q" to quit: 10
===== NODE GROUP =====
(10) snapshot[1008-1017].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (10/10) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (10/10) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

This makes sense, since they are already attempting to max out each CPU, more or less. Had it not been applied already, I would have suggested that we enable it for these servers.
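
Note that /etc/default/cpufrequtils only shows what will be applied at boot; to confirm the governor actually in effect we could read sysfs across the same hosts, along the lines of (a sketch reusing the A:snapshot alias above):

sudo cumin --no-progress A:snapshot 'sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'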

Also, all 8 of the worker nodes in the dse-k8s cluster are using it:

btullis@cumin1002:~$ sudo cumin --no-progress A:dse-k8s-worker 'cat /etc/default/cpufrequtils'
8 hosts will be targeted:
dse-k8s-worker[1001-1008].eqiad.wmnet
OK to proceed on 8 hosts? Enter the number of affected hosts to confirm or "q" to quit: 8
===== NODE GROUP =====
(8) dse-k8s-worker[1001-1008].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (8/8) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (8/8) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

...plus all of our Ceph servers:

btullis@cumin1002:~$ sudo cumin --no-progress A:cephosd 'cat /etc/default/cpufrequtils'
5 hosts will be targeted:
cephosd[1001-1005].eqiad.wmnet
OK to proceed on 5 hosts? Enter the number of affected hosts to confirm or "q" to quit: 5
===== NODE GROUP =====
(5) cephosd[1001-1005].eqiad.wmnet
----- OUTPUT of 'cat /etc/default/cpufrequtils' -----
GOVERNOR=performance
================
100.0% (5/5) success ratio (>= 100.0% threshold) for command: 'cat /etc/default/cpufrequtils'.
100.0% (5/5) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

So I guess the most important hosts to think about are:

  • Hadoop workers
  • Hadoop namenodes
  • Stats servers
  • Kafka-jumbo brokers
  • Presto workers
  • Druid workers (analytics and public clusters)
  • Elasticsearch cluster nodes
  • MariaDB
    • analytics_meta (an-mariadb100[1-2])
    • mediawiki private replicas (dbstore100[7-9])
    • wikireplicas (clouddb10[13-21])
    • wikireplica private analytics (clouddb1021 -> being refreshed by an-redacteddb1001)
  • PostgreSQL (an-db100[1-2])

I think that the Elasticsearch and Hadoop use cases are probably the most compelling. I'd be happy to roll out the governor settings to these hosts, but we should check with DC-Ops to see if they're OK with the increased power budget before doing so.
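
On the power-budget question, a rough per-host before/after reading could also be taken from the BMC, in addition to the PDU/Grafana data; a sketch assuming ipmitool is installed and the BMC supports DCMI:

# Instantaneous and rolling-average chassis power draw as reported by the BMC.
sudo ipmitool dcmi power reading

# Take readings under typical load with the current governor, switch, and compare.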

Change #1035534 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] elasticsearch: enable CPU performance governor

https://gerrit.wikimedia.org/r/1035534

@wiki_willy: We should be aware of this as it will affect 50 servers in both codfw and eqiad, and we may see an increase in power usage across these fleets.

Change #1035534 abandoned by Bking:

[operations/puppet@production] elasticsearch: enable CPU performance governor

Reason:

needs a phased rollout

https://gerrit.wikimedia.org/r/1035534

@RobH @wiki_willy

After pondering this change a bit more, I've decided to abandon the above patch in favor of a more phased rollout. Crucially, we need to verify this will give us enough of a performance benefit to justify the extra power usage.
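
To quantify that benefit, a quick single-threaded benchmark on one briefly depooled canary host under each governor would probably be enough; a sketch assuming the cpufrequtils and sysbench packages are available on the host:

# Baseline: note the current governor, then run a single-threaded CPU benchmark.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
sysbench cpu --threads=1 --time=30 run

# Switch every core to the performance governor (runtime only, not persistent).
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Re-run and compare the events/second figures.
sysbench cpu --threads=1 --time=30 run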

There are also two different classes of servers to consider:

  • sudo cumin A:elastic 'w'

elastic[2055-2109].codfw.wmnet,elastic[1053-1107].eqiad.wmnet

  • cumin A:hadoop-all 'w'

an-coord[1003-1004].eqiad.wmnet,an-master[1003-1004].eqiad.wmnet,an-worker[1078-1175].eqiad.wmnet,analytics[1070-1077].eqiad.wmnet

@BTullis is a better contact for these hosts.

I'll get a ticket started to cover each class of servers and CC y'all on both. Sorry for the confusion!

Thanks for the heads up @bking. I went ahead and checked Netbox, just to ensure all the servers were dispersed pretty evenly across the different racks...which they are (listed below is the rack and the quantity of servers in each rack). For reference, the bolded line items are the racks that are currently pulling a bit more on power. We could do a before and after snapshot using Grafana (https://grafana.wikimedia.org/d/f64mmDzMz/power-usage?orgId=1&from=now-30d&to=now), though I have a feeling we should still be OK with the increased power.

@Papaul, @Jhancock.wm, @VRiley-WMF, and @Jclark-ctr for visibility

elastic[2055-2109]

  • A2 - 5
  • A4 - 3
  • A7 - 6
  • B2 - 8
  • B4 - 5
  • B7 - 2
  • C4 - 6
  • C7 - 5
  • D2 - 2
  • D4 - 5
  • D7 - 6

elastic[1053-1107]

  • A4 - 5
  • A7 - 4
  • B2 - 3
  • B4 - 4
  • B7 - 3
  • C4 - 2
  • C7 - 7
  • D2 - 2
  • D4 - 4
  • D6 - 1
  • D7 - 2
  • E1 - 3
  • E2 - 2
  • E3 - 2

an-coord[1003-1004]

  • E1 - 1
  • F1 - 1

an-master[1003-1004]

  • C2 - 1
  • D7 - 1

an-worker[1078-1175]

  • A2 - 6
  • A4 - 6
  • A7 - 7
  • B2 - 5
  • B4 - 5 (currently pulling more power)
  • B7 - 5
  • C2 - 6
  • C4 - 7
  • C7 - 5
  • D2 - 4
  • D4 - 6
  • D7 - 6
  • E1 - 3
  • E2 - 1
  • E3 - 1
  • E5 - 3
  • E6 - 3
  • E7 - 3
  • F1 - 3
  • F2 - 1
  • F3 - 2
  • F5 - 3
  • F6 - 3
  • F7 - 3

analytics[1070-1077]

  • A5 - 2
  • B3 - 1
  • B7 - 1
  • C3 - 1
  • C7 - 1
  • D2 - 1
  • D7 - 1

Change #1063237 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

Change #1063237 merged by Bking:

[operations/puppet@production] stat (dse) hosts: enable CPU performance governor

https://gerrit.wikimedia.org/r/1063237

As mentioned in T372416, we've had some issues with stat hosts becoming unresponsive due to high load. After a little testing on stat1010 (which caused load to drop precipitously), I enabled the performance governor on all stat hosts. Based on the 5x drop in load and my previous experience with the governor in T336443, I believe this should significantly increase performance.

stat1010_load.png (attached graph showing the load drop on stat1010, 130 KB)
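
For anyone repeating this kind of single-host test, the change Puppet ultimately rolls out amounts to something like the following (a sketch assuming the Debian cpufrequtils package, which is what the earlier audits were reading; normally this is managed by Puppet rather than by hand):

# Persist the governor choice and apply it.
echo 'GOVERNOR=performance' | sudo tee /etc/default/cpufrequtils
sudo systemctl restart cpufrequtils

# Sanity-check the runtime value and the resulting core clocks.
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
grep 'cpu MHz' /proc/cpuinfo | sort -u | head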

Change #1072529 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Enable the performance CPU governor on Hadoop workers

https://gerrit.wikimedia.org/r/1072529


Just checking, @wiki_willy, are you still happy for me to enable the performance governor on the Hadoop workers (an-worker1*|analytics1*)?

I have created a patch here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072529

...but I wanted to double-check with you before rolling it out. Do you need us to do a before/after snapshot of the power usage?

an-worker[1078-1175]

  • A2 - 6
  • A4 - 6
  • A7 - 7
  • B2 - 5
  • B4 - 5 (currently pulling more power)
  • B7 - 5
  • C2 - 6
  • C4 - 7
  • C7 - 5
  • D2 - 4
  • D4 - 6
  • D7 - 6
  • E1 - 3
  • E2 - 1
  • E3 - 1
  • E5 - 3
  • E6 - 3
  • E7 - 3
  • F1 - 3
  • F2 - 1
  • F3 - 2
  • F5 - 3
  • F6 - 3
  • F7 - 3

analytics[1070-1077]

  • A5 - 2
  • B3 - 1
  • B7 - 1
  • C3 - 1
  • C7 - 1
  • D2 - 1
  • D7 - 1

@Jclark-ctr and @VRiley-WMF - can you confirm if we're ok with the Data Platform team increasing power on the hosts listed above? Thanks, Willy

@wiki_willy @BTullis Looking over the racks, the only one I am concerned about is B4-eqiad. We are already getting over-power alerts for it. We would need to move some servers out of that rack.


Thanks @Jclark-ctr - If it helps, we could decommission an-presto1004 pretty much immediately, as it is due for a refresh in T374924: Bring an-presto10[16-20] into service to replace an-presto100[1-5] anyway.
Do you think that this would stop the power alerts for that rack, or will we definitely need to do more?

The next on the list would probably be an-worker1085, which is also a little overdue for decom in T353784: Decommission an-worker10[78-95] & an-worker1116.
We'd rather keep this running for a little while longer, if possible, but it's not critical that we do so.

@Jclark-ctr - I have re-worked the patch so that the five an-worker servers in eqiad B4 don't receive the updated CPU governor configuration, so there should be no change to the power consumption in that rack.
Are you happy for us to go ahead with this change to the other servers sometime next week?

Change #1090900 had a related patch set uploaded (by Brouberol; author: Brouberol):

[operations/deployment-charts@master] datahub: leverage liveness and readiness probes for the gms and consumers

https://gerrit.wikimedia.org/r/1090900

Change #1090900 merged by Brouberol:

[operations/deployment-charts@master] datahub: leverage liveness and readiness probes for the gms and consumers

https://gerrit.wikimedia.org/r/1090900

Change #1072529 merged by Btullis:

[operations/puppet@production] Enable the performance CPU governor on Hadoop workers

https://gerrit.wikimedia.org/r/1072529

Mentioned in SAL (#wikimedia-analytics) [2024-11-25T13:55:36Z] <btullis> enabled the performance CPU governor across the Hadoop cluster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072529 for T362922
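
A post-rollout check along the lines of the earlier audits should show the new setting everywhere except the excluded eqiad B4 workers, since cumin groups hosts by identical output; a sketch using the A:hadoop-all alias referenced earlier in this task:

sudo cumin --no-progress A:hadoop-all 'cat /etc/default/cpufrequtils 2>/dev/null; sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'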