
new external storage cluster(s)
Closed, Resolved · Public

Description

Our current external storage clusters have just reached 90% disk usage and Icinga sent warnings. We still have some months to spare; the current clusters have been filling up for at least 2.5 years.

We need to:

  • Review the ES server spec, mostly just to maximize disk space.
  • Order and provision 6 to 8 nodes for EQIAD (and the same for CODFW, so 12 or 16 in total)

MediaWiki writes to multiple active ES clusters in order to avoid a SPOF. A cluster is a minimum of 3 nodes (1 master, 2 slaves), but ideally 4 nodes for capacity (1 master, 3 slaves).

Related Objects

Status      Subtype    Assigned    Task
Stalled     None
Resolved               jcrespo

Event Timeline


The first thing we need to determine is the future of this service: should new clusters or shards be added from the application's point of view, or should we be 100% transparent to the application and just add extra capacity (both disk and memory) to the existing servers?

@Springle pointed out that the InnoDB buffer pool efficiency is far from perfect, and that this will need more than extra storage. This is a good moment to gather a bit of feedback from the application side about the prospects for this service.

If just adding space isn't as efficient for InnoDB and we would do better with new servers, then let's do that. As long as the existing data is copied over to a new cluster, I don't see why we couldn't. It would require a bit of read-only time for the affected wikis (which can be kept minimal), but otherwise I don't see any real blockers here from the application POV... just some config changes.

I'm less sure on the "can we do better in MW" with regards to better compression of blobs and so forth. That's more a question for @tstarling or @aaron...

It would require a bit of read-only time

Do not worry about the operational part; thanks to replication, the impact would be minimal, if any. :-)

predicted_space_usage.png (387×747 px, 25 KB)

Please note that hardware purchase, installation and data migration take months.

From a high-level point of view, there are 2 main options here:

  • Keep the old servers, which have no warranty and, right now, no replacement parts (see for example T103843), and buy only a batch of larger disks (they probably cannot just be added to the existing ones).
  • Decommission the old servers in line with the renewal policy and buy more powerful servers, which would allow a) consolidation, doing more with fewer servers, and faster, b) fewer human resources needed thanks to newer parts, and c) warranty-covered replacements for some number of years.

One important constraint is that the service should be up and running by November (see previous graphic).

We could buy new servers, immediately configure MW to write to the new cluster, then recompress the old cluster, and decommission it when recompression is done (say 3-6 months).

We're currently using about 2.7 GB per day, for each of the two ES clusters (by linear regression on yearly ganglia part_max_used). At that rate, we would need about 4900 GB for the next 5 years, plus about 600GB for recompressed old data, so say 5500 GB. That should give plenty of headroom if we plan for another recompression/decommission cycle after 3 years.

If there is no recompression, and we just migrate the ~2700 GB of old data, the projected disk usage after 5 years will be more like 8100 GB per cluster.
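
For illustration, a minimal sketch of the kind of linear projection described above (not the actual method or script used; the usage_gb series below is synthetic, and the 600 GB recompressed-legacy estimate is taken from the previous paragraph):

import numpy as np

# Synthetic stand-in for a year of daily per-cluster disk-usage samples
# (e.g. ganglia's part_max_used), growing by roughly 2.7 GB/day plus noise.
rng = np.random.default_rng(0)
usage_gb = 2000 + 2.7 * np.arange(365) + rng.normal(0, 5, 365)

# Fit a line to the samples; the slope is the growth rate in GB/day.
days = np.arange(len(usage_gb))
slope, _ = np.polyfit(days, usage_gb, 1)

growth_5y_gb = slope * 5 * 365              # new data written over 5 years
recompressed_legacy_gb = 600                # estimate from the paragraph above
print(f"growth: {slope:.1f} GB/day; 5-year growth: {growth_5y_gb:.0f} GB; "
      f"with recompressed legacy data: ~{growth_5y_gb + recompressed_legacy_gb:.0f} GB")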

It probably makes sense to have a dedicated cold storage cluster, instead of putting the newly recompressed data in the active cluster, since the hardware parameters are a bit different. I see that es1001-1004 are currently used for cold storage. They have a read load in the vicinity of 1 MB/s per server, 7.8 TB of disk usage and 1.3 TB free, so they are a bit overpowered with 12 spindles. Presumably a lot of that 7.8 TB has not been recompressed, especially cluster22 and cluster23.

So I am thinking that we want 3 new clusters per datacenter: 2 active and 1 cold. We switch MW to write to the 2 new clusters, then recompress as much as possible into the cold cluster, and also migrate any legacy data to it such as cluster1/cluster2. Then decommission es1001-es1010.

So, summarizing for @RobH and @Springle:

  • 3 nodes per cluster (ideally 4) × 3 clusters × 2 datacenters = 18-24 nodes. 2 of the 3 clusters per datacenter are needed within a hard deadline of less than 3 months.
  • Hardware RAID
  • 12 TB of raw HD storage (6 TB usable on RAID 10) for a 3-year provision, 16 TB for a 5-year provision (most important; see the sketch after this list)
  • 96 or 128 GB of memory
  • IO-bound: if we have to choose between investing in memory vs. CPU vs. disk, invest in disks, but we do not need SSDs (throughput over latency)
  • Do not invest in many CPUs (MySQL does not parallelize over 32 cores)
  • Soft requirement: try to homogenize hardware for future MySQL production purchases (for example, T106847)
  • Warranty, especially for disk replacements!
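
As a rough cross-check of the disk sizing above, a small sketch (assuming RAID 10 halves raw capacity and reusing the ~2.7 GB/day and ~600 GB estimates from earlier in this task; the spec figures above leave extra headroom beyond this naive linear projection):

RAID10_FACTOR = 0.5          # usable fraction of raw capacity under RAID 10
GROWTH_GB_PER_DAY = 2.7      # per active cluster, from the ganglia-based estimate
LEGACY_GB = 600              # recompressed legacy data carried over

def projected_usage_gb(years):
    """Naive per-cluster usage after `years` of linear growth."""
    return LEGACY_GB + GROWTH_GB_PER_DAY * 365 * years

for raw_tb, years in ((12, 3), (16, 5)):
    usable_gb = raw_tb * 1000 * RAID10_FACTOR
    print(f"{raw_tb} TB raw -> {usable_gb:.0f} GB usable on RAID 10; "
          f"projected usage after {years} years: ~{projected_usage_gb(years):.0f} GB")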

+1 to the provisioning.

Also +1 to Tim's plan, once we get the hardware.

Some nodes, like es1009, are down to 6% available disk space.

Hardware has already been ordered for eqiad.

es1005's RAID is degraded (failed disk in slot 4, with a S.M.A.R.T. alert). We will not replace the failed disk, as full server replacements are on their way in:

                Device Present
                ================
Virtual Drives    : 1 
  Degraded        : 1 
  Offline         : 0 
Physical Devices  : 14 
  Disks           : 12 
  Critical Disks  : 4 
  Failed Disks    : 1
Enclosure Device ID: 32
Slot Number: 4
Drive's position: DiskGroup: 0, Span: 2, Arm: 0
Enclosure position: N/A
Device Id: 4
WWN: 5000C50054E8DE5C
Sequence Number: 3
Media Error Count: 3075
Other Error Count: 15
Predictive Failure Count: 71
Last Predictive Failure Event Seq Number: 13840
PD Type: SAS

Raw Size: 558.911 GB [0x45dd2fb0 Sectors]
Non Coerced Size: 558.411 GB [0x45cd2fb0 Sectors]
Coerced Size: 558.375 GB [0x45cc0000 Sectors]
Sector Size:  0
Firmware state: Failed
Device Firmware Level: ES65
Shield Counter: 0
Successful diagnostics completion on :  N/A
SAS Address(0): 0x5000c50054e8de5d
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST3600057SS     ES656SL4G6AK            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None 
Device Speed: 6.0Gb/s 
Link Speed: 6.0Gb/s 
Media Type: Hard Disk Device
Drive Temperature :53C (127.40 F)
PI Eligibility:  No 
Drive is formatted for PI information:  No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s 
Port-1 :
Port status: Active
Port's Linkspeed: Unknown 
Drive has flagged a S.M.A.R.T alert : Yes
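
(For reference, a minimal sketch of how physical-drive output like the above, e.g. from MegaCli's PDList, could be scanned for failing disks; this is a hypothetical helper rather than a tool used in this task, and it only looks at fields visible in the dump.)

import sys

def parse_drives(text):
    """Split MegaCli PDList-style output into one dict of 'Key: value' fields per drive."""
    drives, current = [], {}
    for line in text.splitlines():
        if line.startswith("Enclosure Device ID"):   # each physical-drive section starts here
            if current:                              # (the leading summary block becomes a
                drives.append(current)               #  harmless pseudo-drive with no state)
            current = {}
        if ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        drives.append(current)
    return drives

for d in parse_drives(sys.stdin.read()):
    failed = d.get("Firmware state", "").startswith("Failed")
    smart_alert = d.get("Drive has flagged a S.M.A.R.T alert", "") == "Yes"
    predictive = int(d.get("Predictive Failure Count", 0) or 0)
    if failed or smart_alert or predictive > 0:
        print(f"Slot {d.get('Slot Number')}: state={d.get('Firmware state')}, "
              f"media errors={d.get('Media Error Count')}, "
              f"predictive failures={predictive}, SMART alert={smart_alert}")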

es1009 and es1006, both read/write masters, are down to 5% free (175 GB). Any news about the order?

It shows delivered on 2015-08-20. https://rt.wikimedia.org/Ticket/Display.html?id=9524 has not been updated, even though tracking shows delivered.

I'm going to assign this task to Chris, as the RT is also assigned to him.

I think everything is here but the disks?

It would be nice to have 2 units mounted ASAP alongside es1006 and es1009 (which they will substitute), ready to clone the existing data. I've asked for comments on the data migration here: T106386#1571242

Change 234035 had a related patch set uploaded (by Jcrespo):
Adding new External Storage nodes as MariaDB::core

https://gerrit.wikimedia.org/r/234035

Change 234035 merged by Jcrespo:
Adding new External Storage nodes as MariaDB::core

https://gerrit.wikimedia.org/r/234035

Arrangement:

ES1
===
es1012 [A2]
es1018 [D1]
es1016 [C2]

ES2
===
es1011 [A2] MASTER
es1013 [B1]
es1015 [C2]

ES3
===
es1014 [B1] MASTER
es1017 [D1]
es1019 [D3]

Change 234225 had a related patch set uploaded (by Jcrespo):
Reorganization of new External Storage nodes

https://gerrit.wikimedia.org/r/234225

Change 234225 merged by Jcrespo:
Reorganization of new External Storage nodes

https://gerrit.wikimedia.org/r/234225

Change 234244 had a related patch set uploaded (by Jcrespo):
Pool es1011, depool es1008 as storage nodes

https://gerrit.wikimedia.org/r/234244

Change 234244 merged by Jcrespo:
Pool es1011, depool es1008 as storage nodes

https://gerrit.wikimedia.org/r/234244

Change 234479 had a related patch set uploaded (by Jcrespo):
Depool es1001 for cloning, increase es1011 weight, pool es1014

https://gerrit.wikimedia.org/r/234479

Change 234479 merged by Jcrespo:
Depool es1001 for cloning, increase es1011 weight, pool es1014

https://gerrit.wikimedia.org/r/234479

The es1 servers may be some of our oldest servers for critical production usage (16GB of memory!). They still have 1.8TB drives, but they could be as old as 4 years, according to some (imported) phabricator tickets.

Change 234965 had a related patch set uploaded (by Jcrespo):
Depool es1007 for maintenance

https://gerrit.wikimedia.org/r/234965

Change 234965 merged by Jcrespo:
Depool es1007 for maintenance

https://gerrit.wikimedia.org/r/234965

Change 235192 had a related patch set uploaded (by Jcrespo):
Repool es1007, pool es1013 for the first time

https://gerrit.wikimedia.org/r/235192

Change 235192 merged by Jcrespo:
Repool es1007, pool es1013 for the first time

https://gerrit.wikimedia.org/r/235192

Change 235193 had a related patch set uploaded (by Jcrespo):
Depool es1010 to clone it to es1017

https://gerrit.wikimedia.org/r/235193

Change 235193 merged by Jcrespo:
Depool es1010 to clone it to es1017

https://gerrit.wikimedia.org/r/235193

Thanks again, I've already seen the entries on racktables! Was waiting for that to fully own it.

Change 235276 had a related patch set uploaded (by Jcrespo):
Repool es1010, pool es1017 for the first time

https://gerrit.wikimedia.org/r/235276

Change 235276 merged by Jcrespo:
Repool es1010, pool es1017 for the first time

https://gerrit.wikimedia.org/r/235276

Change 235423 had a related patch set uploaded (by Jcrespo):
Depool es1002 in order to clone it to new server es1016

https://gerrit.wikimedia.org/r/235423

Change 235423 merged by Jcrespo:
Depool es1002 in order to clone it to new server es1016

https://gerrit.wikimedia.org/r/235423

Change 235735 had a related patch set uploaded (by Jcrespo):
Switchover of es2 master from es1006 to es1011

https://gerrit.wikimedia.org/r/235735

Change 235735 merged by Jcrespo:
Switchover of es2 master from es1006 to es1011

https://gerrit.wikimedia.org/r/235735

After several clones and failovers, the new nodes are working as masters, alongside the old nodes. I will now slowly start depooling the old nodes so that they can be decommissioned soon.

Change 237414 had a related patch set uploaded (by Jcrespo):
Depool es1001 for decommision; increase weight of es1015 and es1019

https://gerrit.wikimedia.org/r/237414

Change 237414 merged by Jcrespo:
Depool es1001 for decommision; increase weight of es1015 and es1019

https://gerrit.wikimedia.org/r/237414

Change 238494 had a related patch set uploaded (by Jcrespo):
Depool es1003, es1004, es1007 and es1010 for decommision

https://gerrit.wikimedia.org/r/238494

Change 238494 merged by Jcrespo:
Depool es1003, es1004, es1007 and es1010 for decommision

https://gerrit.wikimedia.org/r/238494

This is almost completed: we just need to wait for the old servers to finish processing dump queries, stop their mysqld instances to confirm they can indeed be stopped, and clean up the configuration file.

The old eqiad servers are no longer in use; created T113080.

Not closing because codfw is still pending.

Actually, closing, tracking codfw separately.

jcrespo mentioned this in Unknown Object (Task). Jan 27 2016, 1:36 PM