wiki_willy
User

User Details

User Since
Apr 16 2019, 9:00 PM (292 w, 6 d)
Availability
Available
LDAP User
Wpao
MediaWiki User
WPao (WMF) [ Global Accounts ]

Recent Activity

Wed, Nov 13

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Ah that makes sense, thanks for the info. We'll go ahead and move the server, after the Phabricator task is created. FWIW, all servers being ordered this fiscal year and moving forward will have 10g cards...and the refresh/upgrade to 10g switches in eqiad for rows C and D is supposed to happen probably later in Q4.

The new server is already in service. The main reason I brought this up is the process we had to go through to get a 10G card in wikikube-ctrl1001, because we need the extra bandwidth. I think that to do so, we'll need to choose a server in a rack that has free 10G ports and re-cable. I'll file a separate task

Wed, Nov 13, 9:42 PM · decommission-hardware

Tue, Nov 12

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Hi @akosiaris - thanks for confirming. I think we already ordered the replacement host though via T368933. You're welcome to continue using wikikube-ctrl1001 for a longer period of time though, and dedicate the new server for something else in the meantime if you want?

Tue, Nov 12, 9:34 PM · decommission-hardware

Wed, Nov 6

wiki_willy updated subscribers of T371984: Q1:rack/setup/install backup2012.

Hi @Jhancock.wm and @Papaul - just a heads up, it looks like the test controller kit arrived yesterday:

Wed, Nov 6, 7:25 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
wiki_willy updated subscribers of T371416: Q1:rack/setup/install backup1012.

Just a heads up @Jclark-ctr & @VRiley-WMF - the test controller kit should've arrived yesterday:

Wed, Nov 6, 7:23 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Mon, Nov 4

wiki_willy renamed T378828: Q2:eqiad:(6) Ceph cluster expansion - custom config 10g from Q2:eqiad:(12) Ceph cluster expansion - custom config 10g to Q2:eqiad:(6) Ceph cluster expansion - custom config 10g.
Mon, Nov 4, 8:23 PM · DC-Ops

Thu, Oct 31

wiki_willy added a comment to T378584: Evaluate hw-raid controllers for Supermicro's Config J.

Met with the Supermicro team today, who believes the RAID kit should be approved either today or tomorrow, and shipped out after that. For reference, here are some details they sent us below:

Thu, Oct 31, 6:32 PM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops

Wed, Oct 30

wiki_willy added a comment to T378584: Evaluate hw-raid controllers for Supermicro's Config J.

Meeting set with Supermicro team on October 31 at 3pm UTC, to discuss the proposed RAID controller option and address any outstanding questions that we have. @Volans, @elukey, @RobH, @Papaul, and myself are all on the invite titled "SMC/Wiki RAID Controller Discussion," but please let Richard from Supermicro know, if you need to propose a different meeting time. Thanks, Willy

Wed, Oct 30, 9:15 PM · SRE-swift-storage, Infrastructure-Foundations, Data-Persistence, DC-Ops
wiki_willy added a comment to T371416: Q1:rack/setup/install backup1012.

Thanks so much @jcrespo, I appreciate your flexibility and patience on this.

Wed, Oct 30, 8:12 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Tue, Oct 29

wiki_willy added a comment to T371416: Q1:rack/setup/install backup1012.

Thanks for the context, Jaime. Based on your current needs and with the time constraints, it sounds like it'll be better having you continue working on the host in its current state. While we're escalating everything with Supermicro, it's been a bit difficult getting some solid ETAs in place. There's also the possibility that unexpected issues could pop up, and I don't want to potentially delay things any further.

Tue, Oct 29, 11:23 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T371416: Q1:rack/setup/install backup1012.

Hi @jcrespo - thanks for your feedback on this. My apologies that these Config J servers have been causing a lot of headaches. Unfortunately, we still have to figure out how to best resolve the performance issues from the RAID controller. In your opinion, what would work best? For example, would it work better if we set up a Config J server with the upgraded RAID controller first, and then migrated the data after? Let me know your preference, and we'll do our best to work around and accommodate that.

Tue, Oct 29, 4:52 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops

Mon, Oct 28

wiki_willy reopened T371984: Q1:rack/setup/install backup2012 as "Open".

Re-opening this task, since we have the incorrect RAID controller on the server. @RobH is currently working with Supermicro on getting an upgraded RAID controller onsite to hopefully resolve the performance issues being seen. @RobH - please continue following up with Supermicro with ETAs and statuses, and post them here for visibility. Thanks, Willy

Mon, Oct 28, 8:18 PM · SRE, Data-Persistence, Data-Persistence-Backup, ops-codfw, DC-Ops
wiki_willy reopened T371984: Q1:rack/setup/install backup2012, a subtask of T376892: Expand media backup storage available space to 960 TB per datacenter, as Open.
Mon, Oct 28, 8:15 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE
wiki_willy reopened T371416: Q1:rack/setup/install backup1012 as "Open".

Re-opening this task, as the server has the incorrect RAID controller. We're working with Supermicro to get an upgraded RAID controller sent onsite, to replace and hopefully resolve the performance issues being seen. @RobH - can you provide frequent updates in this task and work closely with Supermicro on getting the part, until we have this issue resolved? Thanks, Willy

Mon, Oct 28, 8:13 PM · SRE, Data-Persistence-Backup, Data-Persistence, ops-eqiad, DC-Ops
wiki_willy reopened T371416: Q1:rack/setup/install backup1012, a subtask of T376892: Expand media backup storage available space to 960 TB per datacenter, as Open.
Mon, Oct 28, 8:12 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, SRE

Oct 23 2024

wiki_willy added a project to T309598: hosts have Mutiple PTR records : Infrastructure-Foundations.
Oct 23 2024, 8:21 PM · Infrastructure-Foundations, DC-Ops
wiki_willy added a comment to T377568: wmcs codfw hardware changes proposal.

Yup, agreed. If the servers can be reallocated for something else that is currently needed, I think it makes more sense to just repurpose them vs keeping them as spares or decommissioning them.

Oct 23 2024, 6:17 PM · Cloud-VPS, User-aborrero, cloud-services-team (Hardware)

Sep 28 2024

wiki_willy added a comment to T375842: decommission mw[1349-1413].

Sure, no problem @akosiaris. I'm having trouble finding the line item though for wikikube-ctrl1001 on the procurement doc. Is it part of the "Refresh of mw[1349-1413]"?

Sep 28 2024, 1:02 AM · decommission-hardware

Sep 26 2024

wiki_willy added a comment to T373993: CPU temperature issues in cp hosts.

Thanks for providing all the details on this, @ssingh. @RobH - as we chatted about earlier today, we could ask Ascenty to double-check that there are enough perf tiles in the cold aisle, confirm that the blanking panels are in place (and if not, add them), and possibly get a temperature and humidity reading in that area. Thanks, Willy

Sep 26 2024, 4:03 AM · SRE, ops-esams, ops-magru, DC-Ops, Traffic

Sep 25 2024

wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Thanks @dcaro. @Jclark-ctr is out the rest of this week, but should be able to ship these out when he's back next week.

Sep 25 2024, 3:04 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Sep 23 2024

wiki_willy updated subscribers of T375257: Degraded RAID on es1022.

Hi @ABran-WMF - can you check with the onsite engineers @VRiley-WMF and @Jclark-ctr? Please also keep in mind this server is due to be refreshed in Q2, so a new system will be on its way in another month or so.

Sep 23 2024, 5:04 PM · SRE, DBA, ops-eqiad, DC-Ops
wiki_willy updated subscribers of T375382: Post pc1013 crash.

@Jclark-ctr & @VRiley-WMF, who can see if there are any parts available from decommissioned servers

Sep 23 2024, 4:58 PM · Wikimedia-production-error, Sustainability (Incident Followup), SRE, DBA

Sep 17 2024

wiki_willy created T375000: Repurposing 2x Decommissioned Servers for Phasing Out Puppet 5.
Sep 17 2024, 6:51 PM · SRE, ops-eqiad, DC-Ops

Sep 12 2024

wiki_willy added a comment to T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts.

@Jclark-ctr and @VRiley-WMF - can you confirm if we're ok with the Data Platform team increasing power on the hosts listed above? Thanks, Willy

Sep 12 2024, 9:47 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
wiki_willy assigned T373993: CPU temperature issues in cp hosts to RobH.
Sep 12 2024, 5:28 PM · SRE, ops-esams, ops-magru, DC-Ops, Traffic
wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

It looks like it'll be 3 drives minimum from the latest email today, and @Jclark-ctr - you can find the shipping label from Dawn's email on Sept 10. @dcaro - just let us know whenever the cluster is back up and how many disks you prefer to send out. Thanks, Willy

Sep 12 2024, 4:43 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Sep 3 2024

wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Thanks @dcaro, sounds good. I'll bug them again about the drive number, if we don't hear back by mid-week.

Sep 3 2024, 3:35 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Aug 29 2024

wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Hi @dcaro - just following up on this to see if you were ok with shipping these WMCS drives with data on them, back to Dell for identifying the root cause? From Dell's last email a couple weeks ago, they stated that they have a NDA with Hynix, along with the NDA with Wikimedia, which should cover any security concerns. To ensure we don't lose momentum, during my call with Dell today, I asked them to provide the number of drives they need and also a shipping label on where to send them to. Let us know though if you feel comfortable with sending the disks. Thanks, Willy

Aug 29 2024, 7:58 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Aug 12 2024

wiki_willy assigned T372208: Degraded RAID on es1029 to VRiley-WMF.

@VRiley-WMF - fyi, this one looks like it's high priority

Aug 12 2024, 2:51 PM · DBA, DC-Ops, SRE, ops-eqiad

Jul 18 2024

wiki_willy added a comment to T360356: Request access to servers Dcops group.

Thanks @elukey, that sounds good!

Jul 18 2024, 12:02 AM · User-Elukey, SRE, Infrastructure-Foundations

Jul 17 2024

wiki_willy updated subscribers of T369855: db1179 crashed - hardware issues.

Hi @ABran-WMF - can you work with the onsite engineers on this? cc'ing @VRiley-WMF & @Jclark-ctr

Jul 17 2024, 2:47 PM · SRE, DC-Ops, ops-eqiad, DBA

Jul 16 2024

wiki_willy updated subscribers of T364429: Q4:rack/setup/install an-conf100[4-6].
Jul 16 2024, 8:07 PM · SRE, Data-Engineering, ops-eqiad, DC-Ops

Jul 12 2024

wiki_willy added a comment to T363576: Broadcom NICs with recent firmware fail to reimage.

Thanks for testing this out @Papaul. Since it appears that upgrading the WMF environment to PXELINUX version 6.04 may fix this issue, who would be the best person to help us get that upgraded?

Jul 12 2024, 10:58 PM · Patch-For-Review, User-Elukey, DC-Ops, ops-codfw, Infrastructure-Foundations, SRE

Jul 11 2024

wiki_willy added a comment to T362033: Degraded RAID on aqs1013.

Hi @Eevans - I'll let @Jclark-ctr and @VRiley-WMF confirm your first two questions. From some of the feedback I've received though, it seems that the issue on both hosts started occurring after the drives first failed on both hosts. Since it's a software RAID, it makes me wonder if there might be an issue on that end of things. Would it be possible to test things out in a hardware RAID setup? In the meantime, I'm going to bump up the refresh of aqs1010 to Q1, so you can try using that server as a replacement to either aqs1013 or aqs1014 (your choice) to see how it responds.

Jul 11 2024, 8:18 PM · DC-Ops, Cassandra, SRE, ops-eqiad
wiki_willy updated subscribers of T369825: 10gbit nic option for centrallog1002.

@VRiley-WMF & @Jclark-ctr - can you see if we have any spare 10g NICs from decommissioned servers for this?

Jul 11 2024, 4:31 PM · SRE, DC-Ops, ops-eqiad
wiki_willy updated subscribers of T369826: 10gbit nic option for centrallog2002.

@Jhancock.wm & @Papaul - can you see if we have any spare 10g NICs from decommissioned servers for this?

Jul 11 2024, 4:31 PM · SRE, ops-codfw, DC-Ops

Jul 10 2024

wiki_willy added a comment to T360356: Request access to servers Dcops group.

Thanks for the input @cmooney. All your suggestions sound good to me, so feel free to swap out ifconfig with ip, bridge, traceroute, and lldpctl. Thanks!

Jul 10 2024, 3:02 PM · User-Elukey, SRE, Infrastructure-Foundations

Jul 3 2024

wiki_willy added a comment to T360356: Request access to servers Dcops group.

Thanks so much @elukey for putting this proposal together, and for the chat during office hours today. I like the entire idea, and will run it by the rest of the team during our staff meeting next week. For the first bullet around ssh access to all production nodes for a minimal list of read only sudo commands, I think we can just go ahead and proceed with this part. It'll be really beneficial in helping the Dc-Ops engineers troubleshoot/diagnose issues. My only ask here is to see if it's possible to expand the list of read only commands to include the following: dmesg, dmidecode, smartctl, nvme, edac-util, mdadm, ledctl, free, uptime, df, top, uname, ipmi-sensors, dhcp, ping, ifconfig. And if we're able to implement this part within a couple weeks, that'll be terrific.

Jul 3 2024, 5:10 PM · User-Elukey, SRE, Infrastructure-Foundations
wiki_willy added a comment to T362033: Degraded RAID on aqs1013.

Hi @Eevans - since we've replaced all hardware parts on this host, and the error is still showing up, it doesn't seem like it's a hardware problem. It's also really odd that aqs1014 is also failing on the same exact drive slot. Have you looked into possible software or configuration issues with the software RAID that could be contributing to this? Also, were there any upgrades, maintenances, or any changes that happened right before the drive had first failed?

Jul 3 2024, 12:15 AM · DC-Ops, Cassandra, SRE, ops-eqiad

Jun 20 2024

wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

During my call with the Dell Account team today, I asked them to push on this a bit more. The Dell Tech Support engineer hasn't been able to replicate the issue on his end, but I asked the Account team what the ramifications would be for Dell if they were to just ship us the 100 replacement disks for all 14x servers (ie: would they not be able to RMA it with the drive manufacturer, etc.). So, they're going to follow up and get back to me next week. Thanks, Willy

Jun 20 2024, 5:57 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Jun 19 2024

wiki_willy added a comment to T304483: PXE boot NIC firmware regression .

Hi @Papaul - can you add the Dell Support ticket that you created in this Phabricator task, and provide any updates/progress on how that's going? Thanks, Willy

Jun 19 2024, 4:16 PM · Infrastructure-Foundations, DC-Ops

Jun 18 2024

wiki_willy updated subscribers of T367854: db1165 network flapping issues.
Jun 18 2024, 4:24 PM · SRE, ops-eqiad, DC-Ops, DBA

Jun 11 2024

wiki_willy updated subscribers of T367232: netbox model cleanup and additions June 2024.

Cool, thanks @RobH. Adding @VRiley-WMF and @Jhancock.wm for visibility also, since I think they were working on this

Jun 11 2024, 9:32 PM · DC-Ops

Jun 10 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

Thanks @Volans, will do on the remaining Netbox errors.

Jun 10 2024, 6:06 PM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw

Jun 6 2024

wiki_willy added a comment to T360895: Memory upgrade request for prometheus200[56].

T354685 looks like it was upgraded in January, but this task was created afterwards on March 25. @herron - do you still need this request done?

Jun 6 2024, 10:15 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics
Restricted Application added a project to T360895: Memory upgrade request for prometheus200[56]: DC-Ops.

@Papaul & @Jhancock.wm - was this one completed already via a different task?

Jun 6 2024, 9:54 PM · DC-Ops, SRE, ops-codfw, Observability-Metrics
wiki_willy added a comment to T366102: Patch circiut CRT-008647.

Valerie is on vacation, so assigning to John

Jun 6 2024, 9:34 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops, netops
wiki_willy assigned T366102: Patch circiut CRT-008647 to Jclark-ctr.
Jun 6 2024, 9:32 PM · SRE, Infrastructure-Foundations, ops-eqiad, DC-Ops, netops
wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

Ok, got it. Thanks for the info @dcaro. And just to confirm, cloudcephosd1001-1020 have the same hardware configuration (only with different drive manufacturers), and don't have any of the same issues as cloudcephosd1021-1034? Let's see what the Dell team comes back with after escalating up, and hopefully we can make some more headway there.

Jun 6 2024, 6:49 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS
wiki_willy added a comment to T348643: cloudcephosd1021-1034: hard drive sector errors increasing.

During my sync up call with Dell today, I asked our account team to see if they could push a bit more to get more hard drives RMA'd. The servers are still under warranty for a few more months, and they're going to escalate it up the chain a bit more, to see what they're going to do. In the meantime though, can we look into whether something else might've changed when all these drives started having bad sectors? It looks like we installed this batch of servers back in December 2021, then they were put in production in 2022. So it seems like they were running ok for a year, until the drive errors started popping up at the end of 2023.

Jun 6 2024, 4:58 PM · Ceph, cloud-services-team (FY2024/2025-Q1-Q2), SRE, ops-eqiad, DC-Ops, Cloud-VPS

Jun 5 2024

wiki_willy updated subscribers of T364870: Q4:rack/setup/install new cloudcephmon hosts.

Hi @dcaro - just following up on this. Can you provide the racking information for us, to start this install?

Jun 5 2024, 7:08 PM · Patch-For-Review, SRE, cloud-services-team (Hardware), ops-eqiad, DC-Ops

May 29 2024

wiki_willy updated the task description for T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G.
May 29 2024, 6:49 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
wiki_willy updated the task description for T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G.
May 29 2024, 6:49 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
wiki_willy shifted T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G from the Restricted Space space to the S1 Public space.
May 29 2024, 6:48 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
wiki_willy moved T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G from Procurement to Backlog on the ops-codfw board.
May 29 2024, 6:48 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
wiki_willy shifted T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G from the Restricted Space space to the S1 Public space.
May 29 2024, 6:48 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
wiki_willy assigned T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G to VRiley-WMF.
May 29 2024, 6:46 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
wiki_willy moved T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G from Procurement to Backlog on the ops-eqiad board.
May 29 2024, 6:46 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad
wiki_willy reassigned T366205: codfw:(3) wikikube-ctrl NIC upgrade to 10G from RobH to Papaul.

Removing the procurement tag, since we have 10g cards available from decom'd hosts. @Papaul - can you work with @kamila on getting these upgraded and migrated to 10g switches (if needed)? Thanks, Willy

May 29 2024, 6:44 PM · SRE-OnFire, Sustainability (Incident Followup), serviceops, ops-codfw, DC-Ops
wiki_willy removed a project from T366204: eqiad:(3) wikikube-ctrl NIC upgrade to 10G: procurement.

Removing the procurement project tag. We have spares from decom'd servers that we can use for this, instead of purchasing the 10g cards. @VRiley-WMF - can you work with @kamila on getting these hosts upgraded and moved to 10g switches?

May 29 2024, 6:42 PM · SRE, Sustainability (Incident Followup), SRE-OnFire, DC-Ops, serviceops, ops-eqiad

May 24 2024

wiki_willy updated subscribers of T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts.

Thanks for the heads up @bking. I went ahead and checked Netbox, just to ensure all the servers were dispersed pretty evenly across the different racks...which they are (listed below is the rack and the quantity of servers in each rack). For reference, the bolded line items are the racks that are currently pulling a bit more on power. We could do a before and after snapshot using Grafana (https://grafana.wikimedia.org/d/f64mmDzMz/power-usage?orgId=1&from=now-30d&to=now), though I have a feeling we should still be ok with the increased power.

May 24 2024, 7:09 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)

Apr 15 2024

wiki_willy closed T296966: eqiad: Master Tracking Ticket for eqiad expansion cage as Resolved.

Since the only thing remaining in this task is bringing up the Dell switches in racks E8 and F8 (which I believe the Network SRE team is working on), I'm going to go ahead and resolve the main tracking ticket. Thanks, Willy

Apr 15 2024, 5:21 PM · SRE, ops-eqiad, DC-Ops

Apr 3 2024

wiki_willy added a comment to T336320: scrape RT ticket HTML files.

Sure, no prob @LSobanski. Here's the list of the 24 active devices that still reference RT tasks in Netbox, along with their purchase dates (network equipment usually EOLs every 8yrs):

Apr 3 2024, 7:17 PM · collaboration-services

Apr 2 2024

wiki_willy added a comment to T336320: scrape RT ticket HTML files.

Thanks for checking @LSobanski. It's definitely rare that we need to refer back to RT. In the last 5 years, the 2-3 cases where we've had to reference RT were typically due to tracking down information about core routers that we had purchased back then. In Netbox, we only have 24 active devices left that still reference RT tasks. As long as we're able to access these in some way (ideally quickly and easily) on the rare occasions that it's needed, you should be able to proceed with moving forward.

Apr 2 2024, 7:22 PM · collaboration-services

Mar 19 2024

wiki_willy added a comment to T360297: Take advantage of 10Gb NICs in the new network stack.

Hi @elukey - do you want me to change the Lift Wing expansion requests for 16x servers in FY24-25 to 10g? Thanks, Willy

Mar 19 2024, 4:41 PM · Infrastructure-Foundations, DC-Ops, netops

Mar 13 2024

wiki_willy updated subscribers of T359940: hw troubleshooting: Unidentified for db1246.eqiad.wmnet.

@VRiley-WMF & @Jclark-ctr for troubleshooting the hardware. (host was installed a few quarters ago)

Mar 13 2024, 2:09 PM · DBA, SRE, ops-eqiad, DC-Ops

Mar 5 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

Sounds good. @Jhancock.wm - I created a new sheet below, with the following fields. I entered in the hostnames and asset tag, but can you fill in the remaining items for old S/N, new S/N, and Phabricator Task?

Mar 5 2024, 12:06 AM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw

Mar 4 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

Thanks for confirming, @Volans. If everyone else is ok with making the correlation on the accounting spreadsheet, my vote is that we go with that route. Thanks, Willy

Mar 4 2024, 10:06 PM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw

Mar 1 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

Thanks @Volans, that makes sense. My preference would be to leave Netbox as is, and use the accounting spreadsheet to make the S/N connection to each other. Would we be adding a different tab on the accounting spreadsheet for that?

Mar 1 2024, 12:39 AM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw

Feb 29 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

If we change the serial number, I think it would create an error for S/N / Asset tag mismatch (related to Riccardo's points earlier). We also reference the original chassis S/N when dealing with vendors for recycling servers (estimates, official documentation, etc) and purchasing replacement parts, so I'm still a bit hesitant with editing the S/N in Netbox as the solution. Since it doesn't sound like we receive any Netbox alerts when we replace a motherboard with a new one, is there something that we could tweak to replicate the same thing (ie: change the status or something of the donor server)? Or worst case, just suppress these alerts somehow, until they're eventually decommissioned?

Feb 29 2024, 9:26 PM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw

Feb 28 2024

wiki_willy added a comment to T358542: Netbox errors caused by system board replacement .

Hey @Volans - much appreciated for your feedback and for the suggestions. I was wondering since the physical serial number listed on the chassis doesn't change (it's only from a Puppet perspective that the serial number changes), is there anything on the Puppet side that could be modified to reflect the MB replacement? If there's something easy that could be done in Puppet to prevent the Netbox error from alerting, I kind of feel like it would be a more accurate representation.

Feb 28 2024, 11:18 PM · Patch-For-Review, Infrastructure-Foundations, DC-Ops, SRE-tools, SRE, ops-codfw
wiki_willy updated subscribers of T358727: Reclaim recently-decommed CP host for WDQS (see T352253).

@VRiley-WMF and @Jclark-ctr - can one of you pick up this request? We'll be repurposing one of the previously decommissioned cp servers to set up a temp server for Adam to use. Thanks, Willy

Feb 28 2024, 10:02 PM · Discovery-Search (Current work), Data-Platform-SRE (2024.03.04 - 2024.03.24), wmde-wikidata-tech, Wikidata, SRE, ops-eqiad
wiki_willy added a comment to T352253: Decommission task for old cp hosts (cp1075-1090).

Sounds good @bking, thanks!

Feb 28 2024, 8:59 PM · SRE, ops-eqiad, DC-Ops, Traffic
wiki_willy added a comment to T358533: Hardware requests for Search Platform FY2024-2025.

Hi @bking - thanks for coming up with the list. I have the following refreshes already on the CapEx doc, so you just have to fill in the missing columns for "Hardware Config", "Network Speed" and "Total Equipment Cost" (for custom configs)

Feb 28 2024, 5:17 PM · Data-Platform-SRE (2024.03.25 - 2024.04.14)

Feb 27 2024

wiki_willy added a comment to T358421: db2118 crashed and rebooted due to HW.

Thanks for picking this up @Jhancock.wm. @Marostegui - since this host looks like it's close to being refreshed in T355350, do you want to just wait for the refreshed server to be setup instead of fixing this one? Thanks, Willy

Feb 27 2024, 2:10 AM · Wikimedia-Incident, DBA, SRE

Feb 26 2024

wiki_willy updated subscribers of T358421: db2118 crashed and rebooted due to HW.

@wiki_willy can we contact the vendor about this issue which caused a reboot?

Record:      27
Date/Time:   02/24/2024 10:08:18
Source:      system
Severity:    Critical
Description: CPU 1 machine check error detected.
Feb 26 2024, 6:19 PM · Wikimedia-Incident, DBA, SRE

Feb 23 2024

wiki_willy added a comment to T352253: Decommission task for old cp hosts (cp1075-1090).

Hi @ssingh - the hardware should still be around, and we should be able to reallocate one of them for testing purposes. Can you open a new Phabricator task for us with all the necessary details (hostname, racking info, network setup, raid/partitioning, OS, and main poc)? Also, do you know how long Adam would need it for?

Feb 23 2024, 5:51 PM · SRE, ops-eqiad, DC-Ops, Traffic

Feb 21 2024

wiki_willy added a project to T357951: db2137 and es2026 don't get an IP via PXE boot: ops-codfw.

@Jhancock.wm for visibility and in case any onsite support is needed

Feb 21 2024, 3:56 PM · SRE, ops-codfw, DC-Ops

Feb 8 2024

wiki_willy assigned T357015: Degraded RAID on db2194 to Jhancock.wm.

@Jhancock.wm

Feb 8 2024, 5:14 PM · DBA, SRE, ops-codfw

Jan 10 2024

wiki_willy added a comment to T354606: Investigate memory increase for Prometheus hosts in codfw/eqiad.

Thanks @VRiley-WMF. I have T354684 assigned over to you, so you can work with @fgiunchedi on coordinating downtime for the upgrades. Thanks, Willy

Jan 10 2024, 9:35 PM · SRE, ops-codfw, ops-eqiad, Observability-Metrics
wiki_willy assigned T354684: RAM upgrade for prometheus100[56] to VRiley-WMF.
Jan 10 2024, 9:32 PM · SRE, ops-eqiad, Observability-Metrics

Jan 9 2024

wiki_willy added a comment to T354606: Investigate memory increase for Prometheus hosts in codfw/eqiad.

Awesome, thanks @Jhancock.wm. Here's the codfw upgrade ticket for you to coordinate with @fgiunchedi on the downtime - T354685. Thanks, Willy

Jan 9 2024, 6:36 PM · SRE, ops-codfw, ops-eqiad, Observability-Metrics
wiki_willy assigned T354685: RAM upgrade for prometheus200[56] to Jhancock.wm.
Jan 9 2024, 6:34 PM · SRE, ops-codfw, Observability-Metrics
wiki_willy updated subscribers of T354591: db1224 crashed - hardware error.

@Jclark-ctr & @VRiley-WMF

Jan 9 2024, 5:24 PM · SRE, DC-Ops, ops-eqiad, DBA
wiki_willy updated subscribers of T354606: Investigate memory increase for Prometheus hosts in codfw/eqiad.

@Papaul / @Jhancock.wm and @Jclark-ctr / @VRiley-WMF - can you see if you have any spare memory onsite for Filippo? I think it's for prometheus100[5,6] and prometheus200[5,6]. (cc @RobH in case we have to order them)

Jan 9 2024, 5:22 PM · SRE, ops-codfw, ops-eqiad, Observability-Metrics

Dec 15 2023

wiki_willy updated subscribers of T353503: ps1-e8-eqiad down.

@Jclark-ctr or @VRiley-WMF - can one of you take a look at this one?

Dec 15 2023, 9:00 PM · SRE, ops-eqiad

Dec 7 2023

wiki_willy updated subscribers of T353020: Degraded RAID on db1168.

Definitely. @Jclark-ctr & @VRiley-WMF - can you check if we have any spare drives from a decommissioned host? (If not, we'll purchase one via @RobH.) Thanks, Willy

Dec 7 2023, 8:50 PM · DBA, SRE, ops-eqiad
wiki_willy updated subscribers of T351891: Abstract a bit more the server provisioning process.
Dec 7 2023, 6:05 PM · Infrastructure-Foundations, SRE-tools

Dec 1 2023

wiki_willy closed Unknown Object (Task), a subtask of T329219: Main Tracking Task for ESAMS Migration to KNAMS, as Resolved.
Dec 1 2023, 10:04 PM · Patch-For-Review, SRE, ops-esams, DC-Ops

Nov 29 2023

wiki_willy updated subscribers of T352238: Degraded RAID on db1199.

@Jclark-ctr & @VRiley-WMF - can one of you two work on getting the drive RMA'd for this one? Thanks, Willy

Nov 29 2023, 8:36 AM · DBA, SRE, ops-eqiad

Nov 23 2023

wiki_willy closed Unknown Object (Task), a subtask of T346722: Sao Paulo, Brazil, South America POP tracking task, as Resolved.
Nov 23 2023, 12:06 AM · ops-magru

Nov 22 2023

wiki_willy assigned T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting to Jclark-ctr.
Nov 22 2023, 8:09 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Nov 10 2023

wiki_willy added a comment to T350885: Project future physical host usage for Search Platform-owned services.

Thanks for working on this @bking. I'm mainly looking to see how much future growth you're looking at (a rough estimate is fine), if you have any requests for the type of servers we provide (ie: ARM, GPU, etc), or just have any feedback for us in general. We're getting pretty full at codfw, so when we purchase additional data center space, we want to ensure we're adding enough capacity for everyone's future needs over the next 3-5yrs. Thanks, Willy

Nov 10 2023, 1:24 AM · Data-Platform-SRE

Oct 30 2023

wiki_willy added a comment to T349756: Audit of WMCS Servers Using Single & Dual Switchports.

Awesome, thanks for working on this @VRiley-WMF. @nskaggs & @cmooney - since we have some discrepancies with the number of ports being used on these cloudvirts, should we come up with a plan/process to help us free up the second switchport on them? This will help us reclaim some switchports for new installs and server migrations. Thanks, Willy

Oct 30 2023, 7:48 PM · SRE, ops-eqiad, DC-Ops

Oct 25 2023

wiki_willy created T349756: Audit of WMCS Servers Using Single & Dual Switchports.
Oct 25 2023, 7:58 PM · SRE, ops-eqiad, DC-Ops

Oct 17 2023

wiki_willy updated subscribers of T308339: eqiad: move non WMCS servers out of rack C8.

@Jclark-ctr or @VRiley-WMF - can one of you follow up on Ben's question above on an-tool1010, along with Alex's comment on deploy1102? Thanks, Willy

Oct 17 2023, 9:02 PM · SRE, DBA, ops-eqiad

Oct 3 2023

wiki_willy updated subscribers of T306007: Avoid ghost hosts on the network.

@Papaul , who's going to dig around a bit and provide some feedback

Oct 3 2023, 9:49 PM · SRE, Infrastructure-Foundations, netbox, netops, DC-Ops

Aug 30 2023

wiki_willy assigned T344597: Decommission thumbor200[34] to Jhancock.wm.
Aug 30 2023, 5:12 PM · SRE, serviceops, ops-codfw

Aug 11 2023

wiki_willy updated subscribers of T344076: Increase VM size for wikitech-static.

Hi @Andrew - I don't have Rackspace under my budget. I think that one falls under the SRE budget, so you may want to reach out to @mark on that one.

Aug 11 2023, 9:45 PM · Sustainability (Incident Followup), cloud-services-team

Aug 2 2023

wiki_willy updated subscribers of T343254: codfw: es2025 lost System Board Fan6.

It's not on the refresh list for this fiscal year; looks like it'll be due for a refresh in FY24-25. If the firmware upgrade on the iDrac doesn't work, we can try sourcing the fan if you want. (cc @RobH)

Aug 2 2023, 4:01 PM · SRE, ops-codfw, DBA

Jul 31 2023

wiki_willy updated the task description for T329219: Main Tracking Task for ESAMS Migration to KNAMS.
Jul 31 2023, 10:09 PM · Patch-For-Review, SRE, ops-esams, DC-Ops