
Upgrade the Hadoop masters to Debian Buster
Closed, ResolvedPublic

Description

an-master100[1,2] need to be upgraded to Buster, one at a time, preserving data in the partitions.

There are a couple of tasks that we might want to do beforehand.

Then we should create the partman recipe to preserve the partitions/volumes that we care about, and proceed with the reimage!

Event Timeline

@razzi in T265126 I reshaped the an-master100* partitions; they now look like this:

elukey@an-master1001:~$ sudo lsblk -i
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk  
|-sda1                         8:1    0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sda2                         8:2    0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sda3                         8:3    0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk  
|-sdb1                         8:17   0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sdb2                         8:18   0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sdb3                         8:19   0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1001--vg-srv 253:0    0   176G  0 lvm   /srv

elukey@an-master1002:~$ sudo lsblk -i
NAME                         MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                            8:0    0 223.6G  0 disk  
|-sda1                         8:1    0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sda2                         8:2    0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sda3                         8:3    0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv
sdb                            8:16   0 223.6G  0 disk  
|-sdb1                         8:17   0  46.6G  0 part  
| `-md0                        9:0    0  46.5G  0 raid1 /
|-sdb2                         8:18   0   954M  0 part  
| `-md1                        9:1    0 953.4M  0 raid1 [SWAP]
`-sdb3                         8:19   0 176.1G  0 part  
  `-md2                        9:2    0   176G  0 raid1 
    `-an--master1002--vg-srv 253:0    0   176G  0 lvm   /srv

We can come up with a reuse partition recipe and then schedule the Buster upgrades for the master nodes. High level things to do:

  1. Come up with a reuse recipe config for partman (you can use as examples the other recipes that we have in puppet for analytics, and the work that you are doing for flerovium/furud)
  2. Think about an upgrade procedure, keeping in mind what the risks are and what can be done ahead of time to avoid problems as much as possible. What is the worst-case scenario? How can we avoid it? Etc.. :)

I am going to help as much as possible but I'll let you drive this! If possible let's do it asap; the upgrade to Buster should take priority over other things.

Change 682785 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] netboot: Add reuse recipe to preserve /srv on an-master

https://gerrit.wikimedia.org/r/682785

Alright, here's my plan @elukey, perhaps we can discuss this next week and if it looks good we can plan the maintenance.

Prep

Backup /srv/hadoop/name from an-master1001 (11G), could be to my home directory on a statbox
Check for any hadoop-related alarms
Before doing anything, let team know we're ready to begin and confirm with Luca / Andrew

Test failover and back

(I think it's worth testing that an-master1002 can be active while an-master1001 is still running)

Check that 1001 is active and 1002 is standby
Do failover on an-master1001:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager

Check that 1002 became active: sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

Check metrics: https://grafana.wikimedia.org/d/000000585/hadoop:

  • HDFS Namenode
  • Yarn Resource Manager

On an-master1002:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager

The reimage itself

Disable puppet on an-master1001 and an-master1002
Merge puppet patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682785/
Run puppet on install1003 to ensure this change is picked up
Failover hdfs and yarn to an-master1002:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager
  • systemctl stop hadoop-hdfs-zkfc
  • systemctl stop hadoop-mapreduce-historyserver

Check that an-master1002 is active as expected, wait a moment, check with team to make sure everything looks healthy.

Start reimage on cumin1001: sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet

Since this has reuse-partitions-test, we will have to connect to the console and confirm that the partitions look good (potentially destructive step; check with Luca before proceeding)

Once the machine comes up, confirm that the proper OS version is installed, hadoop services are running, the /srv partition has data, and the node is in standby state. Since the machine was down, the hadoop namenode service will need to catch up. This should show in HDFS under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=41&orgId=1
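
A rough shell checklist for those post-reimage checks (a sketch only, assuming the usual access on the host; service names as used elsewhere in this task):

cat /etc/debian_version    # expect a 10.x (Buster) release
systemctl status hadoop-hdfs-namenode hadoop-yarn-resourcemanager
df -h /srv                 # the old data should still be mounted here
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet    # expect "standby"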

Once everything looks good, manually failover 1002 -> 1001 (do this without stopping hdfs, so that if necessary things can switch back): sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

Repeat reimage on an-master1002

Stop hadoop daemons on 1002, reimage, confirm that 1002 comes back as healthy standby, done!

Risks and mitigations

Risk: an-master1002 does not work as active

Mitigation: do a test failover to an-master1002 and ensure everything is working before reimaging an-master1001, so that we can switch back if necessary

Risk: active fails while standby is down

Mitigation: Backup /srv/hadoop/name? Since hdfs is constantly written to, this would get out of date, but would be better than losing all data. We could set up another standby, an-master1003, perhaps temporarily as a virtual machine. Realistically this is a low-risk scenario, but worth considering as this would be the worst scenario and could lead to data loss

Risk: hadoop doesn't work on latest debian 10

Mitigation: an-test-master is already running on debian 10, so we have some confidence this will not happen; we can go over steps to reimage back to debian 9.13

Alright, here's my plan @elukey, perhaps we can discuss this next week and if it looks good we can plan the maintenance.

Good first step, I like detailed plans! I am going to add some info/thoughts/etc.. and then I'll let you iterate on the plan (if you think it is needed).

Prep

Backup /srv/hadoop/name from an-master1001 (11G), could be to my home directory on a statbox
Check for any hadoop-related alarms
Before doing anything, let team know we're ready to begin and confirm with Luca / Andrew

I would start from an-master1002: it is already the standby, and in case of a complete wipe of the node we'll be able to just execute an HDFS bootstrapStandby action and we'll be done (so we won't need any HDFS FSImage, etc..).
The /srv/hadoop/name directory represents the current state of the HDFS metadata as seen by the masters, and it is composed of FSImages and in-progress edits (check /srv/hadoop/name/current/ on an-master1001). The main issue with taking a copy of that directory on the fly is that we may not be sure we can fully recover from it in case of need (say a file not properly closed by the Namenode, etc..). In the past we chose the following approach:

  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace

What do the above do? I'll let you follow up :)
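
(Side note on the bootstrapStandby action mentioned above: if the standby's metadata directory were ever lost, the recovery on the wiped node would look roughly like this; a sketch only, exact flags per the Hadoop docs.)

sudo systemctl stop hadoop-hdfs-namenode
sudo -u hdfs kerberos-run-command hdfs hdfs namenode -bootstrapStandby
sudo systemctl start hadoop-hdfs-namenode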

Other suggestion: we may want to think about whether it is worth announcing maintenance and stopping the in-flight jobs. This is also related to the safemode command above :)

About the data copying: let's add more details about how you'd do it; it is good to have all the details written down (as opposed to rushing for a solution during the maintenance).

Test failover and back

(I think it's worth testing that an-master1002 can be active while an-master1001 is still running)

Check that 1001 is active and 1002 is standby
Do failover on an-master1001:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager

Check that 1002 became active: sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

Check metrics: https://grafana.wikimedia.org/d/000000585/hadoop:

  • HDFS Namenode
  • Yarn Resource Manager

On an-master1002:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager

This can be done anytime before the maintenance; in theory we rely on it implicitly, but if you want to test it I'd suggest doing it at least half an hour before the upgrade, or even the day before, just to decouple this action (which is effectively a little invasive and heavy for the Namenodes) from the main maintenance.
Please remember to take at least 10/15 mins between HDFS Namenode restarts, checking metrics (the GC activity of the restarted Namenode spikes after restart and then flattens out again) and logs.
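
A quick way to check both roles between restarts (the same haadmin commands used later in this task):

sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet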

The reimage itself

Disable puppet on an-master1001 and an-master1002
Merge puppet patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/682785/

This can be merged anytime so the next step will not be needed :)

Run puppet on install1003 to ensure this change is picked up

To be super paranoid, always run on install*, just to update all the stacks that we use (even if in this case 1003 should be the right one as you outlined).

Failover hdfs and yarn to an-master1002:

  • systemctl stop hadoop-hdfs-namenode
  • systemctl stop hadoop-yarn-resourcemanager
  • systemctl stop hadoop-hdfs-zkfc
  • systemctl stop hadoop-mapreduce-historyserver

Check that an-master1002 is active as expected, wait a moment, check with team to make sure everything looks healthy.

Start reimage on cumin1001: sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet

Since this has reuse-partitions-test, we will have to connect to the console and confirm that the partitions look good (potentially destructive step; check with Luca before proceeding)

Once the machine comes up, confirm that the proper OS version is installed, hadoop services are running, the /srv partition has data, and the node is in standby state. Since the machine was down, the hadoop namenode service will need to catch up. This should show in HDFS under-replicated blocks: https://grafana.wikimedia.org/d/000000585/hadoop?viewPanel=41&orgId=1

The under-replicated blocks are only a problem when Datanodes are down (since some HDFS blocks disappear); we should be fine for Namenodes (but the other metrics described above should be checked).

Once everything looks good, manually failover 1002 -> 1001 (do this without stopping hdfs, so that if necessary things can switch back): sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet

Repeat reimage on an-master1002

Stop hadoop daemons on 1002, reimage, confirm that 1002 comes back as healthy standby, done!

Risks and mitigations

Risk: an-master1002 does not work as active

Mitigation: do a test failover to an-master1002 and ensure everything is working before reimaging an-master1001, so that we can switch back if necessary

Risk: active fails while standby is down

Mitigation: Backup /srv/hadoop/name? Since hdfs is constantly written to, this would get out of date, but would be better than losing all data. We could set up another standby, an-master1003, perhaps temporarily as a virtual machine. Realistically this is a low-risk scenario, but worth considering as this would be the worst scenario and could lead to data loss

We can revisit this in light of the saveNamespace option above, but it is good to consider this use case. Setting up a third standby may not be possible for Hadoop (or at least, we've never tried it, and I'd be very hesitant to use this option); copying data seems to be a good compromise. If we have a good state (namely an FSImage) we'll be able to bootstrap another master anytime.

To follow up - How is the FSImage connected to the Journalnode edit log? How is it used by a Namenode?

Risk: hadoop doesn't work on latest debian 10

Mitigation: an-test-master is already running on debian 10, so we have some confidence this will not happen; we can go over steps to reimage back to debian 9.13


Almost forgot - the procedure should also include T231067#6863800 :)

Change 682785 merged by Razzi:

[operations/puppet@production] netboot: add reuse-analytics-raid1-2dev.cfg recipe for an-master and an-coord

https://gerrit.wikimedia.org/r/682785
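
(For reference, the netboot entry added by a change like this is roughly of the following shape. This is only an illustrative sketch: the exact file path and syntax are whatever the puppet repo and the merged change actually use.)

an-master100[12]) echo partman/custom/reuse-analytics-raid1-2dev.cfg ;; \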

Ok, here's my new plan, including draining the cluster and using safemode to take a stable fsimage. If this looks good to you @elukey we can pick a day at least a week away so that we can communicate the maintenance. We could do this without doing maintenance but I'd appreciate the safety and the opportunity to learn about safemode.

Updated plan

Announce maintenance:

Online failover to test that an-master1002 is healthy before we do anything destructive

  • Shouldn't need a maintenance window.
    • Check that an-master1001 is active and 1002 is standby:
      • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
      • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
    • Failover to 1002 by running the following on 1001:
      • systemctl stop hadoop-hdfs-namenode
      • systemctl stop hadoop-yarn-resourcemanager
      • systemctl stop hadoop-hdfs-zkfc
      • systemctl stop hadoop-mapreduce-historyserver

Prepare cluster for maintenance (drain cluster, safe mode, snapshot, backup):

  • Disable puppet on an-master1001 and an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*'
    • sudo systemctl stop 'drop-*'
    • sudo systemctl stop 'hdfs-*'
    • sudo systemctl stop 'mediawiki-*'
    • sudo systemctl stop 'refine_*'
    • sudo systemctl stop 'refinery-*'
    • sudo systemctl stop 'reportupdater-*'
  • Stop oozie coordinators
  • Disable queue
    • sudo systemctl stop hadoop-yarn-resourcemanager
  • Wait 30 minutes for applications to gracefully exit
  • Kill remaining yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
  • Enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • Checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • Create snapshot tar
    • sudo su
    • cd /srv/hadoop/namenode
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • Copy snapshot to elsewhere
    • (from my personal computer)
    • scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz

-- READY FOR MAINTENANCE --

Change uids on an-master1002

Reimage an-master1002

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
    • Will have to confirm the partitions look good since we're using reuse-parts-test

Failover to an-master1002

  • First check that an-master1001 is standby:
    • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
  • Stop hadoop services on an-master1001:
    • systemctl stop hadoop-hdfs-namenode
    • systemctl stop hadoop-yarn-resourcemanager
    • systemctl stop hadoop-hdfs-zkfc
    • systemctl stop hadoop-mapreduce-historyserver

Change uids on an-master1001

Reimage an-master1001

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet

-- MAINTENANCE OVER --

Turn off safe mode

  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

Monitor metrics

Delete fsimage snapshot

  • ssh thorium.eqiad.wmnet "rm /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz"

Restart all services on an-launcher1002

  • sudo systemctl start 'camus-*'
  • sudo systemctl start 'drop-*'
  • sudo systemctl start 'hdfs-*'
  • sudo systemctl start 'mediawiki-*'
  • sudo systemctl start 'refine_*'
  • sudo systemctl start 'refinery-*'
  • sudo systemctl start 'reportupdater-*'

Questions:

  • Is there a faster way to back up the snapshot? scp should take about 30 minutes, which is manageable, but this will extend the downtime and make it slower if we have to recover
  • Is thorium a good place to store the snapshot? Thorium has plenty of space and we should be able to delete the snapshot after this maintenance

Ok, here's my new plan, including draining the cluster and using safemode to take a stable fsimage. If this looks good to you @elukey we can pick a day at least a week away so that we can communicate the maintenance. We could do this without doing maintenance but I'd appreciate the safety and the opportunity to learn about safemode.

Nice, great job! I am going to add some suggestions and info inline :)

Updated plan

Announce maintenance:

Online failover to test that an-master1002 is healthy before we do anything destructive

  • Shouldn't need a maintenance window.
    • Check that an-master1001 is active and 1002 is standby:
      • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
      • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet

Nit: better to use kerberos-run-command. I'd also check the status of Yarn (commands in the Admin page).
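
(For the Yarn side, the equivalent check would be something like the commands below; the RM service IDs are assumed here to mirror the HDFS ones, so double-check them against the Administration page first.)

sudo -u yarn kerberos-run-command yarn yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u yarn kerberos-run-command yarn yarn rmadmin -getServiceState an-master1002-eqiad-wmnet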

  • Failover to 1002 by running the following on 1001:
    • systemctl stop hadoop-hdfs-namenode
    • systemctl stop hadoop-yarn-resourcemanager
    • systemctl stop hadoop-hdfs-zkfc
    • systemctl stop hadoop-mapreduce-historyserver

This might not be right, since IIUC from below you are going to start the reimage from 1002. What is the purpose of the failover?

Prepare cluster for maintenance (drain cluster, safe mode, snapshot, backup):

  • Disable puppet on an-master1001 and an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*'
    • sudo systemctl stop 'drop-*'
    • sudo systemctl stop 'hdfs-*'
    • sudo systemctl stop 'mediawiki-*'
    • sudo systemctl stop 'refine_*'
    • sudo systemctl stop 'refinery-*'
    • sudo systemctl stop 'reportupdater-*'

These commands need the .timer at the end, otherwise you are going to stop the running jobs (and not the timers themselves, so they'll kick off new runs).

In theory this is not needed: if you stop the timers the cluster should drain after a while. There may be jobs from other teams though, so it's worth checking.
The Search team runs Airflow on an-airflow1001, so a complete procedure might also need a puppet disable and an airflow scheduler stop in there (with some sync with the team first).
If we disable the queue as described below this is probably not needed (but the cluster drain for timers is a good step anyway).

  • Disable queue
    • sudo systemctl stop hadoop-yarn-resourcemanager

This command is a little bit brutal, what we could do is something like:

  • check profile::analytics::cluster::hadoop::yarn_capacity_scheduler and add something like 'yarn.scheduler.capacity.root.default.state' => 'STOPPED'
  • send puppet patch and merge it (but at this point we are with puppet disabled, so you either add it manually or you merge it beforehand).
  • execute sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues

The above will instruct the Yarn RMs to not accept any new job.
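
After the refreshQueues, the queue state can be sanity-checked with something like the following (assuming the queue of interest is named default):

sudo -u yarn kerberos-run-command yarn yarn queue -status default    # State should report STOPPED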

  • Wait 30 minutes for applications to gracefully exit
  • Kill remaining yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done

This may not be needed, to be evaluated when doing maintenance. Doing a broad kill might be too much. Better to ask Joseph or anybody helping you with maintenance before pulling the trigger.

  • Enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • Checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • Create snapshot tar
    • sudo su
    • cd /srv/hadoop/namenode
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • Copy snapshot to elsewhere
    • (from my personal computer)
    • scp -3 an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz thorium.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz

From cumin1001 you can use transfer.py: https://wikitech.wikimedia.org/wiki/Transfer.py
For example: sudo transfer.py an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-xxx thorium.eqiad.wmnet:/home/razzi/

I suggest to test a transfer for some file to get familiar with the tool first (it is very useful!).
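
For example, a throwaway test run could look like this (hypothetical scratch file, same syntax as above):

sudo transfer.py an-master1001.eqiad.wmnet:/home/razzi/transfer-test.txt thorium.eqiad.wmnet:/home/razzi/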

-- READY FOR MAINTENANCE --

elukey@an-master1002:~$ sudo systemctl list-timers | grep hadoop
Thu 2021-05-13 00:00:00 UTC  15h left      Wed 2021-05-12 00:00:03 UTC  8h ago       hadoop-namenode-backup-fetchimage.timer         hadoop-namenode-backup-fetchimage.service
Thu 2021-05-13 01:00:00 UTC  16h left      Wed 2021-05-12 01:00:05 UTC  7h ago       hadoop-namenode-backup-prune.timer              hadoop-namenode-backup-prune.service

The above timers need to be stopped before the maintenance/reimage, as a precaution.
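
Concretely, that would be something like (timer names from the listing above):

sudo systemctl stop hadoop-namenode-backup-fetchimage.timer
sudo systemctl stop hadoop-namenode-backup-prune.timer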

Change uids on an-master1002

I'd add a check to see if any process is running with hdfs/yarn/etc.. (even a simple manual check with ps auxff or similar is fine) just to make sure that nothing is running (otherwise the change uids/gids may fail).
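
For example (the same check used later in the plan):

ps auxf | egrep 'hdfs|yarn|hadoop'    # should come up empty before changing uids/gids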

Reimage an-master1002

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
    • Will have to confirm the partitions look good since we're using reuse-parts-test

At this point I would probably think about checking that all daemons are ok, logs are fine, metrics, etc.. It is ok to do the maintenance in two separate days, to leave some time for any unexpected issue to come up. We could also test a failover from 1001 (still not reimaged) to 1002 and leave it running for a few hours monitoring metrics. I know that on paper we don't expect any issue from previous tests, but this is production and there may be some corner cases that we were not able to test before.

So to summarize - at this point I'd check the status of all the services and just re-enable timers/jobs/etc.. Then after a bit I'd fail over to 1002 and test for a few hours that everything works as expected (heap pressure, logs, etc..)
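
A graceful way to do that failover and verify it, without stopping any daemon (the same haadmin commands used elsewhere in this task):

sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs hdfs haadmin -getServiceState an-master1002-eqiad-wmnet    # expect "active"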

Failover to an-master1002

  • First check that an-master1001 is standby:
    • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet

In theory 1001 should be active here right?

  • Stop hadoop services on an-master1001:
    • systemctl stop hadoop-hdfs-namenode
    • systemctl stop hadoop-yarn-resourcemanager
    • systemctl stop hadoop-hdfs-zkfc
    • systemctl stop hadoop-mapreduce-historyserver

Change uids on an-master1001

Reimage an-master1001

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet

-- MAINTENANCE OVER --

Turn off safe mode

  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

Monitor metrics

Delete fsimage snapshot

  • ssh thorium.eqiad.wmnet "rm /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(gdate --iso-8601).tar.gz"

Restart all services on an-launcher1002

  • sudo systemctl start 'camus-*'
  • sudo systemctl start 'drop-*'
  • sudo systemctl start 'hdfs-*'
  • sudo systemctl start 'mediawiki-*'
  • sudo systemctl start 'refine_*'
  • sudo systemctl start 'refinery-*'
  • sudo systemctl start 'reportupdater-*'

Running puppet will be enough; everything will be restored to its previous state.

Questions:

  • Is there a faster way to back up the snapshot? scp should take about 30 minutes, which is manageable, but this will extend the downtime and make it slower if we have to recover

See above notes about transfer.py :)

  • Is thorium a good place to store the snapshot? Thorium has plenty of space and we should be able to delete the snapshot after this maintenance

In theory the namenode's backup should be around a few GB, so any stat100x would be ok in my opinion.

The plan looks great @razzi, and the comments as well!
My nits on some small things follow.

In theory this is not needed: if you stop the timers the cluster should drain after a while. There may be jobs from other teams though, so it's worth checking.
The Search team runs Airflow on an-airflow1001, so a complete procedure might also need a puppet disable and an airflow scheduler stop in there (with some sync with the team first).
If we disable the queue as described below this is probably not needed (but the cluster drain for timers is a good step anyway).

I support not touching oozie jobs - pause/resume sometimes leads to jobs being stuck, and I'd rather have them waiting than being paused :)

  • Disable queue
    • sudo systemctl stop hadoop-yarn-resourcemanager

This command is a little bit brutal, what we could do is something like:

  • check profile::analytics::cluster::hadoop::yarn_capacity_scheduler and add something like 'yarn.scheduler.capacity.root.default.state' => 'STOPPED'
  • send puppet patch and merge it (but at this point we are with puppet disabled, so you either add it manually or you merge it beforehand).
  • execute sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues

The above will instruct the Yarn RMs to not accept any new job.

I like the idea of using refreshQueues instead of stopping the resource-manager.

  • Wait 30 minutes for applications to gracefully exit
  • Kill remaining yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done

This may not be needed, to be evaluated when doing maintenance. Doing a broad kill might be too much. Better to ask Joseph or anybody helping you with maintenance before pulling the trigger.

I can indeed take care of that aspect during the procedure :) The command to kill all jobs looks correct nonetheless :)

Change 692465 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: temporarily stop allowing jobs to be submitted to yarn

https://gerrit.wikimedia.org/r/692465

Thanks for the reviews @elukey and @JAllemandou!

Based on your comments, my plan is to reimage an-master1002 on Tuesday May 25 before standup (14:30-16:00 UTC, might take longer, but the active effort should be wrapped up within an hour). I realize there is still a bit of synchronizing to do with the search team but I believe we'll be able to work it out before then.

I'd also like to do a test failover to an-master1002 before doing anything to the cluster. This time I marked it with "Plan to test failover and back" for clarity.

If everything looks good, we can failover to an-master1002 and continue to monitor things for a while. If anything is wrong we'll switch back to an-master1001, but if everything works, we'll disable safe mode and be free to do the reimage of an-master1001 whenever, probably over the next few days.


Responding to inline comments:

Nit: better to use kerberos-run-command. I'd also check the status of Yarn (commands in the Admin page).

Good call

This might not be right, since IIUC from below you are going to start the reimage from 1002. What is the purpose of the failover?

This is a failover to test that an-master1002 is capable of being promoted, before we start doing anything destructive to either node. In theory this can be done anytime over the next week.

These commands need the .timer at the end, otherwise you are going to stop the running jobs (and not the timers themselves, so they'll kick off new runs).

Good point!

  • Stop oozie coordinators [...]

In theory this is not needed, if you stop the timers the cluster should drain after a while. There may be jobs from other teams though, worth to check.

Cool, I'll leave out oozie jobs from this part.

The Search team runs Airflow on an-airflow1001, so a complete procedure might also need a puppet disable airflow scheuduler stop in there (with some sync with the team first).

Alright, I've reached out to the search team, and will ask them what needs to be disabled on their end.

Disable queue

  • sudo systemctl stop hadoop-yarn-resourcemanager

This command is a little bit brutal, what we could do is something like:

  • check profile::analytics::cluster::hadoop::yarn_capacity_scheduler and add something like 'yarn.scheduler.capacity.root.default.state' => 'STOPPED'
  • send puppet patch and merge it (but at this point we are with puppet disabled, so you either add it manually or you merge it beforehand).
  • execute sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues

Alright, I made a puppet patch to change the root queue to STOPPED: https://gerrit.wikimedia.org/r/c/operations/puppet/+/692465

Nice that yarn provides a way to reload this kind of xml setting. Will add this to the plan.

Wait 30 minutes for applications to gracefully exit

  • Kill remaining yarn applications [...]

This may not be needed, to be evaluated when doing maintenance. Doing a broad kill might be too much. Better to ask Joseph or anybody helping you with maintenance before pulling the trigger.

Alright, I'll leave this to Joseph :)

From cumin1001 you can use transfer.py

Cool, that looks good, will use that. Tested performance and a ~10G fsimage will still take ~40 minutes (in my test, 500M transferred in ~2 minutes).

elukey@an-master1002:~$ sudo systemctl list-timers | grep hadoop
Thu 2021-05-13 00:00:00 UTC  15h left      Wed 2021-05-12 00:00:03 UTC  8h ago       hadoop-namenode-backup-fetchimage.timer         hadoop-namenode-backup-fetchimage.service
Thu 2021-05-13 01:00:00 UTC  16h left      Wed 2021-05-12 01:00:05 UTC  7h ago       hadoop-namenode-backup-prune.timer              hadoop-namenode-backup-prune.service

The above timers need to be stopped before the maintenance/reimage, as a precaution.

Alright, I guess it'll prevent any incomplete file actions during maintenance.

Change uids on an-master1002

I'd add a check to see if any process is running with hdfs/yarn/etc.. (even a simple manual check with ps auxff or similar is fine) just to make sure that nothing is running (otherwise the change uids/gids may fail).

Sounds good, will add this to the plan.

Reimage an-master1002

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
    • Will have to confirm the partitions look good since we're using reuse-parts-test

At this point I would probably think about checking that all daemons are ok, logs are fine, metrics, etc.. It is ok to do the maintenance in two separate days, to leave some time for any unexpected issue to come up. We could also test a failover from 1001 (still not reimaged) to 1002 and leave it running for a few hours monitoring metrics. I know that on paper we don't expect any issue from previous tests, but this is production and there may be some corner cases that we were not able to test before.

So to summarize - at this point I'd check the status of all the services and just re-enable timers/jobs/etc.. Then after a bit I'd fail over to 1002 and test for a few hours that everything works as expected (heap pressure, logs, etc..)

Yeah, I think it's a good idea to pause here and monitor for issues. I'll update the plan to only reimage an-master1002 at first.

Failover to an-master1002

  • First check that an-master1001 is standby:
    • sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet

In theory 1001 should be active here right?

Yep, good catch :)

Restart all services on an-launcher1002

  • sudo systemctl start 'camus-*'

[...]

Running puppet will be enough; everything will be restored to its previous state.

Cool, that's handy. I guess we'll have to disable puppet on an-launcher1002 at the start as well, I'll add this.


Plan to failover and back

  • Check that an-master1001 is active and 1002 is standby:
    • sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1001-eqiad-wmnet
    • sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet
  • Failover to 1002 by running the following on 1001:
    • systemctl stop hadoop-hdfs-namenode
    • systemctl stop hadoop-yarn-resourcemanager
    • systemctl stop hadoop-hdfs-zkfc
    • systemctl stop hadoop-mapreduce-historyserver

Updated reimaging plan (just an-master1002 for starters)

Prepare cluster for maintenance (drain cluster, safe mode, snapshot, backup):

  • Disable puppet on an-launcher1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*.timer'
    • sudo systemctl stop 'drop-*.timer'
    • sudo systemctl stop 'hdfs-*.timer'
    • sudo systemctl stop 'mediawiki-*.timer'
    • sudo systemctl stop 'refine_*.timer'
    • sudo systemctl stop 'refinery-*.timer'
    • sudo systemctl stop 'reportupdater-*.timer'
  • Wait 30 minutes for applications to gracefully exit
  • Merge puppet patch to disable yarn queue
  • Disable puppet on an-master1001 and an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Kill remaining yarn applications
    • Let @joal do this part :)
  • Enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • Checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • Create snapshot tar
    • sudo su
    • cd /srv/hadoop/namenode
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • Copy snapshot to stat1004
    • (from cumin1001)
    • sudo transfer.py an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage (I have created this directory)

-- READY FOR MAINTENANCE --

Check that nothing that would cause the uid script to be unsuccessful is running:

  • ps auxf | egrep 'hdfs|yarn|hadoop' should come up empty

Change uids on an-master1002

Reimage an-master1002

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet
    • Will have to confirm the partitions look good since we're using reuse-parts-test

-- MAINTENANCE OVER --

Monitor metrics (heap pressure)

Check hadoop logs for any issues

  • /var/log/hadoop-hdfs/
  • /var/log/hadoop-yarn/

Turn off safe mode

  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

Delete fsimage snapshot

  • ssh stat1004.eqiad.wmnet "rm -r /home/razzi/hdfs-namenode-fsimage"

Re-enable puppet on an-masters and an-launcher

  • sudo puppet agent --enable

Create patch to re-enable yarn queue (opposite of https://gerrit.wikimedia.org/r/c/operations/puppet/+/692465)

PS: Wow this comment thread is getting huge! I almost feel like it's worth putting the content in version control :P

PS: Wow this comment thread is getting huge!

Put on wiki! Or etherpad! Or edit the task description? :)

One nit, otherwise LGTM: I would stop the timers first, let the cluster drain, and finally apply the Yarn patch + refresh queues, since IIRC that queue setting prevents jobs from running (so any in-flight job may be affected).

It should be ok to merge the yarn patch first; from the hadoop docs:

yarn.scheduler.capacity.<queue-path>.state
The state of the queue. Can be one of RUNNING or STOPPED. If a queue is in STOPPED state, new applications cannot be submitted to itself or any of its child queues. Thus, if the root queue is STOPPED no applications can be submitted to the entire cluster. Existing applications continue to completion, thus the queue can be drained gracefully. Value is specified as Enumeration.

Hmm actually it may need to be yarn.scheduler.capacity.root.state (remove default from yarn.scheduler.capacity.root.default.state)

The main problem in doing the yarn patch first is that, IIUC, newly submitted applications will fail since they will not be accepted by Yarn, causing failures (for example, timers may raise alerts, oozie jobs can fail and need a re-run, etc..). My suggestion is to drain first as we have always done, then apply the patch to prevent any new app that we don't directly control from running.

For the queue name, you have multiple options: if you use root, then no queue will accept jobs; if you use root.default, only the default queue will stop (generally the one where all users send their interactive jobs).

  • Failover to 1002 by running the following on 1001:
    • systemctl stop hadoop-hdfs-namenode
    • systemctl stop hadoop-yarn-resourcemanager
    • systemctl stop hadoop-hdfs-zkfc
    • systemctl stop hadoop-mapreduce-historyserver

This one needs a little bit of follow up. Multiple daemons are mentioned and not all of them need a restart:

  • The hdfs zkfc is responsible for probing the "local" namenode and reporting back to Zookeeper, holding a znode for the active. They are there to support the Namenodes, so we shouldn't restart them.
  • There is a failover option for HDFS in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration that is less invasive than restarting if you want to just check the failover (since the Namenode doesn't need to re-read the whole fsimage containing the 50M inodes etc..).
  • The Yarn RMs manage their sessions with zookeeper and one can restart them anytime, no big deal.
  • The history server runs only on 1001; it doesn't support failover, so restarting it has no "testing" effect basically.

For the HDFS Namenodes: please remember to watch GC timings when restarting the Namenodes, since it takes a bit (minutes) before their bootstrap completes (so it is not great to restart one and then the other one in a close time window).

I was able to fail over using sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet; everything seemed to work ok. I restarted hadoop-hdfs-namenode on an-master1001, waited a few minutes, and failed back over. Everything seems to be working there.

I updated the plan to stop the timers on an-launcher before disabling the yarn queue; should be all set for tomorrow!

Change 692465 merged by Razzi:

[operations/puppet@production] yarn: temporarily stop allowing jobs to be submitted to yarn

https://gerrit.wikimedia.org/r/692465

Change 694588 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: enable submitting jobs to queue

https://gerrit.wikimedia.org/r/694588

Change 694588 merged by Razzi:

[operations/puppet@production] yarn: enable submitting jobs to queue

https://gerrit.wikimedia.org/r/694588

Change 694597 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: set queue state to RUNNING

https://gerrit.wikimedia.org/r/694597

Change 694597 merged by Razzi:

[operations/puppet@production] yarn: set queue state to RUNNING

https://gerrit.wikimedia.org/r/694597

Ok, so this didn't go as planned, but there were no lasting issues or data loss. The full logs of the day are here; the relevant part is from 14:37 to 18:04.

Quick summary: saving a new checkpoint caused the active namenode to stop and failover to the standby. After trying this twice, we decided to cancel the upgrade and once an-master1001 became active once again we re-enabled submitting to the yarn queues. I'll comment later with a longer summary of what happened here, and next steps. The checkpoint failure has its own issue: https://phabricator.wikimedia.org/T283733

As @Ottomata commented in https://phabricator.wikimedia.org/T283733#7121008, we're going to try putting the cluster in safe mode again and taking a snapshot to see if the new heap/gc settings make snapshotting work. If the snapshot works, we can proceed with the original plan, reimaging both nodes. If the snapshotting doesn't work, we can get the cluster back to fully operational, then do the upgrade without draining the cluster / safe mode at a later time.

I'm thinking next Thursday Jun 10 before standup would be a good time to drain the cluster, test snapshotting, and potentially upgrade. If that sounds good to @elukey @JAllemandou @Ottomata I'll announce it tomorrow.

Other notes:

  • There were additional timers to stop on an-launcher: drop_event, eventlogging_*.timer, monitor_*.timer
  • I'm glad we tested the failover and were in safe mode, since the snapshotting failure caused the active to fail over
  • When re-enabling the yarn queue at the very end, I ran into 3 minor issues:
    • I left out the step to sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues so merging the puppet patch did nothing by itself
    • Just removing the 'yarn.scheduler.capacity.root.state' => 'STOPPED', line didn't do anything, this makes sense because without any setting rendered in the xml it will stay in its current state (stopped)
    • Adding 'yarn.scheduler.capacity.root.state' => 'RUNNING' didn't start the various child queues; they appear to need to be re-enabled one at a time. I disabled puppet and edited the xml settings directly as below because we were already over our announced maintenance window (I gave us 90 minutes last time, this time I'll announce for 180)

sudo -e /etc/hadoop/conf/capacity-scheduler.xml

<property>
  <name>yarn.scheduler.capacity.root.fifo.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.production.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.essential.state</name>
  <value>RUNNING</value>
</property>

@Ottomata had the idea to make a puppet variable that when set will add RUNNING / STOPPED for all the queues, so we can just toggle a boolean rather than commenting / uncommenting many lines. I'll create a puppet patch to do this before our next attempt.

As @Ottomata commented in https://phabricator.wikimedia.org/T283733#7121008, we're going to try putting the cluster in safe mode again and taking a snapshot to see if the new heap/gc settings make snapshotting work. If the snapshot works, we can proceed with the original plan, reimaging both nodes. If the snapshotting doesn't work, we can get the cluster back to fully operational, then do the upgrade without draining the cluster / safe mode at a later time.

I'm thinking next Thursday Jun 10 before standup would be a good time to drain the cluster, test snapshotting, and potentially upgrade. If that sounds good to @elukey @JAllemandou @Ottomata I'll announce it tomorrow.

I think that we should invest more time in understanding why saveNamespace doesn't work in the subtask, since it is a critical functionality, and set aside the upgrade until we have a solid idea about how to solve the problem. As indicated in the task, I am not sure if the new settings will solve the problem, most probably we'll need to apply other patches like:

  • increase the service handler threads
  • increase the zkfc failover timeout (from 45s to something else?)

There may also be an upstream bug (already fixed/reported or not) to work on, or other tests to do. I am ok in setting up maintenance to attempt a saveNamespace, but I don't think we should upgrade until we have fixed this problem. Changing the OS will likely not have any consequence, but it is another variable that we should keep in mind when debugging, so I'd avoid to change the current status before reaching some solid result. There is really no rush to upgrade to Buster if we have a problem to solve, SRE will totally understand it.

Other notes:

  • There were additional timers to stop on an-launcher: drop_event, eventlogging_*.timer, monitor_*.timer
  • I'm glad we tested the failover and were in safe mode, since the snapshotting failure caused the active to fail over
  • When re-enabling the yarn queue at the very end, I ran into 3 minor issues:
    • I left out the step to sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues so merging the puppet patch did nothing by itself
    • Just removing the 'yarn.scheduler.capacity.root.state' => 'STOPPED', line didn't do anything, this makes sense because without any setting rendered in the xml it will stay in its current state (stopped)
    • Adding 'yarn.scheduler.capacity.root.state' => 'RUNNING' didn't start the various child queues; they appear to need to be re-enabled one at a time. I disabled puppet and edited the xml settings directly as below because we were already over our announced maintenance window (I gave us 90 minutes last time, this time I'll announce for 180)

sudo -e /etc/hadoop/conf/capacity-scheduler.xml

<property>
  <name>yarn.scheduler.capacity.root.fifo.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.production.state</name>
  <value>RUNNING</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.essential.state</name>
  <value>RUNNING</value>
</property>

@Ottomata had the idea to make a puppet variable that when set will add RUNNING / STOPPED for all the queues, so we can just toggle a boolean rather than commenting / uncommenting many lines. I'll create a puppet patch to do this before our next attempt.

Seems good to me, but it may limit our capability to stop/target some specific queues (say if we wanted to stop only the default one and not the others). I'd personally just add the RUNNING states that you pointed out above (nice catch), and flip those values when needed (we'll have to use a puppet change anyway to change the queue statuses). Anyway, I am ok with any idea, no blockers, just remember to document this in https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration so that we are aware of the procedures :)

I am ok in setting up maintenance to attempt a saveNamespace, but I don't think we should upgrade until we have fixed this problem.

I think the intention of the downtime Razzi wants to schedule is to solve the saveNamespace problem. If there is more work to be done to figure out the problem now, then we should do that work before we schedule the downtime. But, it seemed impossible to know if any fixes we might attempt actually work without trying it.

If the fixes work, then the upgrade can proceed, if they don't, the entire downtime would be used to debug the saveNamespace problem.

@elukey, my understanding of the previous upgrade plan with downtime saveNamespace was that it was more for practice than absolutely necessary. We should be able to do an online upgrade, right? Or, are you just saying that we should solve the saveNamespace problem before we attempt any upgrade at all, for safety reasons?

I am ok in setting up maintenance to attempt a saveNamespace, but I don't think we should upgrade until we have fixed this problem.

I think the intention of the downtime Razzi wants to schedule is to solve the saveNamespace problem. If there is more work to be done to figure out the problem now, then we should do that work before we schedule the downtime. But, it seemed impossible to know if any fixes we might attempt actually work without trying it.

Makes sense; what I wanted to say is that the heap bump was one thing that I did that may help, but there are other things that we should consider (all written down in the task) that will require some deep diving for sure.

If the fixes work, then the upgrade can proceed, if they don't, the entire downtime would be used to debug the saveNamespace problem.

@elukey, my understanding of the previous upgrade plan with downtime saveNamespace was that it was more for practice than absolutely necessary. We should be able to do an online upgrade, right? Or, are you just saying that we should solve the saveNamespace problem before we attempt any upgrade at all, for safety reasons?

The saveNamespace action is not strictly necessary, but it is the only way to have a clean fsimage right after entering safe mode (to avoid fsimage + edit log, etc..). If we reimage 1002 and while at it 1001 goes down (maybe badly), we may risk something (say the 1002 reimage doesn't go well, we lose the fsimages under /srv, etc..). We back up fsimages too, but not up to the safe-mode-enter step, so we cannot say that we wouldn't risk losing any data. I know that these are very unfortunate use cases, but better safe than sorry :D

We can think about reimaging 1002 while copying all the files in the namenode's directory to another host; that would work too, but it feels cleaner/safer/more consistent after saveNamespace.

Moreover, the fact that saveNamespace doesn't work smells as if something is not set up correctly, or that we are seeing a bug, since it is such a fundamental step (and hourly checkpointing seems to work, very weird). This is why I am suggesting to stop changing things until this basic operation works again, just to avoid changing other variables (even if we are almost sure that changing the OS will not affect the Namenode).

I am suggesting to stop changing things until this basic operation works again, just to avoid changing other variables

Ok, makes sense

The saveNamespace action is not strictly necessary, but it is the only way to have a clean fsimage right after the safe mode (to avoid fsimage edit log etc..)

But is even entering safe mode necessary? We can just reimage the standby node while the active continues to serve live traffic. I believe this should be possible even without stopping jobs. Yes, if we lose the active namenode while the standby is being reimaged, we'd lose some hdfs edits, but isn't one of the reasons for having a hot failover namenode to be able to do things like this?

Not suggesting that we do ^ (resolving saveNamespace problem first makes sense), just making sure I understand.

Yes yes, drain / safe mode / saveNamespace are totally optional; we could simply reimage an-master1002 (preserving /srv) and restart it right after the reimage (it will restart like nothing happened, in theory).

Ok, the new plan as we discussed at ops sync is to try the upgrade again next week - I'm picking Tuesday June 15. We'll see if the new memory/thread settings fix the snapshot issue, and if they do, we'll proceed with the upgrade. If not, we'll get the cluster back up and running and exit safe mode.

Mentioned in SAL (#wikimedia-analytics) [2021-06-15T14:35:32Z] <razzi> disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641

Change 699943 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: temporarily stop queues

https://gerrit.wikimedia.org/r/699943

Change 699943 merged by Razzi:

[operations/puppet@production] yarn: temporarily stop queues

https://gerrit.wikimedia.org/r/699943

Icinga downtime set by razzi@cumin1001 for 60 days, 0:00:00 1 host(s) and their services with reason: Update operating system to bullseye

an-master1002.eqiad.wmnet

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

an-master1002.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202106151655_razzi_16131_an-master1002_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-analytics) [2021-06-15T16:55:24Z] <razzi> sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet

Completed auto-reimage of hosts:

['an-master1002.eqiad.wmnet']

and were ALL successful.

Change 699955 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: re-enable queues

https://gerrit.wikimedia.org/r/699955

Change 699955 merged by Razzi:

[operations/puppet@production] yarn: re-enable queues

https://gerrit.wikimedia.org/r/699955

savenamespace worked 🎉 and an-master1002 is running Buster!

Next steps:

  • later this week, failover to 1002 and make sure it can operate as active for a few days
  • schedule downtime to reimage 1001 in a week or so and repeat

As a follow up we should open a task to aggregate alarms related to hadoop workers, since as we have seen yesterday, if the masters are unreachable (even for maintenance) we get a shower of alerts that is not helpful in debugging.

Didn't end up doing the failover last week since it was all hands; I think this can be done whenever. @elukey how do you feel about me doing the failover in the next couple of days and leaving it failed over for a couple of days?

an-master1002 is active; assuming nothing goes wrong, we'll keep it active for a couple days so we're confident it's safe to reimage 1001, then failover back and make a plan for the final reimage

Reimaging plan for an-master1001

Prepare cluster for maintenance (drain cluster, safe mode, snapshot, backup):

  • Disable puppet on an-launcher1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Disable jobs on an-launcher1002
    • sudo systemctl stop 'camus-*.timer'
    • sudo systemctl stop 'drop-*.timer'
    • sudo systemctl stop 'hdfs-*.timer'
    • sudo systemctl stop 'mediawiki-*.timer'
    • sudo systemctl stop 'refine_*.timer'
    • sudo systemctl stop 'refinery-*.timer'
    • sudo systemctl stop 'reportupdater-*.timer'
    • sudo systemctl stop drop_event
    • sudo systemctl stop 'eventlogging_*.timer'
    • sudo systemctl stop 'monitor_*.timer'
  • Wait 30 minutes for applications to gracefully exit
  • Merge puppet patch to disable yarn queue
  • Disable puppet on an-master1001 and an-master1002
    • sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster'
  • Kill remaining yarn applications
    • for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done
  • Enable safe mode
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter
  • Checkpoint
    • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace
  • Create snapshot tar
    • sudo su
    • cd /srv/hadoop/name
    • tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current
  • Copy snapshot to stat1004
    • (from cumin1001)
    • sudo transfer.py an-master1001.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage (I have created this directory)

-- READY FOR MAINTENANCE --
Downtime an-master1001 on icinga: sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade"

Stop hadoop processes on an-master1001

systemctl stop hadoop-hdfs-namenode
systemctl stop hadoop-yarn-resourcemanager
systemctl stop hadoop-hdfs-zkfc
systemctl stop hadoop-mapreduce-historyserver

Check that nothing that would cause the uid script to be unsuccessful is running:

  • ps auxf | egrep 'hdfs|yarn|hadoop' should come up empty

Check if yarn and hdfs are active on an-master1002

Change uids on an-master1001

Reimage an-master1001

  • sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet
    • Will have to confirm the partitions look good since we're using reuse-parts-test

-- MAINTENANCE OVER --

Monitor metrics (heap pressure)

Check hadoop logs for any issues

  • /var/log/hadoop-hdfs/
  • /var/log/hadoop-yarn/

Turn off safe mode

  • make and merge puppet patch to re-enable yarn
  • apply the change with sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues
  • sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave

Delete fsimage snapshot

  • ssh stat1004.eqiad.wmnet "rm -r /home/razzi/hdfs-namenode-fsimage"

Re-enable puppet on an-masters and an-launcher

  • sudo puppet agent --enable

Happy July! Sqoop is running so maintenance is rescheduled to the week after next.

Change 705698 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: disable accepting jobs to queues

https://gerrit.wikimedia.org/r/705698

Change 705698 merged by Razzi:

[operations/puppet@production] yarn: disable accepting jobs to queues

https://gerrit.wikimedia.org/r/705698

Script wmf-auto-reimage was launched by razzi on cumin1001.eqiad.wmnet for hosts:

an-master1001.eqiad.wmnet

The log can be found in /var/log/wmf-auto-reimage/202107201726_razzi_21849_an-master1001_eqiad_wmnet.log.

Mentioned in SAL (#wikimedia-analytics) [2021-07-20T17:27:06Z] <razzi> razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet

Completed auto-reimage of hosts:

['an-master1001.eqiad.wmnet']

and were ALL successful.

Change 705732 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] yarn: re-enable queues

https://gerrit.wikimedia.org/r/705732

Change 705732 merged by Razzi:

[operations/puppet@production] yarn: re-enable queues

https://gerrit.wikimedia.org/r/705732

Successful second reimage. Both hosts run Buster now, and an-master1001 is back to active!

Follow-up: change partman to remove the -test config; there's no need to manually confirm the partitions every time, since there were no complications.

Change 705782 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] netboot: make an-masters reimage without confirmation

https://gerrit.wikimedia.org/r/705782

@JAllemandou could you explain what happened with safe mode and the yarn rmadmin? Maybe put a small comment here and then we can create a follow-up ticket.

Finally this task is done! Mar - July 2021 🥲

First of all, great work :)

The problem should be the following:

== Yarn view:

2021-07-20 16:36:32,071 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:146)
        at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:894)
        at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:473)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:610)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
        at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:325)
        at org.apache.hadoop.yarn.server.resourcemanager.ActiveStandbyElectorBasedElectorService.becomeActive(ActiveStandbyElectorBasedElectorService.java:144)
        ... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
It was turned on manually. Use "hdfs dfsadmin -safemode leave" to turn safe mode off. NamenodeHostName:an-master1002.eqiad.wmnet

== HDFS view:

elukey@an-master1002:~$ grep -rni label /var/log/hadoop-hdfs/hadoop-hdfs-namenode-an-master1002.log
537042:2021-07-20 16:36:31,971 INFO org.apache.hadoop.ipc.Server: IPC Server handler 24 on default port 8020, call Call#0 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 10.64.21.110:38043: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
537380:2021-07-20 16:53:01,474 INFO org.apache.hadoop.ipc.Server: IPC Server handler 40 on default port 8020, call Call#1 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 10.64.21.110:43999: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
537438:2021-07-20 16:54:36,224 INFO org.apache.hadoop.ipc.Server: IPC Server handler 84 on default port 8020, call Call#0 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 10.64.5.26:42471: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
537441:2021-07-20 16:54:45,216 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on default port 8020, call Call#2 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 10.64.21.110:39135: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
537448:2021-07-20 16:54:53,528 INFO org.apache.hadoop.ipc.Server: IPC Server handler 104 on default port 8020, call Call#1 Retry#1 org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 10.64.5.26:40823: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create directory /user/yarn/node-labels. Name node is in safe mode.
537511:2021-07-20 16:55:02,531 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1768217052_694555443, replicas=10.64.21.114:50010, 10.64.5.25:50010, 10.64.5.24:50010 for /user/yarn/node-labels/nodelabel.mirror.writing

When Joseph and I set up the capacity scheduler, we also added support for Yarn node labels (in order to tag GPU nodes). The setting that seemed like the best compromise at the time was an HDFS directory used by the Yarn Resource Managers to share node-label state, namely /user/yarn/node-labels. The main issue is that the Yarn RM doesn't tolerate safe mode (since it periodically writes to that directory, as far as I can see), so we'll need to find another solution.

In https://hadoop.apache.org/docs/r2.10.1/hadoop-yarn/hadoop-yarn-site/NodeLabel.html I can see the following for yarn.node-labels.fs-store.root-dir:

If user want to store node label to local file system of RM (instead of HDFS), paths like file:///home/yarn/node-label can be used

This should be a viable path, but it needs to be tested on the Hadoop test cluster first.
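
For reference, a sketch of what that yarn-site.xml change could look like (the local path here is only an example, and in our setup this would be rendered by puppet rather than edited by hand):

<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <!-- example local path on each RM, instead of the shared HDFS directory -->
  <value>file:///var/lib/hadoop-yarn/node-labels</value>
</property>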

I have also created T287027 to follow up on alarming, aggregate alerts are surely better for our use case.

I was going to mention a small snag: during the installation, an-master1001 prompted for which partman recipe to use.

...but then I found out that it was configured to wait for a human operator to check, via the serial console, that the existing partitions were being correctly re-used.
https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/install_server/files/autoinstall/netboot.cfg$64

So that's OK. It would have been nice if it had skipped the first menu of:

  • Guided - Use entire disk
  • Guided - Use entire disk and set up LVM
  • Guided - Use entire disk and set up encrypted LVM
  • Manual ( <-- This is the one that had to be selected )

...and waited at the "Are you sure you want to commit the changes shown above?" prompt instead.

However, I didn't fancy going any further down the d-i/partman rabbit hole just yet.

@BTullis the partman reuse script that we use auto-selects everything in d-i; the operator only has to confirm the partition layout (namely which partitions to keep/format etc., but they are all pre-selected). I joined the batcave with Razzi and I think he may have pressed "Enter" right after joining the serial console due to a blank screen, hence ending up in the Guided menu. When we went back to the previous tab everything was preselected correctly (as had happened with an-master1002).

After https://gerrit.wikimedia.org/r/705782 the manual confirmation will not be needed anymore (so d-i will preserve the partitions requested and auto-confirm).

Oh I see. So the "Enter" might just have been interpreted as "Return to partman menu".
In that case, everything would be fine, as you mention. 👍

Change 705782 merged by Razzi:

[operations/puppet@production] netboot: make an-masters reimage without confirmation

https://gerrit.wikimedia.org/r/705782