write migration plan for mailman
Closed, Resolved; Public

Description

write migration plan for the actual switch away from sodium to fermium after fermium has been reinstalled

https://etherpad.wikimedia.org/p/mailman-migration


| # | task | ticket | done? |
| --- | --- | --- | --- |
| 1 | request new VM for staging/testing | T108065 | yes |
| 2 | install jessie on new VM | T108070 | yes |
| 3 | let JohnLewis sign L2 | T108057 | yes |
| 3 | give JohnLewis shell access on new VM and sudo to execute things as "list" and view log files | T108082 | yes |
| 4 | basic semi-manual mailman 2.1.18 setup on new VM | T108383 | yes |
| 5 | set up rsyncd on fermium (via puppet) to be able to copy files directly without agent forwarding | T109921 | yes |
| 6 | export list configs and archives from sodium, rsync them all over to fermium | T108071 | yes |
| 7 | write script to import lists | T109922 | yes |
| 8 | test importing of list configs and archives on fermium for all lists (public and private) | T108073 | yes |
| 9 | rename lists with invalid names | T109539, T109393 | yes |
| 10 | move hardcoded IP configuration (server and service name) to hiera to be able to run more than 1 mailman instance from puppet role | T109624 | yes |
| 11 | clean up mailman data directory on sodium (over 0.5 million held messages) | T109838, T83967 | yes |
| 12 | write this plan :) | T109467 | in progress |
| 13 | go through all directories in /var/lib/mailman and decide whether they need to be imported or can be skipped | T109399 | yes |
| 14 | figure out which new service IP to use, v4 and v6, set it in hiera? | T108080 | yes |
| 15 | add public IP for fermium (DNS change, installserver/DHCP change) | T109923 | yes |
| 16 | reinstall OS (jessie) on fermium | T109924 | yes |
| 17 | apply regular mailman role on fermium | T109925 | no |
| 18 | test ferm rules are sufficient | T104980 | no |
| 19 | rsync all configs and archives one more time | T110129 | no |
| 20 | import all lists with the script we wrote for that | T110131 | no |
| 21 | one day before: lower lists.wikimedia.org TTL to 5 min | T110132 | no |
| 22 | announce scheduled downtime - need to debate and decide on a worst-case length. | T110133 | no |
| 23 | right before the switch: lower TTL to 10 seconds | T110135 | no |
| 24 | hold lists.wikimedia.org with exim (disable puppet on sodium; apply locally rather than via operations/puppet unless we want to hold all emails to fermium as well for 'safety'?) | T110136 | no |
| 25 | shut down mailman on sodium | T110137 | no |
| 26 | rsync one more time, this time only the diff since it was shut down | T110138 | no |
| 27 | rsync exim spool directory | T110440 | no |
| 28 | test sending individual mails from fermium | T110441 | no |
| 29 | switch over service IP | T110139 | no |
| 30 | send follow-up email, announce changes with new mailman version if any that have user impact ? | T110140 | no |
| 31 | profit? maybe - revert ideas for worst cases? | | no |
| 32 | TTL back up to normal 1H | T110141 | no |
| 33 | shut down sodium, celebrate "no more lucid", close all resolved tickets | T110142 | no |

Event Timeline

Dzahn claimed this task.
Dzahn raised the priority of this task to High.
Dzahn updated the task description.
Dzahn added subscribers: ori, MZMcBride, Dzahn and 5 others.
Dzahn renamed this task from "write migration plan for mailmanw" to "write migration plan for mailman". (Aug 18 2015, 4:34 PM)
Dzahn set Security to None.

migration plan for mailman

objectives:

  • move away from server sodium (lucid) to server fermium (jessie) to get rid of the last lucid box in all of WMF
  • upgrade mailman from 2.1.13 to 2.1.18

tracking ticket: https://phabricator.wikimedia.org/T105756

| # | task | ticket | done? |
| --- | --- | --- | --- |
| 1 | request new VM for staging/testing | T108065 | yes |
| 2 | install jessie on new VM | T108070 | yes |
| 3 | let JohnLewis sign L2 | T108057 | yes |
| 3 | give JohnLewis shell access on new VM and sudo to execute things as "list" and view log files | T108082 | yes |
| 4 | basic semi-manual mailman 2.1.18 setup on new VM | T108383 | yes |
| 5 | set up rsyncd on fermium (via puppet) to be able to copy files directly without agent forwarding | T109921 | yes |
| 6 | export list configs and archives from sodium, rsync them all over to fermium | T108071 | yes |
| 7 | write script to import lists | T109922 | yes |
| 8 | test importing of list configs and archives on fermium for all lists (public and private) | T108073 | yes |
| 9 | rename lists with invalid names | T109539, T109393 | yes |
| 10 | move hardcoded IP configuration (server and service name) to hiera to be able to run more than 1 mailman instance from puppet role | T109624 | yes |
| 11 | clean up mailman data directory on sodium (over 0.5 million held messages) | T109838, T83967 | yes |
| 12 | write this plan :) | T109467 | in progress |
| 13 | go through all directories in /var/lib/mailman and decide whether they need to be imported or can be skipped | T109399 | yes |
| 14 | figure out which new service IP to use, v4 and v6, set it in hiera? | T108080 | no |
| 15 | add public IP for fermium (DNS change, installserver/DHCP change) | T109923 | no |
| 16 | reinstall OS (jessie) on fermium | T109924 | no |
| 17 | apply regular mailman role on fermium | T109925 | no |
| 18 | test ferm rules are sufficient | T104980 | no |
| 19 | rsync all configs and archives one more time | T110129 | no |
| 20 | import all lists with the script we wrote for that | T110131 | no |
| 21 | one day before: lower lists.wikimedia.org TTL to 5 min | T110132 | no |
| 22 | announce scheduled downtime - need to debate and decide on a worst-case length. | T110133 | no |
| 23 | right before the switch: lower TTL to 10 seconds | T110135 | no |
| 24 | hold lists.wikimedia.org with exim (disable puppet on sodium; apply locally rather than via operations/puppet unless we want to hold all emails to fermium as well for 'safety'?) | T110136 | no |
| 25 | shut down mailman on sodium | T110137 | no |
| 26 | rsync one more time, this time only the diff since it was shut down | T110138 | no |
| 27 | switch over service IP | T110139 | no |
| 28 | send follow-up email, announce changes with new mailman version if any that have user impact ? | T110140 | no |
| 29 | profit? maybe - revert ideas for worst cases? | | no |
| 30 | TTL back up to normal 1H | T110141 | no |
| 31 | shut down sodium, celebrate "no more lucid", close all resolved tickets | T110142 | no |
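
The two TTL steps (21 and 23) only help once resolvers have dropped the record they cached under the previous TTL. A minimal sketch of that timing follows. The TTL values (1H normal, 5 min, 10 seconds) are from the plan; the function name and the example timestamps are hypothetical, and the sketch assumes resolvers honour the published TTL.

```python
from datetime import datetime, timedelta


def ttl_effective_at(changed_at: datetime, previous_ttl: timedelta) -> datetime:
    """Earliest time by which every well-behaved cache has expired the record
    it fetched under the previous TTL, so the new TTL applies everywhere."""
    return changed_at + previous_ttl


# TTL values taken from the plan.
normal_ttl = timedelta(hours=1)
day_before_ttl = timedelta(minutes=5)

# Hypothetical example times, not the real schedule.
lowered_to_5min = datetime(2015, 9, 1, 16, 0)   # step 21
lowered_to_10s = datetime(2015, 9, 2, 15, 50)   # step 23

five_min_in_force = ttl_effective_at(lowered_to_5min, normal_ttl)
ten_sec_in_force = ttl_effective_at(lowered_to_10s, day_before_ttl)

print(five_min_in_force)  # 2015-09-01 17:00:00
print(ten_sec_in_force)   # 2015-09-02 15:55:00
```

Under these assumptions, "right before the switch" means at least 5 minutes before, otherwise some caches may still hold the record with the 5 min TTL when the service IP changes.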

Problems a migration will have:

  • sodium and fermium would both be acting as mailman servers for lists.wikimedia.org -- solved by stopping mailman and holding exim
  • small window of latency for updates/risk of losing changes or actions if performed on sodium while we call fermium 'active'
  • emails lost by queued exim in migration -- move queued emails to fermium and hope exim can pick them up there

Tickets that will be solved:
T80945 (get rid of all lucid installs)
T80962 (upgrade mailman version)
T82698 (shutdown sodium)
T66818 (strict DMARC policy) - mailman settings mostly, solved in 2.1.16
T27231 (Mailman mailing list archiver truncates if a line begins with "From") solved by new version, right? fixed in 2.1.16
T90351 (improve SSL config), solved by Apache 2.4 upgrade, already checked
T83967 (cleanup held messages)
T109838 (cleanup held messages 2)
T104980 (ferm rules, they exist but have not been applied on prod., existing manual rules on sodium don't have to be copied)

Tickets that will be unblocked but not solved yet (on purpose, one step at a time):

T82576, T109239 (enabling STARTTLS)

Since I can't edit comments: 14 is done and depends on 15. Since nothing else depends on fermium's existence as is, and other work is blocked, I'm going to give Alex the okay to move fermium from private to public spaces.

Added the table to the description.

items from meeting:

"double check for mailman cronjobs"
"how long does rsync take" (for the final run, both)
"where does mailman store listinfo info"
"also tell ops list"
"stop exim, copy spool directories"
"exim commands / cheat sheet: show mail queue, send individual mail from queue, then multiple"

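A possible starting point for the exim cheat sheet item, written as small Python helpers that only build the command lines. The flags used (`-bp`, `-bpc`, `-M`, `-qf`) are standard exim options; the binary name `exim4` is an assumption (the Debian name), the helper names are hypothetical, and the message IDs in the demo are made up.

```python
import shlex

EXIM = "exim4"  # assumption: Debian binary name; plain "exim" elsewhere


def show_queue():
    """List the messages currently in the queue."""
    return [EXIM, "-bp"]


def count_queue():
    """Print only the number of queued messages."""
    return [EXIM, "-bpc"]


def deliver_messages(*message_ids):
    """Force a delivery attempt for one or more specific queued messages."""
    if not message_ids:
        raise ValueError("at least one message ID is required")
    return [EXIM, "-M", *message_ids]


def force_queue_run():
    """Run the whole queue, ignoring retry times."""
    return [EXIM, "-qf"]


# Made-up message IDs, for illustration only.
for argv in (
    show_queue(),
    deliver_messages("1ZUabc-0001Ab-CD"),
    deliver_messages("1ZUabc-0001Ab-CD", "1ZUabd-0001Ac-EF"),
    force_queue_run(),
):
    print(shlex.join(argv))
```

The order of the demo follows the meeting note: look at the queue, send one message, then several, then everything.
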
qfiles: this can be handled in two ways. We could stop mailman together with exim and rsync the qfiles, or we could hold exim and let mailman keep running for a defined period of time (10 minutes to be safe) before we stop it. With the second option we don't actually have to rsync the qfiles at all, and any held mail like pending and so on would also be cleaned up, which makes the rsync easier and cleaner and reduces the chance of duplicating outgoing emails.
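
For the second option, instead of relying only on a fixed wait, the queue directories could be checked before mailman is stopped. A minimal sketch, assuming the queues live in subdirectories of /var/lib/mailman/qfiles and that `bad` and `shunt` can be ignored because they never drain on their own; the function name and both assumptions are illustrative, not something decided in this task. The demo runs against a temporary directory.

```python
import os
import tempfile


def pending_qfiles(qfiles_root, ignore=("bad", "shunt")):
    """Return {queue_name: file_count} for queue directories that still hold files."""
    pending = {}
    for name in sorted(os.listdir(qfiles_root)):
        path = os.path.join(qfiles_root, name)
        if not os.path.isdir(path) or name in ignore:
            continue
        count = sum(1 for entry in os.scandir(path) if entry.is_file())
        if count:
            pending[name] = count
    return pending


# Self-contained demo; on the real host the root would be the qfiles directory.
with tempfile.TemporaryDirectory() as root:
    for queue in ("in", "out", "shunt"):
        os.mkdir(os.path.join(root, queue))
    open(os.path.join(root, "out", "example.pck"), "w").close()
    open(os.path.join(root, "shunt", "old.pck"), "w").close()
    demo_result = pending_qfiles(root)
    print(demo_result)  # {'out': 1}
```

An empty result would suggest mailman has finished processing and can be stopped without copying the qfiles.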

"how long does rsync take" -> an hour or a little bit more, depending on how long we wait between rsync runs. I actually tested this, and lowered the time a bit by deleting a bunch of crap from qfiles/bad and shunt files older than 7 days.

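The cleanup mentioned above (files older than 7 days) can be previewed before anything is deleted. A minimal sketch that only lists candidates by modification time; the function name is hypothetical, the 7-day threshold is from the comment, and the demo uses a temporary directory rather than the real qfiles paths.

```python
import os
import tempfile
import time


def files_older_than(directory, days, now=None):
    """Return sorted paths of regular files under `directory` whose
    modification time is more than `days` days in the past."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    old = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.getmtime(path) < cutoff:
                old.append(path)
    return sorted(old)


# Self-contained demo: one fresh file, one file backdated by 10 days.
with tempfile.TemporaryDirectory() as root:
    fresh = os.path.join(root, "fresh.pck")
    stale = os.path.join(root, "stale.pck")
    for path in (fresh, stale):
        open(path, "w").close()
    ten_days_ago = time.time() - 10 * 86400
    os.utime(stale, (ten_days_ago, ten_days_ago))
    stale_names = [os.path.basename(p) for p in files_older_than(root, days=7)]
    print(stale_names)  # ['stale.pck']
```

The sketch deliberately does not delete anything; the resulting list can be reviewed first.
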
"double check cronjobs" -> T110382

"where does mailman store listinfo info" -> in each list's config.pck; we run "withlist" to update it after import, but this is only necessary when the URL actually changes.

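Since the withlist run is only needed when the URL changes, the lists that need it can be selected up front. A minimal sketch, assuming the current web URL of each list has already been read from its config and that the update is done with Mailman 2.1's `fix_url` withlist script and its `-u` (URL host) option. The function name, the list names and the staging host are hypothetical.

```python
from urllib.parse import urlsplit


def fix_url_commands(current_urls, new_host, withlist="withlist"):
    """Build one withlist command per list whose URL host differs from new_host."""
    commands = []
    for listname, url in sorted(current_urls.items()):
        if urlsplit(url).hostname != new_host:
            commands.append(
                [withlist, "-l", "-r", "fix_url", listname, "-u", new_host]
            )
    return commands


# Hypothetical list names and URLs, for illustration only.
example_urls = {
    "example-l": "https://lists.wikimedia.org/mailman/",
    "staging-l": "https://staging.example.org/mailman/",
}

for argv in fix_url_commands(example_urls, "lists.wikimedia.org"):
    print(" ".join(argv))
# withlist -l -r fix_url staging-l -u lists.wikimedia.org
```

Lists whose URL already points at the target host produce no command, which matches the note that the step is skipped when the URL does not change.
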
"also tell ops" -> added on T110133
"stop exim, copy spool directories" -> T110440, T110136
"exim commands cheat sheet" -> T110441

resolving, all action items are done or have a follow-up ticket linked