Page MenuHomePhabricator

Beta puppetmaster cherry-pick process
Open, MediumPublic

Description

In order to increase the maintainability of the beta puppetmaster, and reduce the delta from production puppetmaster we need a process for handling puppet changes that are being tested on beta.

The current process in which deployment-prep admins simply cherry-pick a patch to the puppetmaster is unsustainable. The process leaves uncollected commits littered in beta, or, worse prevents beta from pulling in production updates.

Ideally, we could use this ticket to define a process that is:

  1. More visible than the current process
  2. More accountable than the current process
  3. Adds very little overhead to the current process
  4. Defines a clear set of criteria for when/how a cherry pick can be made on beta puppet master
  5. Defines a clear set of criteria for when/how a cherry pick can be removed without the involvement of patch authors necessarily
  6. Timeboxed so we don't have cherry picks to beta puppetmaster living on for many months.

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

* 4849f99 2015-10-20 (7 months ago) root adds self-signed certificate, which was needed to keep https working (ish) in beta until T50501: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org is done
I don't know what * 9a6f4f6 2016-03-05 (2 months ago) root is and it has no name or task. It adds not-already-defined checks around labs_lvm::volume in modules/role/manifests/labs/lvm/mnt.pp and modules/role/manifests/labs/lvm/srv.pp

* b0cd9dd 2016-02-11 (3 months ago) Timo Tijhof - this got merged in https://gerrit.wikimedia.org/r/#/c/270009/ and is now just adding a single line. I got rid of it.

It now looks like this:

root@deployment-puppetmaster:/var/lib/git/operations/puppet# git log --pretty=format:"%h � (%ar) %an - %s" --graph --date=short origin/production..HEAD
* fe0411e 2016-04-18 (4 weeks ago) Eric Evans - [WIP]: Cassandra 2.2.5 config
* a221853 2015-12-14 (5 months ago) Giuseppe Lavagetto - mediawiki: add conftool-specifc credentials and scripts
* df5862d 2016-04-21 (4 weeks ago) Mukunda Modell - Fix race in puppet::self (puppet.conf compilation)
* 97c86f6 2016-03-09 (10 weeks ago) Antoine Musso - hiera_lookup: recognize labs project and site
* 92a931c 2015-10-28 (7 months ago) Antoine Musso - cache: vary statsd_server with hiera
* 32863c1 2016-03-05 (2 months ago) root - wrap labs_lvm::mnt in if (defined) conditional
* b08e6b6 2016-01-27 (4 months ago) Antoine Musso - sysfs: puppet always restarted the sysfsutils service
* 0270bc2 2016-01-07 (4 months ago) Gergő Tisza - logstash: add sentry output plugin
* 6150e71 2016-01-06 (4 months ago) Gergő Tisza - Add output plugin for Sentry
* 1d62344 2015-06-12 (11 months ago) Moritz Mühlenhoff - Prevent access to hidden directories
* 5a264cd 2015-10-20 (7 months ago) root - [LOCAL HACK] Change star.wmflabs.org to beta certificate

Related tasks for those commits:

We should allow patches to be put on the puppetmaster to test them temporarily (say... for each version up to a week?), and patches to make beta behave more like production indefinitely.

We should allow patches to be put on the puppetmaster to test them temporarily (say... for each version up to a week?), and patches to make beta behave more like production indefinitely.

There definitely seem to be at least 2 classes of patches here that seem to map fairly well to the Beta-Cluster-Infrastructure and Beta-Cluster-reproducible projects.

It's fine, probably, if some of the infrastructure patches are blocked, but that should be documented somewhere (commit message or wikipage).

I would still like non-blocked patches of both types to be timeboxed via some kind of a dead man's switch. That is to say, it'd be nice if all the patches that weren't documented as blocked went away after 2 weeks or a month and if the owner really wanted to keep the cherry-pick or commit around some action would have to be taken.

This task is about coming up with a new process, not just cleaning up the current list of patches.

It's not always so straightforward. Sometimes the cherry picked patches have interdependencies, don't rebase cleanly, etc.

Even worse, people can easily trample eachothers' patches accidentally when doing git rebase -i on the puppetmaster.

As of today we have 13 patches cherry picked to beta of various ages by various authors:

thcipriani@deployment-puppetmaster:/var/lib/git/operations/puppet$ git status
On branch production
Your branch is ahead of 'origin/production' by 13 commits.
  (use "git push" to publish your local commits)

nothing to commit, working directory clean

thcipriani@deployment-puppetmaster:/var/lib/git/operations/puppet$ git log --pretty=format:"%h � (%ar) %an" --graph --date=short -13
* 969cd78 2016-04-18 (4 weeks ago) Eric Evans

[ ... ]

Can we start by cleaning/reviewing some of these?

This one should be getting merged RSN (today? tomorrow?). Removing it would downgrade the Cassandra configuration, breaking the cluster (forcing us to wipe and start-over). We could deal with this if it were necessary, but it wouldn't be awesome.

This one should be getting merged RSN (today? tomorrow?). Removing it would downgrade the Cassandra configuration, breaking the cluster (forcing us to wipe and start-over). We could deal with this if it were necessary, but it wouldn't be awesome.

No problem keeping it around. I was just hoping to get a better idea of what's currently cherry-picked and why so we can come up with a process for cherry picks and the maintenance of cherry-picks.

This one should be getting merged RSN (today? tomorrow?). Removing it would downgrade the Cassandra configuration, breaking the cluster (forcing us to wipe and start-over). We could deal with this if it were necessary, but it wouldn't be awesome.

No problem keeping it around. I was just hoping to get a better idea of what's currently cherry-picked and why so we can come up with a process for cherry picks and the maintenance of cherry-picks.

In this case, we entered this sort point-of-no-return the moment we cherry-picked the changeset and proceeded with the Cassandra upgrade; Anything other than leaving it in place until the changeset was merged was always going to mean forcibly downgrading (a risk I accepted at the time).

Any policy/process moving forward that put a hard upper-bound on the amount of time a cherry-picked changeset could be applied would probably run afoul of cases like this one.

Any policy/process moving forward that put a hard upper-bound on the amount of time a cherry-picked changeset could be applied would probably run afoul of cases like this one.

This is one of the critical points we need to pay attention to. It can be slow & difficult to roll out major changes to production. We need to make the transition as painless as possible. At the same time we are trying to make beta-cluster closely match production and remain stable for testing other changes.

I don't think beta cluster can be all things for all people but there is a lot of room for improvement. I think we need to explore the use-cases that beta cluster is expected to support and formalize processes that balance these needs appropriately.

hmm...why does it feel like I just said a whole lot of nothing?

So deployment-prep is a fairly poor simulation of a production wmf cluster and we have a lot of different kinds of testing going on simultaneously:

  • Automated continuous integration tests
    • I believe that this is a primary and critical role that should not be allowed to break due to other types of testing. Correct me if I'm wrong here.
  • Staging mediawiki changes before they go to production
  • Testing puppet changes before they go to production
  • Prototyping large infrastructure changes and new service deployments

Some of these uses are in conflict and we don't currently have any viable alternatives to fill these needs.

@Eevans Playing a maintenance on beta before rolling it on production is a perfectly legitimate use case and the whole point of beta cluster.

Limiting the amount of time a cherry pick or present or arbitrary drop will surely cause havoc. It does already whenever the auto rebase breaks.


My experience

About a better process: as part of my daily grind, almost every morning I would check for the Shinken monitoring ( http://shinken.wmflabs.org/problems ) and ninja fix puppet.git to adjust it for beta or the CI infrastructure (which also have its own puppet master).

Since the problem got fixed, there is little incentive for me to hunt for a reviewer from ops to get it merged. Though when it also fix a production issue I would reach out to the appropriate ops having the area of expertise.

In some other case, the ninja fix needs more work before asking for review. It fix the immediate problem but is a very rough patch that needs polishing. I eventually forget about the task/commit :-/


Lets merge!

Something that has been very successful are the Puppet SWAT windows. They are one hour deployment windows lead by the operations team. The process is straightforward:

From there, ops will have high confident it can land and be pushed to prod. Else they will skip it since Puppet SWAT patches must be straight forward / low impact.

For the couple or so time I have used puppet SWAT, I got batches of changes merged in just a dozen of minutes.

* 5cde85b 2016-01-07 (4 months ago) Gergő Tisza
* 3eb89da 2016-01-06 (4 months ago) Gergő Tisza

Those can be dropped; I added them for testing but haven't had time to work on that project for quite a while now.
Sorry for the noise.

We could easily implement patch expiration in a way which would strike a balance between the need for sanity and the need for longer lived patches:

I propose that the expiration of patches should be controlled by a TTL: field in the commit message. This way the author of the patch can have some control over how long the patch will remain. It could default to a short life if the field isn't present, but honor arbitrarily long TTL times if the commit has a one specified. We could have a maximum TTL or we could support TTL: forever if that seems useful.

We could easily implement patch expiration in a way which would strike a balance between the need for sanity and the need for longer lived patches:

I propose that the expiration of patches should be controlled by a TTL: field in the commit message. This way the author of the patch can have some control over how long the patch will remain. It could default to a short life if the field isn't present, but honor arbitrarily long TTL times if the commit has a one specified. We could have a maximum TTL or we could support TTL: forever if that seems useful.

Cherry pick I am doing are meant to stick around until they are merged on beta. I dont really make short lived / one time cherry picks.

I think it all boils down to the patch author working with SRE to have the patch polished and merged in puppet.git.

Here is the current state on the beta cluster puppetmaster:
git log --format='| %an | %s' HEAD@{upstream}..

Andrew OttoHieraize eventlogging_kafka_handler to allow selection of different kafka clients
Filippo Giunchedipuppetization for thumbor
Brandon Blacksslcert: regenerate dhparam.pem
Filippo Giunchedipuppetization for thumbor
Tyler CiprianiBeta: Add logstash host to scap.cfg
Tyler CiprianiBeta: Scap canary deploy dsh groups
Bryan Davislogstash: Parse nginx access logs for wdqs
Kartik MistryBeta: Fix cxserver restbase_url
amirores: Scap3 deployment configurations
Mukunda ModellFix race in puppet::self (puppet.conf compilation)
Antoine Mussocache: vary statsd_server with hiera
rootwrap labs_lvm::mnt in if (defined) conditional
Gergő Tiszalogstash: add sentry output plugin
Gergő TiszaAdd output plugin for Sentry
thcipriani lowered the priority of this task from High to Medium.Aug 23 2016, 5:01 PM

Lowing priority from high since nothing is happened here in a while.

Here's what's currently cherry picked:

Bryan Davislogstash: Tag Striker messages for indexing
Alex MonkRemove the hard-coded /a/mw-log references scattered around everywhere
Alex Monkdeployment-prep: Move udp2log to deployment-fluorine02
Marko ObrovacPDF Render Service: Role and module
Tyler Ciprianiscap: bump version to 3.2.3-1
Alex Monkbeta: Use Let's Encrypt cert
Tyler CiprianiBeta: Add logstash port
Bryan Davislogstash: Parse nginx access logs for wdqs
Brandon Blacksslcert: regenerate dhparam.pem
Mukunda ModellFix race in puppet::self (puppet.conf compilation)
Antoine Mussocache: vary statsd_server with hiera
Gergő Tiszalogstash: add sentry output plugin
Gergő TiszaAdd output plugin for Sentry
root[LOCAL HACK] wrap labs_lvm::mnt in if (defined) conditional

Most of this stuff seems like it will merge eventually.

Random proposal for moving this task forward, we could create a script that runs on $interval that does:

  1. Ensure every cherry-picked patch has a 'Bug: TXXXXXX'
  2. Check that task:
    • Closed? Remove cherry-pick
    • Doesn't have tag $cherryPickTag: Remove cherry-pick
  3. Poke the task by leaving a message as $systemUser user on $interval: "patch is still cherry picked on beta"

This would remove some of the current cherry picks, but it would make thing more accountable.

Here's what's currently cherry picked:

Tyler Ciprianiscap: bump version to 3.2.3-1

merged

Brandon Blacksslcert: regenerate dhparam.pem

already merged upstream, removed.

Gergő TiszaAdd output plugin for Sentry

Was for a completely different repo. removed.

root[LOCAL HACK] wrap labs_lvm::mnt in if (defined) conditional

Does anyone know what this one was needed for? I added the [LOCAL HACK] tag to it today when I found that it did not have a gerrit Change-Id. The commit message unhelpfully does not mention which roles were found to be in conflict for Labs_lvm::Volume['second-local-disk']. Judging by the patch contents it looks like something was applying both role::labs::lvm::srv and role::labs::lvm::mnt on the same host.

Now we are down to these:

@PcheloloChange-Prop: Enable file transclusion updates
@AlexMonk-WMFRemove the hard-coded /a/mw-log references scattered around everywhere
@mobrovacPDF Render Service: Role and module
@AlexMonk-WMFbeta: Use Let's Encrypt cert
@bd808logstash: Parse nginx access logs for wdqs
@hasharcache: vary statsd_server with hiera
@Tgrlogstash: add sentry output plugin
@PcheloloChange-Prop: Enable file transclusion updates

Pre-prod testing, should be merged early next week (not going to do it right before the week-end, as it has disruption potential)

@mobrovacPDF Render Service: Role and module

This is a new service that should enter production soon(TM). For the time being, we are testing it in Beta. The associated ticket is T143129: New service request - PDF Render.

Thanks @mobrovac. I think we are in pretty good shape.

@hasharcache: vary statsd_server with hiera

That is for T116898: On beta cluster varnish stats process points to production statsd eg have the Varnnish stats reported to the labs statsd instead of the production one. Still have to change the keyprefix. The patch is https://gerrit.wikimedia.org/r/#/c/249490/

root@deployment-puppetmaster:/var/lib/git/operations/puppet# git log --oneline origin/production..HEAD
d1d3613 wmflib: Return default dir when role::puppet::self isn't used
9fff4c3 contint: fix resource conflict with service::deploy::common
e8c0199 PDF Render Service: Role and module
5656c25 logstash: Parse nginx access logs for wdqs
0b5a2a4 logstash: add sentry output plugin

The first two are due to T120159 and T143065, the PDF render service one is addressed above

  1. Ensure every cherry-picked patch has a 'Bug: TXXXXXX'
  2. Check that task:
    • Closed? Remove cherry-pick
    • Doesn't have tag $cherryPickTag: Remove cherry-pick
  3. Poke the task by leaving a message as $systemUser user on $interval: "patch is still cherry picked on beta"

As I've been working through this and refining I've realized that this proposal would undoubtedly lead to some confusion and frustration. For a first iteration, I would like to instead propose:

  1. Check the change-id of each cherry pick against gerrit. If the status in gerrit is MERGED or ABANDONED, remove the cherry pick.
  2. Check if there is a Bug: TXXXXX in the commit, if there is and there hasn't been any activity on the task in 1 month add a message to the task that there is a cherry pick for this task on beta.

Basically, we'd let folks get used to beta being aware of patches and poking tasks and only remove patches when it's clear that the patch should no longer exist as a cherry-pick on beta.

This is a less aggressive approach than the initial proposal, but I think this is a better balance of the ease-of-use of beta-cluster cherry-picks and beta-cluster maintainers ability to monitor and track changes there.

Created https://phabricator.wikimedia.org/p/beta-puppetmaster/ bot user to comment on tasks. Refining puppet patch to implement a cron that will do what is described above.

Created https://phabricator.wikimedia.org/p/beta-puppetmaster/ bot user to comment on tasks. Refining puppet patch to implement a cron that will do what is described above.

Perhaps it would be also worth creating a similar phab tag that could be added automatically to tasks with the Bug: stanza on cherry-picks?

As I've been working through this and refining I've realized that this proposal would undoubtedly lead to some confusion and frustration. For a first iteration, I would like to instead propose:

  1. Check the change-id of each cherry pick against gerrit. If the status in gerrit is MERGED or ABANDONED, remove the cherry pick.
  2. Check if there is a Bug: TXXXXX in the commit, if there is and there hasn't been any activity on the task in 1 month add a message to the task that there is a cherry pick for this task on beta.

Sounds like a good plan to me! Especially #1 will be very important to keep beta self-maintaining as much as possible.

Re: 2. if a Bug: is missing a fallback might be to comment on the review itself.
Not for the first iteration but I'd argue that removing an unmerged patch that's been sitting in Beta for e.g. a quarter would be good too. Opting-in again by cherry-picking again in beta is easy enough anyways.

Change 310719 had a related patch set uploaded (by Thcipriani):
[WIP] Beta: Clean puppetmaster cherry-picks

https://gerrit.wikimedia.org/r/310719

Fixed up https://gerrit.wikimedia.org/r/310719 based on some review from @Volans.

Additional review is welcome! I think this could merge any time with minimal disruption.

root@deployment-puppetmaster02:/var/lib/git/operations/puppet# git log --oneline origin/production..HEAD
dd8fcf7762 Followup Ia5d07908: Fix sentry's base::service_unit to require correct class
b757ca8375 profile: fix udev reload dependency for swift::storage::labs
1522749ac1 logstash: Parse nginx access logs for wdqs
18f3c24c9f thcipriani: Fix for service dependency loops
9a856bda35 Change $deploy_user home directory to /var/lib/${deploy_user}
a00102fb5e varnish: Add --errorpage-noise trigger for easy testing
68195addd4 Scap: scap_source correct gid
cd7547c9ce swift: use implicit /dev/swift prefix for swift devices
27401356a3 swift: lower replication interval for beta
7df2bf76df Do the echo when running update.php
42a8e6578e Make /entity/ redirect internal
67a22aba41 Add 3d2png deploy repo to image scalers
51806cf9f4 interface: IPAddr.new() requires an address family
72da862e18 Add a git_repo to global scap.cfg
ebb1850387 service: use gzip for logging in uwsgi
78eff1f8f1 [LOCAL] Add cert for etcd in deployment-prep, hiera data for instance
90ae1ea1ff salt: fix grain-ensure comparison
3f30e5eced contint: move from /mnt to /srv
ad04b266e4 hacks to fix puppet --krenair
b1ba161f44 POC: Secure redirect service
06a97e32de deployment-prep: Make LVS config compatible with new requirements
88cb2ef622 [WIP] logstash: send errors to sentry
a7dceac3b8 logstash: parse runJobs messages

was 23 in krenair's post above, now 18:

gjg@deployment-puppetmaster02:/var/lib/git/operations/puppet$ git log --pretty=format:"%h (%ar) %an - %s" --graph --date=short -13
* 347d03ea6a (11 months ago) Alex Monk - puppetmaster: hacks to fix puppet logstash
* c7acd3ea4c (12 months ago) Alex Monk - deployment-prep: Make LVS config compatible with new requirements
* f691522d23 (9 days ago) Antoine Musso - prometheus: force ferm dns resolution to Ipv4
* a84afe7c2f (10 days ago) Moritz Muehlenhoff - Add support for stretch to hhvm::debug
* 863c77bab3 (7 weeks ago) Alex Monk - Fix mwrepl to require expanddblist dependency, from scap::scripts
* 39fa12377c (7 weeks ago) Alex Monk - Followup Ia5d07908: Fix sentry's base::service_unit to require correct class
* 9bfa6eddcb (1 year, 3 months ago) Bryan Davis - logstash: Parse nginx access logs for wdqs
* 2382d4d826 (2 months ago) Root - thcipriani: Fix for service dependency loops
* 023dd6c75a (5 months ago) Timo Tijhof - varnish: Add --errorpage-noise trigger for easy testing
* 23e7f77f1a (3 months ago) Tyler Cipriani - Scap: scap_source correct gid
* 5cc8f829b8 (3 months ago) Filippo Giunchedi - swift: use implicit /dev/swift prefix for swift devices
* 6feca41055 (7 months ago) Antoine Musso - swift: lower replication interval for beta
* 6d2289298c (8 months ago) Antoine Musso - interface: IPAddr.new() requires an address family
gjg@deployment-puppetmaster02:/var/lib/git/operations/puppet$ git log --oneline origin/production..HEAD | wc -l
18
  • 023dd6c75a (5 months ago) Timo Tijhof - varnish: Add --errorpage-noise trigger for easy testing

I've removed this one from beta cluster's puppetmaster.

root@deployment-puppetmaster03:/var/lib/git/operations/puppet# git log --pretty=format:"%h (%ar) %an - %s" --graph --date=short origin/production..HEAD
* 04256bbf72 (3 days ago) Alex Monk - Attempt to secure Puppet DB better
* e658326428 (3 days ago) Alex Monk - Puppetise simple no-CA class for deployment-dumps-puppetmaster02
* 3b5d916f6d (2 weeks ago) Alex Monk - Central certificates / Secure redir WIP
* 71f564f167 (3 days ago) Alex Monk - cumin: Allow Puppet DB backend to be used within Labs projects that use it
* 91778aa0fe (5 months ago) Alex Monk - swift: Fix checks on drive/filesystem titles to allow for labs ones
* ac20f079e3 (11 months ago) Filippo Giunchedi - swift: use implicit /dev/swift prefix for swift devices
* 0b7d70f7b9 (5 days ago) Timo Tijhof - webperf: Add statsv, navtiming and coal to scap::sources
* 6c5de6eec5 (11 months ago) Tyler Cipriani - Scap: scap_source correct gid
* 1da2ac8835 (10 days ago) Alex Monk - Allow PuppetDB use on standalone puppetmasters
* d322869448 (4 weeks ago) Mukunda Modell - WIP: Add phabricator config for the new swift backend
* fc711b8c3d (4 weeks ago) Mukunda Modell - Add account for phabricator_files to swift::params::accounts
* 9c16df4168 (5 months ago) Root - Hack profile::base::firewall to prevent dupe definition
* ff6ddbe8d9 (7 months ago) Filippo Giunchedi - hieradata: add redis stretch deployment-prep instances
* 6a822dc76b (8 months ago) Antoine Musso - prometheus: force ferm dns resolution to Ipv4
* 2dd3f86432 (1 year, 7 months ago) Alex Monk - puppetmaster: hacks to fix puppet logstash
* 2684f4571f (1 year, 8 months ago) Alex Monk - deployment-prep: Make LVS config compatible with new requirements
* 43a6d4a39c (10 months ago) Alex Monk - Fix mwrepl to require expanddblist dependency, from scap::scripts
* 44cd03158f (11 months ago) Root - thcipriani: Fix for service dependency loops
* 822bc923c4 (1 year, 2 months ago) Antoine Musso - swift: lower replication interval for beta
* ac2e48f579 (1 year, 4 months ago) Antoine Musso - interface: IPAddr.new() requires an address family
* 901afdfa01 (2 years, 5 months ago) Gergő Tisza - [WIP] logstash: send errors to sentry

I wonder if we should start explicitly separating these lists of cherry-picked commits into those which are open in gerrit (i.e., already pending review) vs. those which are not

I made a very crappy script at deployment-puppetmaster03:/root/make_T135427_table.py containing this:

make_T135427_table.py
#!/usr/bin/python3
import json
import os
import urllib.request
gerrit_req_url = 'https://gerrit.wikimedia.org/r/changes/?q={}'
out = os.popen('git --git-dir=/var/lib/git/operations/puppet/.git log --pretty=format:"%h NOONEWILLUSETHISSEPARATOR %ar NOONEWILLUSETHISSEPARATOR %an NOONEWILLUSETHISSEPARATOR %s NOONEWILLUSETHISSEPARATOR %b NOONEWILLUSETHISLINEBREAK" --date=short origin/production..HEAD').read()
print('<table>')
print('<tr><th>Hash</th><th>Time</th><th>Author</th><th>Subject</th><th>Change URL</th><th>Task</th></tr>')
for commit in out.split('NOONEWILLUSETHISLINEBREAK\n'):
    if not commit:
        continue
    hash, time, author, subject, message = commit.split(' NOONEWILLUSETHISSEPARATOR ')
    change_url = None
    task = None
    for line in message.splitlines():
        if line.startswith('Change-Id:'):
            _, change_id = line.split(':')
            change_id = change_id.strip()
            resp = json.loads(urllib.request.urlopen(gerrit_req_url.format(change_id)).read()[5:-1].decode('utf-8'))
            if not len(resp):
                break
            gerrit_change, = resp
            change_url = 'https://gerrit.wikimedia.org/r/c/{}'.format(gerrit_change['_number'])
            if gerrit_change['status'] != 'NEW':
                change_url  = ' - '   gerrit_change['status']
        elif line.startswith('Bug:'):
            _, task_id = line.split(':')
            task = task_id.strip()
    print('<tr>')
    for item in (hash, time, author, subject, change_url, task):
        print('<td>{}</td>'.format(item))
    print('</tr>')

print('</table>')

It prints this:

HashTimeAuthorSubjectChange URLTask
f14563b86f5 days agoTimo Tijhofwebperf: Add statsv, navtiming and coal to scap::sourceshttps://gerrit.wikimedia.org/r/c/436601T195314
8d16ba10293 days agoAlex MonkAttempt to secure Puppet DB betterNoneNone
b07dd175193 days agoAlex MonkPuppetise simple no-CA class for deployment-dumps-puppetmaster02NoneNone
41f26c4e4d2 weeks agoAlex MonkCentral certificates / Secure redir WIPNoneT194962
e4283876423 days agoAlex Monkcumin: Allow Puppet DB backend to be used within Labs projects that use ithttps://gerrit.wikimedia.org/r/c/437052None
062a446d7b5 months agoAlex Monkswift: Fix checks on drive/filesystem titles to allow for labs oneshttps://gerrit.wikimedia.org/r/c/402758T184236
454dea246e11 months agoFilippo Giunchediswift: use implicit /dev/swift prefix for swift deviceshttps://gerrit.wikimedia.org/r/c/361648T163673
6c5de6eec511 months agoTyler CiprianiScap: scap_source correct gidhttps://gerrit.wikimedia.org/r/c/361796None
1da2ac883510 days agoAlex MonkAllow PuppetDB use on standalone puppetmastershttps://gerrit.wikimedia.org/r/c/435631T194962
d3228694484 weeks agoMukunda ModellWIP: Add phabricator config for the new swift backendhttps://gerrit.wikimedia.org/r/c/432533None
fc711b8c3d4 weeks agoMukunda ModellAdd account for phabricator_files to swift::params::accountshttps://gerrit.wikimedia.org/r/c/432528None
9c16df41685 months agoRootHack profile::base::firewall to prevent dupe definitionNoneNone
ff6ddbe8d97 months agoFilippo Giunchedihieradata: add redis stretch deployment-prep instanceshttps://gerrit.wikimedia.org/r/c/386869T148637
6a822dc76b8 months agoAntoine Mussoprometheus: force ferm dns resolution to Ipv4https://gerrit.wikimedia.org/r/c/381073 - ABANDONEDT176314
2dd3f864321 year, 7 months agoAlex Monkpuppetmaster: hacks to fix puppet logstashNoneNone
2684f4571f1 year, 8 months agoAlex Monkdeployment-prep: Make LVS config compatible with new requirementshttps://gerrit.wikimedia.org/r/c/316512None
43a6d4a39c10 months agoAlex MonkFix mwrepl to require expanddblist dependency, from scap::scriptshttps://gerrit.wikimedia.org/r/c/372764None
44cd03158f11 months agoRootthcipriani: Fix for service dependency loopsNoneT171173
822bc923c41 year, 2 months agoAntoine Mussoswift: lower replication interval for betahttps://gerrit.wikimedia.org/r/c/344387T160990
ac2e48f5791 year, 4 months agoAntoine Mussointerface: IPAddr.new() requires an address familyhttps://gerrit.wikimedia.org/r/c/336840 - ABANDONEDNone
901afdfa012 years, 5 months agoGergő Tisza[WIP] logstash: send errors to sentryhttps://gerrit.wikimedia.org/r/c/263024T85239

@Krenair: I could probably make a conduit endpoint to receive that table and update the task from inside a cron job.

@Krenair: I could probably make a conduit endpoint to receive that table and update the task from inside a cron job.

I'm not sure if we want to go to quite that much effort, I certainly didn't in the above script, but wouldn't the existing maniphest.edit endpoint suffice if you really wanted to?

@Krenair: Good point, that probably would do it.

Anyway just to be more unhelpful I'm going to take that automatically generated table and add helpful manual comments to it:

HashTimeAuthorSubjectChange URLTaskComments
f14563b86f5 days agoTimo Tijhofwebperf: Add statsv, navtiming and coal to scap::sourceshttps://gerrit.wikimedia.org/r/c/436601T195314puppet/deployment development
8d16ba10293 days agoAlex MonkAttempt to secure Puppet DB betterNoneNonepuppet development, could be removed safely
b07dd175193 days agoAlex MonkPuppetise simple no-CA class for deployment-dumps-puppetmaster02NoneNonebeta-specific puppet infra thing to try to make -snapshot01 less of an edge case. I basically shifted the problem a bit - might be able to get rid of this, not sure
41f26c4e4d2 weeks agoAlex MonkCentral certificates / Secure redir WIPNoneT194962puppet/infrastructure development for something primarily being developed in deployment-prep intending to go to prod later. not yet ready for review. could theoretically be worked on outside beta
e4283876423 days agoAlex Monkcumin: Allow Puppet DB backend to be used within Labs projects that use ithttps://gerrit.wikimedia.org/r/c/437052Nonepuppet/infrastructure development, could be removed safely (though this would be sad). get ops to review
062a446d7b5 months agoAlex Monkswift: Fix checks on drive/filesystem titles to allow for labs oneshttps://gerrit.wikimedia.org/r/c/402758T184236unbreaks puppet on ms-be boxes. might be easier to get through review if we could get rid of the dependency (see below)
454dea246e11 months agoFilippo Giunchediswift: use implicit /dev/swift prefix for swift deviceshttps://gerrit.wikimedia.org/r/c/361648T163673dependency of commit that unbreaks puppet on ms-be boxes. I forget why this one was involved, I think maybe it used to be a cherry-pick while in development and then our puppet fix above had to depend on it. Was authored by ops so it's not a usual candidate for becoming stuck in gerrit
6c5de6eec511 months agoTyler CiprianiScap: scap_source correct gidhttps://gerrit.wikimedia.org/r/c/361796Nonecan't remember what this was
1da2ac883510 days agoAlex MonkAllow PuppetDB use on standalone puppetmastershttps://gerrit.wikimedia.org/r/c/435631T194962dependency of central certificates service commit above, brings beta closer to prod - ask ops to review
d3228694484 weeks agoMukunda ModellWIP: Add phabricator config for the new swift backendhttps://gerrit.wikimedia.org/r/c/432533Nonedunno
fc711b8c3d4 weeks agoMukunda ModellAdd account for phabricator_files to swift::params::accountshttps://gerrit.wikimedia.org/r/c/432528Nonepresumably dependency of the above
9c16df41685 months agoRootHack profile::base::firewall to prevent dupe definitionNoneNonesounds like this is to fix puppet issues, though it seems more like a workaround than an actual fix - make a task to investigate killing this?
ff6ddbe8d97 months agoFilippo Giunchedihieradata: add redis stretch deployment-prep instanceshttps://gerrit.wikimedia.org/r/c/386869T148637ticket was closed, patch was made by ops so not a usual candidate to get stuck in gerrit - patch looks like a very simple deployment-prep-only change anyway, should be (relatively) easy to get reviewed
6a822dc76b8 months agoAntoine Mussoprometheus: force ferm dns resolution to Ipv4https://gerrit.wikimedia.org/r/c/381073 - ABANDONEDT176314ticket is (well, was) a tangent to the problem being worked around, complicated problem seemingly deep within perl's Net::DNS library, see T153468. this particular patch is -2'd in Gerrit, it's really more of a bad workaround
2dd3f864321 year, 7 months agoAlex Monkpuppetmaster: hacks to fix puppet logstashNoneNoneI should probably re-review this and figure out why prod wasn't also having problems with this
2684f4571f1 year, 8 months agoAlex Monkdeployment-prep: Make LVS config compatible with new requirementshttps://gerrit.wikimedia.org/r/c/316512NoneI think this is just to stop puppet problems in beta relating to services that would normally be behind LVS in prod (we can't use LVS due to nova-network's source destination checking on IP packets IIRC)
43a6d4a39c10 months agoAlex MonkFix mwrepl to require expanddblist dependency, from scap::scriptshttps://gerrit.wikimedia.org/r/c/372764NoneMakes a dependency that already exists explicit in puppet, unfortunately Jenkins doesn't like this anymore as they added new rules. I found Ariel had also cherry-picked this to deployment-dumps-puppetmaster
44cd03158f11 months agoRootthcipriani: Fix for service dependency loopsNoneT171173fixed puppet error in beta, might be possible to make beta more like prod instead, asking on ticket
822bc923c41 year, 2 months agoAntoine Mussoswift: lower replication interval for betahttps://gerrit.wikimedia.org/r/c/344387T160990Ticket may have been incorrectly closed if it depended on this patch being in place as it was only a cherry-pick. If it didn't depend on this commit we should be able to remove the commit from our cherry-picks and abandon the patch in gerrit
ac2e48f5791 year, 4 months agoAntoine Mussointerface: IPAddr.new() requires an address familyhttps://gerrit.wikimedia.org/r/c/336840 - ABANDONEDNoneI think this was only in beta for testing that it wouldn't break existing installs as no beta machine appears to have the version of ruby that was apparently problematic here
901afdfa012 years, 5 months agoGergő Tisza[WIP] logstash: send errors to sentryhttps://gerrit.wikimedia.org/r/c/263024T85239puppet/infrastructure development for something probably intending to go to prod later

I've gone through and hash-tagged the Gerrit patches with beta-picked and restored any that were marked as abandoned:

https://gerrit.wikimedia.org/r/#/q/hashtag:beta-picked is:open

I'm hoping this'll be an easier way to maintain.

It should also provide better visibility to SRE when interacting with such patch. For example, when abandoned such patch, the hash-tag helps remember that they need to be unpicked in Beta. And that if such patch is amended/merged, to check Beta's puppetmaster afterwards for rebase conflicts and to resolve those as needed.

17:

HashTimeAuthorSubjectChange URLTaskComments
f3fe7d568b4 months agoAlex Monk[WIP] Central certificates servicehttps://gerrit.wikimedia.org/r/c/441991T194962puppet/infrastructure development for something being tested in deployment-prep intending to go to prod later. not yet ready for review. could theoretically be worked on outside beta
d8c78954912 weeks agoGiuseppe Lavagettomediawiki: move php to a profile, use the php classhttps://gerrit.wikimedia.org/r/c/453093T201140
b6ee8bc2693 months agoAlex Monkcumin: Allow Puppet DB backend to be used within Labs projects that use ithttps://gerrit.wikimedia.org/r/c/437052Nonepuppet/infrastructure development, could be removed safely (though this would be sad). get ops to review
ba09e14a2d7 weeks agoMoritz MuehlenhoffMove declaration of diamond package out of diamond classhttps://gerrit.wikimedia.org/r/c/446242T183454
44b7c0fd373 months agoGilles DubucServe WebP variants for the hottest thumbnailshttps://gerrit.wikimedia.org/r/c/434055T27611
66c14a774a8 weeks agoroot[LOCAL HACK] tls certs for deployment-elastic*NoneNone
c33430ddc23 months agoAlex MonkRe-combine labs and production exim minimal confighttps://gerrit.wikimedia.org/r/c/439774NoneThis makes our wiki mail run through deployment-mx02 instead of the production MXes
876e58014b3 months agoAlex MonkAttempt to secure Puppet DB betterNoneNonepuppet development, could be removed safely
c6628d36503 months agoAlex MonkPuppetise simple no-CA class for deployment-dumps-puppetmaster02NoneNonebeta-specific puppet infra thing to try to make -snapshot01 less of an edge case. I basically shifted the problem a bit - might be able to get rid of this, not sure
837baf92c88 months agoAlex Monkswift: Fix checks on drive/filesystem titles to allow for labs oneshttps://gerrit.wikimedia.org/r/c/402758T184236unbreaks puppet on ms-be boxes. might be easier to get through review if we could get rid of the dependency (see below)
51ce87290d1 year, 2 months agoFilippo Giunchediswift: use implicit /dev/swift prefix for swift deviceshttps://gerrit.wikimedia.org/r/c/361648T163673dependency of commit that unbreaks puppet on ms-be boxes. I forget why this one was involved, I think maybe it used to be a cherry-pick while in development and then our puppet fix above had to depend on it. Was authored by ops so it's not a usual candidate for becoming stuck in gerrit
14cb98815d1 year, 2 months agoTyler CiprianiScap: scap_source correct gidhttps://gerrit.wikimedia.org/r/c/361796None
23aa019c794 months agoMukunda ModellAdd account for phabricator_files to swift::params::accountshttps://gerrit.wikimedia.org/r/c/432528None
736484f7908 months agoRootHack profile::base::firewall to prevent dupe definitionNoneNonesounds like this is to fix puppet issues, though it seems more like a workaround than an actual fix - make a task to investigate killing this?
2782db370d11 months agoAntoine Mussoprometheus: force ferm dns resolution to Ipv4https://gerrit.wikimedia.org/r/c/381073T176314ticket is (well, was) a tangent to the problem being worked around, complicated problem seemingly deep within perl's Net::DNS library, see T153468. this particular patch is -2'd in Gerrit, it's really more of a bad workaround
6c883f71da1 year, 5 months agoAntoine Mussoswift: lower replication interval for betahttps://gerrit.wikimedia.org/r/c/344387T160990Ticket may have been incorrectly closed if it depended on this patch being in place as it was only a cherry-pick. If it didn't depend on this commit we should be able to remove the commit from our cherry-picks and abandon the patch in gerrit
ac40281c792 years, 8 months agoGergő Tisza[WIP] logstash: send errors to sentryhttps://gerrit.wikimedia.org/r/c/263024T85239puppet/infrastructure development for something possibly intending to go to prod later?

This process is still pretty broken. In fact it's blocking me right now. Not that I have any good solutions, mind you, this is more of a poke to make sure it stays on people's radar.

Today I found out that a local hack had conflicts with upstream changes some time ago and fixed beta puppet updates. We had 17 patches not in ops/puppet, I removed one:

HashTimeAuthorSubjectChange URLComments
d4a81012e45 weeks agoTaavi Väänänenredis::multidc: Make discovery optionalhttps://gerrit.wikimedia.org/r/c/669447unbreaks things, should be reviewed and merged by sre
c9c0d48da95 weeks agoGiuseppe Lavagettoprofile::etcd::v3: use puppet certs for standalone clusterhttps://gerrit.wikimedia.org/r/c/668701
33c11044815 weeks agoroot[LOCAL WIP] Fix kafka trust store location --taaviwas my attempt T276521, didn't work out
729d4cfcbf1 year, 1 month agoAlex Monk[LOCAL HACK] Allow external access from anywhere to parsoid port 80 for CI purposes
e3af1260842 years, 10 months agoAlex Monk[LOCAL HACK] Attempt to secure Puppet DB better
3bcae93e335 months agoroot[HACK] thcipriani: fix cache-text maybe
579d46eef110 months agoAlex Monk[LOCAL HACK] Make kafka_config default cluster logic actually work
1b09ba1b812 years, 7 months agoTyler CiprianiBeta: maintenance: no openldap managementhttps://gerrit.wikimedia.org/r/c/462020T125976
301783adfd2 years, 7 months agoTyler CiprianiBeta: maintenance: skip mediawiki::state functionhttps://gerrit.wikimedia.org/r/c/462019T125976
14632544f72 years, 1 month agoAlex Monk[LOCAL HACK] Crappy T153468 workaround: Drop AAAA rules
4d9dd724582 years, 1 month agoAlex Monkscap: Make wmflabs php7 behaviour match prod'shttps://gerrit.wikimedia.org/r/c/499025T219242
fcf57c63262 years, 10 months agoAlex MonkRe-combine labs and production exim minimal confighttps://gerrit.wikimedia.org/r/c/439774
9afc44f6622 years, 9 months agoroot[LOCAL HACK] tls certs for deployment-elastic*
07b8b15bfb2 years, 10 months agoAlex Monk[LOCAL HACK] Puppetise simple no-CA class for deployment-dumps-puppetmaster02Is this still needed? dumps-puppetmaster02 does not exist
5135bbd1533 years, 10 months agoFilippo Giunchediswift: use implicit /dev/swift prefix for swift deviceshttps://gerrit.wikimedia.org/r/c/361648T163673
31c86190773 years, 10 months agoTyler CiprianiScap: scap_source correct gidhttps://gerrit.wikimedia.org/r/c/361796
6b09717cb85 years agoGergő Tisza[WIP] logstash: send errors to sentryhttps://gerrit.wikimedia.org/r/c/263024 - ABANDONEDremoved by Majavah today per T85239

Change 310719 had a related patch set uploaded (by Thcipriani; author: Thcipriani):

[operations/puppet@production] Beta: Clean puppet cherry-picks

https://gerrit.wikimedia.org/r/310719

I've written a runbook for our team which I understand is being circulated by various people to reference how to do puppet cherry-picks. This isn't intended as anything official, but I guess it's worth noting here that it exists:

https://wikitech.wikimedia.org/wiki/Performance/Runbook/Puppet_patches#Beta_Cluster_testing

Change 310719 abandoned by Thcipriani:

[operations/puppet@production] Beta: Clean puppet cherry-picks

Reason:

very outdated

https://gerrit.wikimedia.org/r/310719

As of now, we are down from 20 to just 6 local patches. Two are fairly recent and being worked on in Gerrit. The other four are multi-year old local hacks that I've uploaded to Gerrit today and tagged accordingly.

The local patches are now all under this public query: https://gerrit.wikimedia.org/r/q/hashtag:beta-cherry-picked is:open.

@thcipriani I'm aware a better future always seems "just" six months away, but given the above, perhaps we can institute this policy between RelEng and QTE as soon as possible. If it ends up that we only need the policy for a few months because it's obsolete with new infra, that seems like a great problem to have - fueled by proving the disvalue of the policy. Seems like a small-cost/big-reward kind of thing if we can pitch and adopt this soon, so that the burden doesn't keep acculumating and punted elsewhere.

From my perspective the key commitments would be:

  • for "nice to have" patches, the author has to undo them within 7 days (after that, anyone can move to Gerrit and undo; author may lobby for review through normal channels, subject to owner/reviewer priorities).
  • for critical patches that otherwise result in visible breakage, the author has to submit them to Gerrit as well, and escalate to relevant teams that own the same thing in production to reconcile it within 7 days.

I think this sets a fairly low bar, e.g. we're not asking for feature parity or UBN-like on-call priority. Just making sure stuff isn't invisible, prone to rebase conflicts, and lingering. Given that we only have six local patches today, the status quo is effectively already that we just keep it working somehow within Puppet, not parity per-se. I'm hoping this is a small but meaningful commitment.

Update from various sidebar chats:

  1. challenge: teams are not required to fix Beta, given lack of org-wide agreement/funding foe teams to do in Beta what they do in prod.
  2. concern: automated rollbacks could break Beta (i.e. worse than "dirty state" where at least it works).
  3. scripting: make it easy for people to clean up after themselves.

From my point of view, the challenge with buy-in is very real but I think can be cleanly separated from this policy as this policy is specifically about cases where someone did decide to make a change in Beta but then left it in a dirty state while doing so.

I think from my POV, the part of the outcome I value most is, when someone chooses to test something new, to fix something that was broken, or to sync something in Beta with prod; that the policy requires them to clean that up within a certain number of days. I'd like to note that people already do so in most cases, but forgetting adds up, and without a policy, this keeps accumulating and shifting burden to others.

The policy would allos us to effectively "require" with a little bit of formal pressure (if needed) that people either clean up the test, or finish the fix through puppet, or safely back it out until a time they can do it properly (or be okay with someone else reverting it on their behalf if no longer needed).

I don't mind if that initially happens without autononous automation. I agree with Tyler that unattended automation rolling back changes may be too disruptive to start with. Having an IRC alert, or even just a command that prints a date and number ad-hoc with link to the policy that says "If this date is X in the past, you may ask this person, and that person should do stuff", would be a good start I think.

In terms of scripting and accessibility, the current way I tend to remove patches is with a single command: "git rebase -i". This interactive interface allows one to remove a single stale patch in fairly user friendly way. We could provide wrappers or some automation for that if needed.

Update from various sidebar chats:

  1. challenge: teams are not required to fix Beta, given lack of org-wide agreement/funding foe teams to do in Beta what they do in prod.
  2. concern: automated rollbacks could break Beta (i.e. worse than "dirty state" where at least it works).
  3. scripting: make it easy for people to clean up after themselves.

From my point of view, the challenge with buy-in is very real but I think can be cleanly separated from this policy as this policy is specifically about cases where someone did decide to make a change in Beta but then left it in a dirty state while doing so.

I think from my POV, the part of the outcome I value most is, when someone chooses to test something new, to fix something that was broken, or to sync something in Beta with prod; that the policy requires them to clean that up within a certain number of days. I'd like to note that people already do so in most cases, but forgetting adds up, and without a policy, this keeps accumulating and shifting burden to others.

The policy would allos us to effectively "require" with a little bit of formal pressure (if needed) that people either clean up the test, or finish the fix through puppet, or safely back it out until a time they can do it properly (or be okay with someone else reverting it on their behalf if no longer needed).

Sounds like the outcome needed here is agreement about a 7-day policy being the norm for patches left on the beta puppetmaster. I'd be fine with that if @Jrbranaa is fine with that. We could add a note to Deployment-prep/How code is updated on Wikitech.

I don't mind if that initially happens without autononous automation. I agree with Tyler that unattended automation rolling back changes may be too disruptive to start with. Having an IRC alert, or even just a command that prints a date and number ad-hoc with link to the policy that says "If this date is X in the past, you may ask this person, and that person should do stuff", would be a good start I think.

In terms of scripting and accessibility, the current way I tend to remove patches is with a single command: "git rebase -i". This interactive interface allows one to remove a single stale patch in fairly user friendly way. We could provide wrappers or some automation for that if needed.

Yeah, my only worry is what happens when you have a patch sit for longer than 7 days. Auto-removing a critical fix may make beta even less stable. An IRC alert would notify people in #wikimedia-releng, but maybe not the owner of the patch. I'd be fine poking the gerrit change, I suppose (if we require one).

From my perspective the key commitments would be:

  • for "nice to have" patches, the author has to undo them within 7 days (after that, anyone can move to Gerrit and undo; author may lobby for review through normal channels, subject to owner/reviewer priorities).
  • for critical patches that otherwise result in visible breakage, the author has to submit them to Gerrit as well, and escalate to relevant teams that own the same thing in production to reconcile it within 7 days.

As I'm looking at the currently cherry-picked patches, the ones titled [BETA HACK] all require rework and investigation (and would likely cause critical breakage to remove). That is, some patches are made to "fix" beta fast and can't merge in their current state. I wonder what to do with those, too—7 days seems like short turn around for those.


Seems like there are two classes of patches:

  1. Author is testing out a puppet change they intend to merge
  2. Fixing beta so it works again

The former can be removed via an automated process every 7 days. The later should adhere to some convention so it's not accidentally removed, but can still have some automated alerting for folks relevant to the fix (not necessarily the patch author). What do you think of that @Krinkle?

I don't think we need to tolerate "fixing beta" patches in local state for more than 7 days, either. It can either be reverted similarly if the broken state isn't too bad, or submitted to Gerrit and either merged there with a minimal logic branch (e.g. ternary condition) or addressed "properly" for some definition of properly.

The existing five (Oct 2023) three (Dec 2023) two (Feb 2024) patches we have we can deal with separately, but this is about no longer introducing new such patches in a lasting manner. For production we tolerate this for seconds at the most, if ever. 7-days seems like a long enough period of time to express that Beta is of lower priority to figure out the appropiate short-term solution in a way that is still within the Puppet repo instead of outside of it.

The point being that tolerating it longer, means we tolerate other people dealing with the undocumented and divergent state and rebasing changes to areas they're not familiar with. If we tolerate that, I don't see the point of having the policy.

In other words: I agree of course that they have wildely different reasons and impact, but the end result and motivation for the policy is exactly the same. Don't require other people to clean up the mess. And if a mess does happen, deal with it in a minimum viable manner within 7 days. This usually just means reaching out to relevant folks on IRC or Ops-l and asking for a review or puppet deploy window. There's afaik nothing inherent about these kinds of hacks that makes them unmergable, its just that nobody tries or is expected to.

Post-poning this inevitable work incurs expontential costs on everyone else. And unlike other tech debt, these hacks tend not to get go away on their own, i.e. there isn't a % chance where if we ignore it long enough it'll somehow solve itself.