About the series
This is part of a series of posts dedicated to talking about the biggest fears that we face as developers.
There’s always a first time
If you’re a new dev, you might be thinking: “Ha! It’s never happened to me”. Well, sorry to ruin your moment, but it’s just a matter of time.
On the other hand, people who have been coding for a while have probably already done it, or know someone who has, and have some great stories to tell!
But don’t get me wrong, this is a good thing. Other than giving you funny stories to tell at the office, whether it was you or someone else, breaking production is one of the best opportunities you'll get to learn and grow.
And speaking of funny stories, let me tell you 3 of my favorites. But more importantly, let’s see what we all learned from them and how that helped us improve.
The Jr hire that deployed on the first day 😱
It was the first day for a new junior dev. As part of the onboarding process, we let them follow the README with instructions on how to pull and set up the project locally.
We were focused on our stuff when we suddenly received a notification that a production deploy had been made... by the new guy.
After a few seconds of panic, we determined that the deploy was triggered by a push to the canonical repo with no changes. Fortunately, nothing got broken and there were no casualties. What happened next?
Created a permissions policy: we realized that we weren't keeping track of who had access to what, and that allowed the guy to push to canonical when he shouldn't have been able to. After that, all grants were revoked and a new process was set up to ask for access as needed.
Improved the README: we also noticed that the root of the problem was that the document wasn’t clear on how and where to run the command. So we updated it and also started encouraging people to update it during onboarding if they notice something wrong with it.
The SQL query without WHERE condition 😬
This is a common one, especially if you work with data.
There were a bunch of queries that needed to be executed to update records in the database. The guy was selecting and running one query at a time, and at some point he started screaming: "rollback, rollback!!".
He had only half selected the last query and left out the WHERE clause, so it updated ALL the records in the table.
What did we learn from it?
Backups are really important: thankfully, a backup was created before running the queries so it was easy to restore it. However, especially if it's a routine process, it's really easy to forget about backups and how important they are. Always make sure to create copies before starting any risky process.
Always test before running it live: it doesn't matter if it's a query, a command, or a script. It's important to have another environment to test before doing it in production.
💡 pro-tip: Start your queries by writing the WHERE clause. That way you make sure you don't forget about it.
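Another safety net for the missing-WHERE scenario is to run the statement inside a transaction and commit only if it touched the number of rows you expected. Here's a minimal sketch of that idea using Python's built-in sqlite3 module. The guarded_update helper, the users table, and its columns are all made up for illustration, and your database driver may handle transactions differently.

```python
import sqlite3


def guarded_update(conn, sql, params, expected_rows):
    """Run one UPDATE/DELETE and commit only if it touched expected_rows rows.

    Hypothetical helper: anything else (wrong row count, SQL error) rolls back.
    """
    cur = conn.cursor()
    try:
        cur.execute(sql, params)
        if cur.rowcount != expected_rows:
            raise RuntimeError(
                f"expected {expected_rows} row(s), statement touched {cur.rowcount}"
            )
        conn.commit()
    except Exception:
        conn.rollback()  # leave the data exactly as it was
        raise
    return cur.rowcount


def demo():
    # Throwaway in-memory database with an illustrative schema.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT)")
    conn.executemany("INSERT INTO users (plan) VALUES (?)", [("free",)] * 3)
    conn.commit()

    # The query as intended: exactly one row changes, so it is committed.
    guarded_update(
        conn, "UPDATE users SET plan = ? WHERE id = ?", ("pro", 1), expected_rows=1
    )

    # The "half-selected" query: no WHERE, three rows touched, rolled back.
    try:
        guarded_update(conn, "UPDATE users SET plan = ?", ("pro",), expected_rows=1)
    except RuntimeError:
        pass

    return [row[0] for row in conn.execute("SELECT plan FROM users ORDER BY id")]
```

Calling demo() leaves only the first user on the "pro" plan, because the second statement is rolled back before it can do any damage. This doesn't replace backups or testing in another environment, it just narrows the window for the "rollback, rollback!!" moment.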
The day that rollback didn’t fix it 💀
This one actually happened to me.
We were seeing an issue on the staging environment causing the page not to load correctly and leaving the application on a useless blank page.
We found the issue (or that's what we thought) and boom! It worked on staging. We immediately deployed to production, and it was supposed to be safe, but then we got the same error.
So we did what we always do and reverted the last deploy. It makes sense, right? Well, it didn't work!!
It took us an hour to figure out the problem, which means that users were unable to use the site for an hour, and that's bad, really bad. Here's what we learned from it:
The backup plan won't always work: we learned this the hard way. The revert was our backup plan in case of fire, but it's not bulletproof. It's useful to have a backup plan for your backup plan. In our case, it was to call the DevOps team.
It's not always the code's fault: in the end, we discovered that the issue wasn't in the code that we deployed. Instead, it was a configuration in the deployment process that kept the dependencies from updating properly. So don't assume it's always the code, and try to see the whole picture instead.
Conclusions
Be careful!! Always double-check what you're doing and, if possible, ask someone else to look at it.
See something? Say something! The sooner you report the issue and everyone is aware of it, the sooner it can be solved and kept from getting worse.
Always learn as much as you can from it. I'm not saying that breaking production is a good thing nor encouraging you to do it. But if it happens, as long as we come out of it with a lesson learned or a process improved, it should be way less painful.
Have you ever broken production? Do you have any funny stories to share? Tell us!
Top comments (9)
Ooooooh, there was that time where I tried to undo `mv x y` by doing `mv y x`. That obviously did not work, and I ended up wiping an sqlite database like a pro.
Conclusion: backups are your friend. If you're doing destructive actions, do `mv x x-bak` before doing `mv x y`.
When your volume is small and your sqlite is big, however... a story for another day.
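The "back up before you move" idea from the comment above could look something like this in Python. It's a sketch under a couple of assumptions: it copies the file to a -bak name instead of renaming it (so the original is still there for the move), and it refuses to overwrite an existing destination, which plain mv would do silently. The safe_move helper and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def safe_move(src, dst):
    """Move src to dst, keeping a '<src>-bak' copy and never overwriting dst.

    Hypothetical helper; an existing '<src>-bak' file is replaced by the new copy.
    """
    src, dst = Path(src), Path(dst)
    if dst.exists():
        raise FileExistsError(f"refusing to overwrite {dst}")
    backup = src.with_name(src.name + "-bak")
    shutil.copy2(src, backup)  # copy first, so a bad move is recoverable
    shutil.move(str(src), str(dst))
    return backup


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        x, y = Path(tmp, "x"), Path(tmp, "y")
        x.write_text("precious data")

        backup = safe_move(x, y)  # x becomes y, and x-bak is kept

        # A second move onto y would clobber it, so it is refused.
        x.write_text("something else")
        try:
            safe_move(x, y)
            refused = False
        except FileExistsError:
            refused = True

        return backup.name, backup.read_text(), y.read_text(), refused
```

As the comment hints, the catch is disk space: copying a big sqlite file onto a small volume may not be possible, so treat this as a habit for files that fit twice.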
Here is one of my production-breaking stories.
We are two teams currently working on our Final Year Project, and I am responsible for deploying fresh builds and merging the team members' pull requests as well.
One day, early in the morning, a team member messaged me that there was a new PR that I had to merge, and that I had to deploy the new build to the production server.
I said, "Okay, I'm on it." That day my WiFi router was not working, so I turned on my mobile hotspot and started to upload the production build to the server, but due to data loss the files were not uploading correctly. So I tried to add the SSH key of my Ubuntu machine to FileZilla, but the known_hosts file got corrupted, I lost access to the server, and nothing was deployed.
So I asked our supervisor to reset the root password, but he said he was not able to reset it, so I had to buy a new domain and new hosting and set up everything from scratch.
Moral of the story: never access the server's files from your local machine if you do not have a stable internet connection.
Thanks for sharing! And yes, a bad internet connection can be a problem. There are also other things that we consider blockers for a deploy, for example, GitHub being down.
Fortunately, we have not implemented any CI/CD yet, so GitHub is not a problem :)
I had a project with test and production instances running on the same server, with deployments done through Ansible playbooks. One playbook copied the production database and uploaded files to the test instance. That playbook was written by an ex-coworker and had worked well for 3 years. I had never touched it since there was no need.
One day, while copying production to test, something broke really badly: database tables were missing, there were random errors, and I had no idea what had happened. It turned out that, in one of its minor versions, Ansible had added an option to the MySQL module that adds `use <database-name>` to the dumped file. Of course the copying script ran as the root database user, so instead of overwriting the test database with the production one, it overwrote the production database, because both databases were on the same server. It happened while the application was running. I also ran the script 3 times before we finally figured out what the hell had happened. Fortunately I was able to set things straight eventually, but it was bad.
Lesson 1: verify the scripts you use for deployment. If something can possibly break anything, sooner or later it will.
Lesson 2: do not be lazy, do not use root privileges unless absolutely necessary. Had that copying script been using specific database users instead of root, there would have been a connection error and no problems.
Lesson 3: keep up with changes in the tools you are using, especially Ansible, since a lot of its essential modules (like mysql_db) give no backward-compatibility guarantee.
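One way to act on Lesson 1 is to inspect a dump before importing it and stop if it switches to a database other than the one you mean to overwrite. Below is a rough Python sketch of that check. The function names and the database names are invented for the example, and the regular expression is a naive line-based scan, not a real SQL parser, so treat it as a tripwire rather than a guarantee.

```python
import re

# Matches lines such as: USE `app_production`;
USE_STATEMENT = re.compile(
    r"^\s*USE\s+`?([A-Za-z0-9_$-]+)`?\s*;", re.IGNORECASE | re.MULTILINE
)


def databases_named_in_dump(dump_sql):
    """Return the set of database names the dump switches to with USE."""
    return set(USE_STATEMENT.findall(dump_sql))


def check_dump_target(dump_sql, target_db):
    """Raise if the dump would switch to any database other than target_db."""
    unexpected = databases_named_in_dump(dump_sql) - {target_db}
    if unexpected:
        raise ValueError(
            f"dump switches to {sorted(unexpected)}, expected only {target_db!r}"
        )


def demo():
    # Hypothetical dump that silently points back at production.
    dump = "USE `app_production`;\nCREATE TABLE t (id INT);\n"
    try:
        check_dump_target(dump, "app_test")
        blocked = False
    except ValueError:
        blocked = True

    # A dump with no USE statement passes the check.
    check_dump_target("CREATE TABLE t (id INT);\n", "app_test")
    return blocked
```

Running a check like this as a step before the import would have turned the incident into a failed playbook run. It works best combined with Lesson 2, since a dedicated database user with access to the test database only would have blocked the write on its own.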
That didn't help with a support ticket we had one day on my team. It turned out he had typed a semicolon before the WHERE, terminating the statement. But that wasn't caught in code review because the WHERE was there.
We had been trying to get backup testing prioritized for years, but at least they worked when we needed them :)
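The stray-semicolon case is hard to spot by eye, but a small script can flag it before the SQL runs. Here's a minimal sketch in Python, assuming the statements can be split on semicolons. That assumption breaks for semicolons inside string literals, and the WHERE check is a plain keyword search, so this is a rough lint and not a parser. The risky_statements name is made up.

```python
import re


def risky_statements(sql):
    """Return the UPDATE/DELETE statements in sql that have no WHERE clause."""
    risky = []
    for statement in sql.split(";"):  # naive split, see the caveats above
        statement = statement.strip()
        if not statement:
            continue
        is_write = re.match(r"(?i)(update|delete)\b", statement)
        has_where = re.search(r"(?i)\bwhere\b", statement)
        if is_write and not has_where:
            risky.append(statement)
    return risky


# The typo from the story: the semicolon ends the UPDATE before the WHERE.
typo = "UPDATE users SET plan = 'pro'; WHERE id = 1;"
intended = "UPDATE users SET plan = 'pro' WHERE id = 1;"

flagged = risky_statements(typo)     # the UPDATE is reported
clean = risky_statements(intended)   # nothing is reported
```

A check like this could run in code review tooling or as a pre-execution step, where "the WHERE was there" is no longer enough to pass.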
The database thing happened to me. I don't exactly remember what the issue was, but for some reason I had to delete the database. Oh yes, I was updating postgres and somehow effed up. Thankfully, before the update, I had taken a backup of the database. But then it took me an hour to figure out how to restore the dump. I had a good scare, and it's funny when I think about it now.
Sorry about that ;) That was also the day I learned that queries are particularly fast when you don't want them to happen.