If you’re a developer on a team running an online service, the day will come when you break production. Not just the build. Not some automated tests. Not the development or staging environment. You will break the real thing that’s out there and used by customers.
Why? Because no matter how many layers of safeguards and process you apply, no matter how thoroughly your QA team (if you have one) works – you still can’t predict the future. You’ll never know the exact sequence of bits flowing from clients to servers, from datastores to business tiers to front-ends.
Move fast and break things – not only at Facebook
At Stack Exchange, we rely on a variety of safeguards to keep the Careers 2.0 site up and running, such as unit tests, integration tests, UI tests and perceptual diffs. Shipping a change to production is fully automated, but it requires all these tests to pass first. So, as a developer I can ship quickly and as often as I want (which is fun) while still being protected to some degree from destroying our live site with the click of a button.
Still, a few months back, I managed to break production. Here’s how: I checked in a database script which worked locally and on our dev environment, but then threw an error in production due to a difference in data. As I said: you just can’t predict the future.
By the way – even if a production build fails, it doesn’t necessarily mean that the entire site is broken from a user’s perspective. Let’s say we’re currently running version n and kick off the build to deploy version n+1. In case of a deployment failure, we would ideally expect a clean rollback to version n. Well, what happened in this particular case was: my change got bundled up with a few others, including a set of database migrations. And of course my db migrations failed after previous migrations, authored by someone else on my team, had already succeeded. So, the database was basically now somewhere in state “n+0.5”, while the web tier was still running version n. This inconsistency turned out to be fatal.
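To see how a batch of migrations can strand a database at “n+0.5”, here is a minimal sketch in Python with SQLite (our actual stack was different; the migration scripts here are made up for illustration). Each migration is committed independently, so when a later one fails, the earlier ones stay applied:

```python
import sqlite3

# Hypothetical migration batch: a teammate's migration succeeds,
# then mine fails (here simulated by creating the same table twice).
migrations = [
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, title TEXT)",  # succeeds
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY)",              # fails
]

conn = sqlite3.connect(":memory:")

# Apply each migration in its own transaction (no surrounding one):
applied = []
for sql in migrations:
    try:
        conn.execute(sql)
        conn.commit()
        applied.append(sql)
    except sqlite3.OperationalError as e:
        print(f"migration failed: {e}")
        break

# The schema is now "n+0.5": the first migration is committed,
# but the batch as a whole never completed.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['jobs'] — partially migrated, no clean rollback to n
```

The web tier, still running version n, now talks to a schema that is neither n nor n+1.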
We need to talk
The day you break production might be a stressful one for you personally. But the good news is, it’s also the day you’ll find out what kind of team you’re on. If you get the opportunity to fix the problem, apologize for your mistake, and (ideally) take some steps to prevent the same type of failure from happening again, you’re on a good team. Someone might still yell at you in the heat of the moment. Don’t take it personally – the consequences are potentially huge if your site is down. Your company might be losing money with every minute of downtime. News about your failure might already be spreading on social media. Pressure is on.
It’s quite a different story, though, if you find yourself reporting to eight different bosses after the incident, with fingers pointed at you all over the place. If you’re part of a post-mortem discussion, pay close attention: those can be incredibly useful and constructive, or a gigantic waste of time.
If you’re on that last team, run. Not only because chances are you didn’t enjoy getting yelled at that much. But much more importantly, in the long run you’ll find the entire team exercising risk aversion. If everyone’s scared to death of getting hassled (or fired) after breaking something, they will try to make sure that won’t happen – but at a high cost: be prepared for long sign-off meetings, senseless CYA email threads, and the blame game to already start before a single bit has been shipped. In extreme cases, it’ll take anywhere between forever and eternity to get anything done.
Well, now that being said…
Of course, you can’t expect your team to be very forgiving if you do any of the following:
- deliberately circumvent existing rules and procedures
- try to cover up your mistake
- make the same mistake several times
- refuse to take accountability
- generally signal that you don’t care that much
Ever worked with people who do that? It’s basically the dark side of “you should be allowed to break stuff”. If you find yourself in an environment where this kind of behavior is tolerated, you’d better run fast here, too – before assimilation sets in.
Won’t happen again!
In the case of my failed database migration, I ended up improving our internal migrator tool, so all database migrations are now by default wrapped in a global transaction. No failed migration scripts have brought our site down since then – yay!
So, how good is your team? If you don’t know now, don’t worry. You’ll find out.