Hacker News
by ajdecon 2314 days ago
One of my favorite recurring team conversations is the one where everyone shares stories of the outages they've caused or the systems they've broken. This conversation has eventually happened on every SRE (sysadmin/PE/devops/whatever) team I've joined, usually when a junior team member causes their first outage and is having an emotional meltdown. I remember my own meltdown of that kind, and I remember how much it helped to hear about the terrible problems my friends and mentors had caused in their turn.

The first outage where I thought I was going to get fired: I was working on a system that had a single-point-of-failure server, and through a mishap with rsync I accidentally destroyed the contents of /etc. That SPOF also had no backups. (I'm not claiming it was well-designed...) Thankfully the job that depended on that server would not kick off until morning, so my team slowly reconstructed its functions on a separate machine and swapped it in behind the scenes. I helped as much as I could while vibrating with anxiety, and my team was incredibly kind throughout. I was not in fact fired. :-)

The most recent outage I caused? Yesterday! I accidentally rebooted most of the machines in a development cluster. It's a dev system with no SLA, so on the whole I don't feel horrid, but it definitely ruined a few people's work for an hour. This morning I spent a few minutes putting in a guard rail to prevent that particular mistake from happening again...

Stay in this job long enough and everyone breaks things -- it just happens.