| Exactly. I had a team lead who started working for me a while back. We had scripts that could be run on the web farm to perform different tasks. Two of them relevant to this story were: webservers.regenerate.all.cache.files
webservers.release-prep.stop.all.services
The first one would refresh all the cached information after a marketing database update. The second would stop all the webservers. Guy's first day; I'm showing him the ropes; we push the marketing data update and set about regenerating all the cache files by manually picking the correct file from the folder of all possible files. I'm sure we can all guess what happened to make this a story remotely worth telling... Complete site outage. Completely unnecessary. Completely human error. Should we blame the guy who clicked on the file that was directly adjacent to the one he intended? Should we blame me as the guy overseeing the training? Or should we change the system so that files that we use multiple times every day and are safe/innocuous aren't right next to an E-stop/EPO button? Or maybe we should change the system so that pushing marketing data refreshes the cache files automatically? Blameless culture favors the latter actions over the former and tends to make your operation stronger and more resilient over time. The experts (and the novices) who made the mistake can speak freely about what happened and how we might prevent it, without fearing reprisal. If someone kills the site by mistake time after time, despite reasonable safeguards being in place, they should face disciplinary action. But when they make an honest mistake because we left an idling chainsaw lying around on the workbench, it makes no sense to blame them for grabbing it by mistake. |