|
It's a lot of good advice, but IME the next step is "but how do I actually do this?" A lot of the difficultly boils down to an inverse NIH syndrome: we outsource monitoring and alerting … and the systems out there are quite frankly pretty terrible. We struggle with alert routing, because alert routing should really take a function that takes alert data in and figures out what to do with it … but Pagerduty doesn't support that. Datadog (monitoring) struggles (struggles) with sane units, and IME with aliasing. DD will also alert on things that … don't match the alert criteria? (We've still not figured that one out.) “Aviate, Navigate, Communicate” definitely is a good idea, but let me know if you figure out how to teach people to communicate. Many of my coworkers lack basic Internet etiquette. (And I'm pretty sure "netiquette" died a long time ago.) The Swiss Cheese model isn't just about having layers to prevent failures. The inverse axiom is where the fun starts: the only failures you see, by definition, are the ones that go through all the holes in the cheese simultaneously. If they didn't, then by definition, a layer of swiss has stopped the outage. That means "how can this be? like n different things would have to be going wrong, all at the same time" isn't really an out in an outage: yes, by definition! This is too, of course, assuming you know what holes are in your cheese, and often, the cheese is much holier than people seem to think it is. I'm always going to hard disagree with runbooks, though. Most failures are of the "it's a bug" variety: there is no possible way to write the runbook for them. If you can write a runbook, that means you're aware of the bug: fix the bug, instead. The rest is bugs you're unaware of, and to write a runbook would thus require clairvoyance. (There are limited exceptions to this: sometimes you cannot fix the bug: e.g., if the bug lies in a vendor's software and the vendor refuses to do anything about it¹, then you're just screwed, and have to write down the next best work around, particularly if any workaround is hard to automate. There are other pressures, like PMs who don't give devs the time to fix bugs, but in general runbooks are a drag on productivity, as they're manual processes you're following in lieu of a working system. Be pragmatic about when you take them on (if you can). > Have a “Ubiquitous language” This one, this one is the real gem. I beg of you, please, do this. A solid ontology prevents bugs. This gets back to the "teach communication" problem, though. I work with devs who seem to derive pleasure from inventing new terms to describe things that already have terms. Communicating with them is a never ending game of grabbing my crystal ball and decoding WTF it is they're talking about. Also, I know the NATO alphabet (I'm not military/aviation). It is incredibly useful, and takes like 20-40 minutes of attempting to memorize it to get it. It is mind boggling that customer support reps do not learn this, given how shallow the barrier to entry is. (They could probably get away with like, 20 minutes of memorization & then learn the rest just via sink-or-swim.) (I also have what I call malicious-NATO: "C, as in sea", "Q, as in cue", "I, as in eye", "R, as in are", U, as in "you", "Y, as in why") > Don’t write code when you are tired. Yeah, don't: https://www.cdc.gov/niosh/emres/longhourstraining/impaired.h... And yet I regularly encounter orgs or people suggesting that deployments should occur well past the 0.05% BAC equivalent mark. "Unlimited PTO" … until everyone inevitably desires Christmas off and then push comes to shove. Some of this intertwines with common PM failure modes, too: I have, any number of times, been pressed for time estimates on projects where we don't have a good time estimate because there are two many unknowns in the project. (Typically because whomever is PM … really hasn't done their job in the first place of having even the foggiest understanding of what's actually involved, inevitably because the PM is non-technical. Having seen a computer is not technical.) When the work is then broken out and estimates assigned to the broken out form, the total estimate is rejected, because PMs/management don't like the number. Then inevitably a date is chosen at random by management. (And the number of times I've had a Saturday chosen is absurd, too.) And then the deadline is missed. Sometimes, projects skip right to the arbitrary deadline step, which at least cuts out some pointless debate about, yes, what you're proposing really is that complicated. That's stressful, PMs. ¹ cough Azure cough excuse me. |