by thehappyfellow 672 days ago
A common form of low risk tolerance I’ve seen is being unable to make a speed vs. quality trade-off. One example:

We were rolling out a change that carried a small risk that we’d have to manually reboot a couple of machines. The total disruption to the business would’ve been less than $10k for sure. I had to fight people who wanted to spend 3 months writing one-off tooling to lower the chance of it happening. Madness!

2 comments

Fairly common, especially with mission-critical infrastructure like databases. There's an implicit assumption, by engineers and managers alike, that 100% uptime is the gold standard and anything less is a failure. It takes a rational engineer (in the context of this discussion, usually a "senior") to point out that a) SLAs never promise 100%, b) the rest of the infrastructure that comprises the system has only a few nines of availability anyways, c) the engineering cost of getting from 99.99% to 100% is orders of magnitude higher than getting to 99.999%. In other words: senior engineers should be able to contextualize engineering work and do tradeoff analysis; they provide value not by doing more work but by skipping the expensive, low-impact work.
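To put rough numbers on points b) and c), here is a minimal sketch of the arithmetic. It assumes components fail independently and sit in series (the system is up only when all of them are up). The function names and the example availabilities are made up for illustration and are not taken from any real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(components: list[float]) -> float:
    """Availability of a chain of independent components that must all be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


if __name__ == "__main__":
    # Each extra nine cuts the downtime budget by a factor of ten.
    for nines in (0.999, 0.9999, 0.99999):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines:.5f} -> {minutes:.2f} minutes/year")

    # Hypothetical stack: a five-nines database behind a three-nines
    # network and a 99.95% application tier.
    overall = serial_availability([0.99999, 0.999, 0.9995])
    print(f"overall availability: {overall:.5f}")
```

Under these assumptions the weakest components dominate: the overall figure lands below three nines no matter how many nines the database alone has, which is why chasing 100% on one component buys very little.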
$10k in lost sales/product during the downtime or $10k + the cost of IT to stand things back up, verify, resync + cost of other departments manually fixing other adjacent things that broke?

People who don't work daily in infra tend to not understand that downtime like that can have massive ripple effects. That one server, unknown to you, might have tentacles that reach all over the company. It might generate 100 tickets that now need to be verified by various IT personnel over the next few days in addition to their likely already full workload. It might have fucked up backups, DFS, patching cadence etc etc.

Sure, the approach I advocated for can have much worse consequences in general. However, in this particular case it was ~impossible for the outage to get that costly - we operated the servers and knew the blast radius. My estimate was for the total cost.

Also, 2 engineers working for 3 months cost a ton of money, not even counting the opportunity cost of other things they could’ve been doing. Even if the potential outage cost were closer to $100k, I’d likely stick with my decision.
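The trade-off described in this thread can be sketched as a back-of-the-envelope comparison. The outage costs ($10k and $100k), the 2 engineers, and the 3 months come from the comments above. The monthly cost per engineer and the outage probability are placeholder assumptions, since the thread only says the risk was "small".

```python
def tooling_cost(engineers: int, months: int, monthly_cost_per_engineer: float) -> float:
    """Direct cost of building the tooling, ignoring opportunity cost."""
    return engineers * months * monthly_cost_per_engineer


def worth_building(p_before: float, p_after: float,
                   outage_cost: float, build_cost: float) -> bool:
    """True if the reduction in expected outage cost exceeds the build cost."""
    expected_savings = (p_before - p_after) * outage_cost
    return expected_savings > build_cost


if __name__ == "__main__":
    # ASSUMPTION: $15k per engineer-month is a made-up loaded cost.
    build = tooling_cost(engineers=2, months=3, monthly_cost_per_engineer=15_000)

    # ASSUMPTION: a 10% outage chance that the tooling would cut to zero,
    # which is generous to the tooling.
    for outage_cost in (10_000, 100_000):
        decision = worth_building(0.10, 0.0, outage_cost, build)
        print(f"outage ${outage_cost:,}: build tooling? {decision}")
```

With these placeholder inputs the tooling does not pay for itself at either outage cost. The answer flips only if the outage is far more likely or far more expensive than assumed, which is the objection raised earlier about hidden ripple effects.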