| The Google SRE book offers the following as one of the reasons not to gun for 100% reliability (emphasis added): > users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! I've had a shaky relationship with my ISP of late. What brought me to this thread today is that I couldn't push to GitHub. Notably, this isn't covered by their downtime report, so going by the available facts it's _probably_ not GitHub's fault I couldn't push; I was also just on my daily stand-up call and kept getting disconnected. But looking beyond today's available facts, odds are there's a bigger problem GH is not mentioning on their status page. They say the current incident has to do with "unauthorized users", and I wonder if pushing a commit from my IDE client counts as an operation from an "unauthorized user", as I still have to authorize with my SSH key. It's just insane that I can't decide which of GitHub or German o2 should be the more reliable service! |
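To put rough numbers on the quoted claim, here is a minimal sketch. It assumes the device and the service fail independently and that the user needs both at once, so availabilities simply multiply; that model and the name `end_to_end` are my assumptions, not something the SRE book spells out in this passage.

```python
def end_to_end(*availabilities: float) -> float:
    """Availability of components in series, assuming independent failures."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


device = 0.99  # the quote's "99% reliable smartphone"

four_nines = end_to_end(device, 0.9999)   # service at 99.99%
five_nines = end_to_end(device, 0.99999)  # service at 99.999%
gap = five_nines - four_nines

print(f"with 99.99% service:  {four_nines:.5%}")
print(f"with 99.999% service: {five_nines:.5%}")
print(f"difference:           {gap:.5%}")
```

Under that assumption the user sees roughly 98.990% versus 98.999% end to end, a gap of under 0.01 percentage points, while the device alone already costs a full percentage point.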
I think there are 3 big themes with this, though not
1. LLM tools have added considerable load.
2. LLMs used by developers to increase velocity seem to be leading to more outages. This calls the increased velocity into question.
3. Roadmaps are focused on pushing work that doesn't address reliability problems, e.g. GitHub moving to Azure, or adding AI features.
These same problems happen to orgs chasing other fads that aren't AI. Following fads is not good engineering.