by sgarland 6 hours ago
You know hardware failures can occur, right?

I worked as an SRE at a well-known monitoring company that used a similar architecture. It worked extremely well, and aside from one software SPOF (which still had a blast radius limited to that cell), we had very few large-scale production incidents compared to everywhere else I’ve worked.

Even if there was a physical hardware failure (at the time, it ran on-prem, but it’s not like AWS is immune to this), every service modulo the aforementioned SPOF had redundancy. We would have the datacenter techs replace the failed blade, and the replacement would provision itself and rejoin: zero downtime, just a temporary loss of redundancy. Even then, if we felt it necessary, we could shift customers into a different cell, though that did cause a brief outage for them, which we would coordinate ahead of time.