|
|
|
|
|
by chmod775
1 day ago
|
|
As described, if one "cell" crashes, you re-route/retry everything on the other cell. Assuming your system is deterministic and the inputs stay the same, the second cell should now break as well. This reads to me like an attempt to patch a system that's already fucked beyond belief while pretending you're doing "engineering". Fancy implementation of a retry loop attempting to minimize downtime. |
|
I worked as an SRE at a well-known monitoring company that used a similar architecture. It worked extremely well, and aside from one software SPOF (which still had a blast radius limited to that cell), we had very few large-scale production incidents compared to everywhere else I’ve worked at.
Even if there was a physical hardware failure (at the time, it ran on-prem, but it’s not like AWS is immune to this), every service modulo the aforementioned SPOF had redundancy, so we would have the datacenter techs replace that blade, which would provision itself and rejoin, zero downtime, just a temporary loss of redundancy. Even then, if we felt it necessary, we could shift customers into a different cell, though that did cause a brief outage for them, which would be coordinated ahead of time.